UMBC CMSC 491/691-I Fall 2002 |
[
Home
|
News
|
Syllabus
|
Project
]
Last updated: 19 September 2002 |
Assignment: Write a program that measures the rate at which a vocabulary grows as documents are read from a collection. Write a second program that estimates storage requirements for an inverted index of the collection.
Goal: To further explore corpus statistics and their implications for Phase I of the project.
Due Date: Tuesday, September 17, 2002.
This assignment has two parts. You can accomplish both using the program(s) you wrote for HW1 as a starting point.
First, you will measure how vocabulary grows as documents are read from the Reuters-21578 collection. To do this, make your program output the number of bytes read and the number of unique words found so far after each document is read. For example:
$ ./growth.pl 1 3054 260 (Doc, #bytes so far, #words so far) 2 3684 305 3 4207 335 4 7288 520 5 8661 584 6 10220 651 7 10850 681 8 11607 717 9 12151 739 10 13644 808 ...Use this output to make two graphs showing vocabulary growth as a function of the number of bytes and documents read.
Second, you will estimate the size of an in-memory index generated for the Reuters-21578 collection. Imagine that the lexicon is represented by an array of "lexicon-entry" structures, each pointing to a linked list of postings. Assume the following parameters:
For the moment, we will assume the lexicon array has no wasted space (as it would were it a hash table), and ignore storage for the mapping of document identifiers to document locations.
Write a program to count unique words in the collection, as well as the number of documents each word occurs in. These parameters will allow you to estimate the in-memory indexing requirements under the assumptions above. Does linuxserver1 (or your home PC) have enough memory to index Reuters-21578? What about a larger collection?
You will turn in a listing of your program(s), your graphs, and your estimate of the memory needed to index Reuters-21578. Everything should be handed in as hard-copy. Please make sure your name is on every page and that everything is stapled securely together.
Homework is due at the beginning of class. No late homework will be accepted.
Feel free to write one large program, several small programs, and to extend the programs in your previous homework.