UMBC CMSC 491/691-I Fall 2002
Last updated: 24 September 2002
Assignment: Write a program (possibly based on your solution to Homework 2) that computes the size of the uncompressed and compressed indices for the umbc-crawl collection.
Goal: To understand the mechanics and implications of index compression for Phase I of the project.
Due Date: Tuesday, October 1, 2002.
First, adapt your Homework 2 solution to compute the size of an uncompressed index of the umbc-crawl collection. Use the same space assumptions that you made in Homework 2. You will need to re-examine many of the assumptions about the quality and layout of the collection that you were able to make for Reuters-21578.
Because umbc-crawl differs from Reuters-21578 in these respects, your program will need to report the number of documents and the number of unique words it finds.
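As a rough illustration of the counting step, the sketch below builds an in-memory postings map, reports the number of documents and unique words, and estimates the uncompressed size. The tokenizer (lowercase runs of letters and digits) and the space assumptions (4 bytes per document number, 4 bytes per lexicon pointer, one byte per character of each term) are hypothetical placeholders; substitute whatever you assumed in Homework 2.

```python
import re
from collections import defaultdict

# Assumption: a "word" is a lowercase run of ASCII letters or digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def build_postings(docs):
    """Map each term to the sorted list of document numbers containing it.

    docs is an iterable of (doc_id, text) pairs.
    """
    postings = defaultdict(set)
    for doc_id, text in docs:
        for term in TOKEN_RE.findall(text.lower()):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def uncompressed_size(postings, bytes_per_doc_id=4, bytes_per_pointer=4):
    """Return (lexicon_bytes, inverted_file_bytes) under the stated assumptions."""
    # Lexicon: the characters of each term plus one pointer into the inverted file.
    lexicon = sum(len(term.encode("utf-8")) + bytes_per_pointer for term in postings)
    # Inverted file: one fixed-width document number per posting.
    inverted = sum(len(ids) for ids in postings.values()) * bytes_per_doc_id
    return lexicon, inverted


# Tiny stand-in collection; the real input would be the umbc-crawl files.
docs = [
    (1, "Index compression saves space"),
    (2, "compression of the index"),
    (3, "space"),
]
postings = build_postings(docs)
lexicon_bytes, inverted_bytes = uncompressed_size(postings)

print("documents:", len(docs))
print("unique words:", len(postings))
print("lexicon bytes:", lexicon_bytes)
print("inverted file bytes:", inverted_bytes)
```

Keeping the lexicon and inverted-file totals separate makes it easy to see later which part of the index a compression scheme affects.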
There are several files at the top of /data/nicholas2/ian/umbc-crawl which you may find helpful:
README: a description of the crawl mechanism.
crawl-files: a listing of all the files in the collection, one per line.
types: a count of the file types in the collection.
types-mime: a count of the MIME types of the files in the collection.
(Both types and types-mime are as reported by file(1).)
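One possible way to use the crawl-files listing is to read it line by line and decide, per file, whether it is worth indexing. The sketch below assumes only what is stated above (one path per line); the helper names and the "no NUL byte in the first kilobyte" test for text are hypothetical heuristics, not part of the assignment. The demonstration runs on temporary files so that it does not depend on the course directory.

```python
import os
import tempfile


def read_listing(listing_path):
    """Return the non-empty lines of a one-path-per-line listing file."""
    with open(listing_path, encoding="utf-8", errors="replace") as f:
        return [line.strip() for line in f if line.strip()]


def looks_like_text(path, sample_bytes=1024):
    """Crude heuristic (an assumption): text files contain no NUL bytes early on."""
    try:
        with open(path, "rb") as f:
            return b"\0" not in f.read(sample_bytes)
    except OSError:
        # Unreadable or missing files are simply skipped.
        return False


# Demonstration on a throwaway directory standing in for the collection.
with tempfile.TemporaryDirectory() as root:
    text_path = os.path.join(root, "page.html")
    binary_path = os.path.join(root, "image.gif")
    missing_path = os.path.join(root, "gone.txt")
    with open(text_path, "w", encoding="utf-8") as f:
        f.write("<html>hello</html>")
    with open(binary_path, "wb") as f:
        f.write(b"GIF89a\0\0\0")

    listing_path = os.path.join(root, "crawl-files")
    with open(listing_path, "w", encoding="utf-8") as f:
        f.write("\n".join([text_path, binary_path, missing_path]) + "\n")

    listed = read_listing(listing_path)
    indexable = [p for p in listed if looks_like_text(p)]
    indexable_names = [os.path.basename(p) for p in indexable]

print("files listed:", len(listed))
print("files kept for indexing:", indexable_names)
```

Whatever filter you choose, the number of documents you report should be the number of files that pass it, not the length of the listing.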
After you have done the above, modify your program to compute the size of the index as if it were compressed. Do this for two index compression schemes:
Counting offsets is optional, but recommended if you're planning on storing word offsets in your project!
Note that the size of the lexicon will not change, only the size of the inverted file.
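To make the point about the inverted file concrete, the sketch below estimates the compressed size of a postings map by storing the gaps between successive document numbers with Elias gamma codes. Gamma coding is used here purely as an illustrative assumption and may not be one of the schemes intended for this assignment; the sample postings, the convention that document numbers start at 1, and the 32-bit uncompressed width are likewise hypothetical.

```python
def gamma_bits(x):
    """Length in bits of the Elias gamma code for an integer x >= 1.

    The code is floor(log2 x) zero bits followed by the binary form of x,
    for a total of 2 * floor(log2 x) + 1 bits.
    """
    if x < 1:
        raise ValueError("gamma codes are defined for integers >= 1")
    return 2 * x.bit_length() - 1


def gaps(doc_ids):
    """Yield differences between successive document numbers (first gap from 0)."""
    previous = 0
    for doc_id in doc_ids:
        yield doc_id - previous
        previous = doc_id


def compressed_inverted_bits(postings):
    """Total bits needed to gamma-code every gap in every postings list."""
    return sum(gamma_bits(g) for ids in postings.values() for g in gaps(ids))


# Hypothetical postings: term -> sorted document numbers, numbered from 1.
postings = {
    "index": [1, 2],
    "space": [1, 3],
    "crawl": [5, 6, 130],
}

uncompressed_bits = 32 * sum(len(ids) for ids in postings.values())
compressed_bits = compressed_inverted_bits(postings)

print("uncompressed inverted file bits:", uncompressed_bits)
print("gamma-coded inverted file bits:", compressed_bits)
```

Only the postings lists are touched by this calculation; the lexicon estimate from the uncompressed case carries over unchanged.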
You will turn in a HARD-COPY listing of your program(s) and your output giving the estimate of the space needed for these umbc-crawl indices. Don't forget to report the number of unique words and documents, since these may vary from person to person depending on which files you index. Please make sure your name is on every page and that everything is stapled securely together.
Homework is due at the beginning of class. No late homework will be accepted.