UMBC CMSC 491/691-I Fall 2002
Last updated: 26 November 2002
In Phase I of your project, your program will read a set of documents, parse them into individual documents and terms, and produce an inverted index and associated data structures, which will be stored on disk and used in Phase II.
This phase will have two major components: the lexical analyzer and the inverter. For the lexer, you might choose to use code you wrote for Homework 1. You will need to state explicitly the assumptions you make about the structure of the documents and the words in them. You might choose to make your lexer configurable at run time; the configuration file would then specify how to segment terms, what tag indicates the start of a new document, how to treat numbers, etc.
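As an illustration of what a run-time configurable lexer could look like, here is a minimal Python sketch. Everything in it is an assumption made for the example, not a requirement of the assignment: the `LexerConfig` field names, the `<DOC>` start tag, the token pattern, and the decision to strip anything that looks like a markup tag before segmenting terms. In practice the configuration values would be read from a file rather than hard-coded.

```python
import re
from dataclasses import dataclass


@dataclass
class LexerConfig:
    # All field names and defaults are hypothetical, chosen for illustration.
    doc_start_tag: str = "<DOC>"          # tag that marks the start of a new document
    token_pattern: str = r"[A-Za-z0-9]+"  # regular expression that defines a term
    keep_numbers: bool = False            # whether purely numeric tokens are indexed
    lowercase: bool = True                # whether terms are case-folded


def split_documents(text, config):
    """Split raw text into one string per document at each start tag."""
    parts = text.split(config.doc_start_tag)
    # Anything before the first start tag belongs to no document.
    return [p for p in parts[1:] if p.strip()]


def tokenize(doc_text, config):
    """Turn one document's text into a list of terms."""
    # Assumption: markup tags are not indexed, so remove them first.
    plain = re.sub(r"<[^>]+>", " ", doc_text)
    tokens = re.findall(config.token_pattern, plain)
    if config.lowercase:
        tokens = [t.lower() for t in tokens]
    if not config.keep_numbers:
        tokens = [t for t in tokens if not t.isdigit()]
    return tokens


sample = "<DOC> Hello world 42 </DOC> <DOC> IR systems, 2002! </DOC>"
config = LexerConfig()
docs = split_documents(sample, config)
terms_per_doc = [tokenize(d, config) for d in docs]
```

Keeping the choices in one configuration object also gives you a natural place to document your assumptions about document structure and term selection.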
For the part of your program that produces the inverted file, you will want to think carefully through your choice of data structures and algorithms. Use the material from the textbooks and the readings distributed in lecture for this. You may choose to use the common libraries available on the UNIX systems; if you do, make sure you say so in your documentation!
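To make the inverter's job concrete, the following Python sketch builds an inverted index entirely in memory. It is only one possible starting point and rests on several assumptions: documents arrive as lists of terms, a document's ID is its position in the input list, each posting records a document ID and a term frequency, and the whole index fits in memory. The function name `invert` is hypothetical. A large collection would likely need one of the disk-based methods from the readings instead.

```python
from collections import defaultdict


def invert(docs):
    """Build {term: [(doc_id, term_frequency), ...]} from a list of token lists.

    Postings for each term come out in increasing doc_id order because
    documents are processed in order.
    """
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs):
        # Count occurrences of each term within this document.
        counts = {}
        for term in tokens:
            counts[term] = counts.get(term, 0) + 1
        # Append one posting per distinct term.
        for term, tf in counts.items():
            index[term].append((doc_id, tf))
    return dict(index)


example_index = invert([["a", "b", "a"], ["b", "c"]])
```

The length of each postings list is the term's document frequency, which is the kind of metadata the lexicon typically carries.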
Your program will need to save several data structures to disk. At a minimum, these will include the lexicon (the table of words occurring in the text, appropriate metadata, and pointers into the inverted file), the document list (a table indicating where to find the files on disk at retrieval time), and the inverted file itself. You do not need to store the documents locally unless you wish to.
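The three structures described above could be laid out on disk in many ways. The Python sketch below shows one simple layout, offered as an assumption-laden example rather than a recommended design: the inverted file is a binary file of fixed-width (document ID, term frequency) pairs, the lexicon is a JSON file mapping each term to its document frequency and its byte offset in the inverted file, and the document list is a JSON array of file paths. The file extensions and the names `write_index` and `read_postings` are hypothetical.

```python
import json
import struct

# One posting: (doc_id, term_frequency) as two little-endian unsigned 32-bit ints.
POSTING = struct.Struct("<II")


def write_index(index, doc_paths, prefix):
    """Write the inverted file, lexicon, and document list under a common prefix."""
    lexicon = {}
    with open(prefix + ".inv", "wb") as inv:
        for term in sorted(index):
            postings = index[term]
            # The lexicon entry points at where this term's postings begin.
            lexicon[term] = {"df": len(postings), "offset": inv.tell()}
            for doc_id, tf in postings:
                inv.write(POSTING.pack(doc_id, tf))
    with open(prefix + ".lex", "w", encoding="utf-8") as f:
        json.dump(lexicon, f)
    with open(prefix + ".docs", "w", encoding="utf-8") as f:
        json.dump(doc_paths, f)


def read_postings(prefix, term):
    """Look a term up in the lexicon and read only its postings from disk."""
    with open(prefix + ".lex", encoding="utf-8") as f:
        lexicon = json.load(f)
    entry = lexicon.get(term)
    if entry is None:
        return []
    with open(prefix + ".inv", "rb") as inv:
        inv.seek(entry["offset"])
        data = inv.read(entry["df"] * POSTING.size)
    return [POSTING.unpack_from(data, i * POSTING.size) for i in range(entry["df"])]
```

Because each lexicon entry stores an offset and a count, retrieval in Phase II can seek directly to a term's postings without scanning the whole inverted file. A real system would keep the lexicon loaded rather than rereading it on every lookup.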
Here are some resources you might find useful in Phase I:
Porter's Snowball Project has several stemmers in other languages. Snowball is a language for writing stemmers.
Which collection are you using for your project?
Report the following collection statistics. If you compute these separately from your indexing process, make sure that both use the same document segmentation, term selection, etc.
Index the collection, and report: