UMBC CMSC 491/691-I Fall 2002
Last updated: 26 November 2002
For Phase II, you will write a program that accepts queries from the user and searches for documents using the data structures produced in Phase I. You will choose a retrieval model from those discussed in class (e.g., Boolean, vector space, probabilistic) and implement the inverted search algorithm, using that model to rank the documents.
Your search interface must allow the user to:
You may implement any model you choose; I only ask that you plan your approach with an eye toward effective retrieval. If you choose a Boolean approach, consider how you will rank the result set that satisfies the query expression. For a vector space model, consider carefully which weighting function to use. For probabilistic and some vector space models, there are tuning parameters that need to be set for your collection. Test several queries of your own to get a sense of how well your algorithm is performing. Feel free to refer to the papers cited in class or in the reading for tips.
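As one illustration of the vector space option, the sketch below accumulates tf-idf scores term by term over an inverted index. The index layout (term mapped to a dictionary of document identifier and term frequency), the toy data, and the log-based weighting are all assumptions made for the example; your Phase I structures and your chosen weighting function will likely differ, and the sketch omits document length normalization.

```python
import math
from collections import defaultdict

# Hypothetical toy inverted index: term -> {doc_id: term frequency}.
# The real Phase I data structures will differ; this shape is an assumption.
index = {
    "retrieval": {"d1": 3, "d2": 1},
    "boolean": {"d2": 2},
    "vector": {"d1": 1, "d3": 4},
}
num_docs = 3  # assumed collection size


def rank(query_terms, index, num_docs):
    """Accumulate tf-idf scores term by term; return (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term)
        if not postings:
            continue  # the term does not occur in the collection
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            # Log-scaled term frequency times inverse document frequency.
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Sort by descending score, breaking ties by document identifier.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


results = rank(["retrieval", "vector"], index, num_docs)
for doc_id, score in results:
    print(doc_id, round(score, 4))
```

Only the postings lists of the query terms are touched, which is the main reason to search through an inverted index rather than scanning every document.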
Take a query of at least five words interactively from the user, retrieve the top 100 documents from the collection, show the top 20 document identifiers to the user with their scores, and let the user choose one to display. Report the time needed for the entire interaction, from the moment the user enters the query until a full document is displayed on the screen.
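The retrieve-100, show-20 step might be sketched as follows. The function names `top_k` and `format_hits`, the score dictionary, and the document identifiers are hypothetical; the scores would come from whatever ranking model you implement.

```python
import heapq


def top_k(scores, k=100):
    """Return the k highest-scoring (doc_id, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda pair: pair[1])


def format_hits(ranked, shown=20):
    """Build numbered display lines for the first `shown` hits."""
    return [
        f"{i:2d}. {doc_id}  {score:.4f}"
        for i, (doc_id, score) in enumerate(ranked[:shown], start=1)
    ]


# Hypothetical scores for 250 documents, standing in for real ranking output.
scores = {f"DOC-{n:04d}": 1.0 / (n + 1) for n in range(250)}

ranked = top_k(scores)       # top 100 kept for later use
lines = format_hits(ranked)  # top 20 shown to the user
print("\n".join(lines))
```

Using `heapq.nlargest` avoids sorting every scored document when only the top 100 are needed; for small result sets a plain sort would work just as well.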
Choose a topic from the topics file for your chosen collection. From that topic, create one query for each of (a) the title section, (b) the description section, and (c) the narrative section of the topic. Your queries should be as complete as possible given the information in that topic section and your query language (Boolean operators, +/- clauses, range operators, etc.). Record the time required to rank 100 documents for each query, using either a stopwatch or timing functions within your code.
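If you time the ranking inside your code rather than with a stopwatch, a small wrapper along these lines may help. The `fake_rank` function and the three query strings are placeholders invented for the example; substitute your own ranking function and the queries you build from the topic.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_rank(query):
    """Placeholder ranking function; returns 100 made-up (doc_id, score) pairs."""
    return [(f"DOC-{n}", 1.0 / (n + 1)) for n in range(100)]


# Hypothetical queries, one per topic section.
queries = {
    "title": "example title query",
    "description": "example description query with more terms",
    "narrative": "example narrative query with the most terms of all",
}

timings = {}
for section, query in queries.items():
    ranked, seconds = timed(fake_rank, query)
    timings[section] = seconds
    print(f"{section}: ranked {len(ranked)} documents in {seconds:.6f} s")
```

`time.perf_counter` is used here because it is intended for measuring short intervals; wall-clock functions such as `time.time` can jump if the system clock is adjusted.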
Use your program to automatically index the topics for your collection, or create queries by hand if you wish. Your system should rank the top 100 documents for each query and collect them into a "TREC top results file", the format of which is described in the handout on trec_eval from class as well as in the file trec_eval.README in the data directory. Then run the trec_eval program on your "top results" file and the qrels file for your collection to produce evaluation measurements for your results.
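A results file could be produced roughly as shown below. The sketch assumes the commonly used six-column layout (query id, the literal `Q0`, document id, rank, score, run tag) and ranks starting at 1; confirm both against the handout and trec_eval.README before relying on it. The topic numbers, document identifiers, and run tag are made up.

```python
import io


def write_run(out, runs, run_tag, limit=100):
    """Write one line per ranked document to the file-like object `out`.

    `runs` maps a query id to a best-first list of (doc_id, score) pairs.
    Assumed line layout: query_id Q0 doc_id rank score run_tag
    """
    for query_id in sorted(runs):
        for rank, (doc_id, score) in enumerate(runs[query_id][:limit], start=1):
            out.write(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}\n")


# Hypothetical ranked results for two topics.
runs = {
    "302": [("DOC-C", 1.75)],
    "301": [("DOC-A", 2.5), ("DOC-B", 1.25)],
}

buffer = io.StringIO()  # stands in for an open results file
write_run(buffer, runs, "myrun")
print(buffer.getvalue(), end="")
```

To write to disk instead, pass a file opened for writing, for example `with open(path, "w") as out: write_run(out, runs, "myrun")`, where `path` is wherever you keep your results file.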
Submit this file with the name "phase2.results.ian", replacing "ian" with your username, using the Blackboard "digital drop box" facility.