UMBC CMSC 491/691-I Spring '01
Last updated: 7 Mar 2001
Week | Topic | Readings | Assignment |
---|---|---|---|
1: 1/29 | Introduction to IR, course overview; user- vs. system-centered IR | Ch. 1 and papers | HW 1 out (due week 2) |
2: 2/5 | Text processing | Sec. 7.1-2 and papers | Project Phase I out |
3: 2/12 | Inverted indices | Sec. 8.1-2 and papers | |
4: 2/19 | Compression of indices and text | Rest of Ch. 7 and papers | |
5: 2/26 | IR Models: Boolean, Vector space | Ch. 2 and papers | |
6: 3/5 | IR Models: Vector Space (continued), Probabilistic | Ch. 2 and papers | Project Phase II out |
7: 3/12 | Probabilistic (cont.), Inference nets | | |
8: 3/19 | SPRING BREAK | | |
9: 3/26 | Evaluation, Test Collections | Ch. 3, TREC web site | HW 2 out, Project Phase III out |
10: 4/2 | Relevance feedback | Ch. 5, papers | |
11: 4/9 | Software Agents and IR | Papers | |
12: 4/16 | Information filtering; collaborative filtering | Papers | |
13: 4/23 | Web Search | Ch. 13, papers | |
14: 4/30 | User interfaces for IR | Ch. 10, papers | |
15: 5/7 | Presentations | | Project due |
16: 5/14 | Presentations | | |
The primary text for the course is Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. This is a new text that covers a lot of ground and is pretty up-to-date. To fill in the details, there will also be a number of papers assigned; they will be made available either from this web page (accessible from UMBC only) or handed out in class. These papers are either seminal reading on the topic, or cutting-edge results from recent conferences and journals. They are required reading for graduate students, and recommended reading for undergraduates.
Homework assignments will be few, and will mostly ask you to write a short program to demonstrate something discussed in class, run it on some sample data and look at the results. Homeworks require an afternoon or less to complete.
Homework 1 is to generate some corpus statistics. The solutions are here.
Homework 2 is to evaluate the retrieval effectiveness of two web search engines.
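If it helps to see what "retrieval effectiveness" can mean concretely, here is a toy precision-at-k computation in Python. This is a sketch for intuition only; the assignment handout defines the actual methodology and measures.

```python
# Toy precision@k: the fraction of the top-k results judged relevant.
# Illustration only; the HW 2 handout defines the real methodology.
def precision_at_k(ranked_ids, relevant_ids, k=10):
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / len(top)

# Five results, of which the docs with ids 1 and 5 are relevant:
print(precision_at_k([3, 1, 4, 2, 5], {1, 5}, k=5))  # 0.4
```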
In-depth information on the course project is here. A brief description follows.
For the course project you will design and implement your own information retrieval system. The project will have three phases. In Phase I, you will build the indexing component, which will take a large collection of text and produce a searchable, persistent data structure. In Phase II, you will add the searching component, according to one of the models discussed in class. In Phase III, you will add some advanced functionality of your choice. The project is due at week 15.
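To make the two components concrete, here is a minimal in-memory sketch in Python of the indexing and Boolean-search ideas. It is for intuition only, not the project specification: the real Phase I index must be persistent and handle a large collection, and Phase II may use a different retrieval model.

```python
# Minimal sketch of an inverted index plus Boolean AND search.
# Not the project spec: a real index must be persistent and scalable.
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_and(index, query):
    """Boolean AND: return doc ids containing every query term."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["the cat sat", "the dog sat", "a cat and a dog"]
idx = build_index(docs)
print(search_and(idx, "cat sat"))  # [0]
```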
Phases I and II are required. They will consist of a small set of milestones, which you should tackle in order. To encourage you not to put off the project until the last week of class (and who would do that?), there will be periodic benchmarks where you can test your system against everyone else's on some standard data sets. A detailed project specification will be given out at each phase.
Phase III is required for graduate students and "encouraged," but not strictly required, for undergraduates. (You'll see why once you get started: Phases I and II are a lot of work.)
You will be graded on what you complete, given that Phases I and II are required.
Project (each phase in equal proportion) | 70% |
Homework | 20% |
Class Participation | 10% |
The project is due on May 9th (the second session of the second-to-last week of class). That doesn't leave much time for lateness; if you're planning to take an incomplete, please see me before then, or I may not allow it.
Homework is generally due on the Monday after it is assigned. Late homework will not be accepted without prior arrangement. If you know you will miss a class, check the web page to make sure nothing is due. If you are unexpectedly absent, email me.

Week 1

The Seven Ages of Information Retrieval, by Michael Lesk (1995). A brief, systems-oriented history of IR.

As We May Think, by Vannevar Bush (1945). A visionary article that makes a lot of interesting predictions.

Information Retrieval, 2nd Edition, Chapter 1, by Keith van Rijsbergen (1979). Along with Salton's 1983 book, this was the textbook on IR for a long time. Dr. van Rijsbergen has made it accessible on the web since it's out of print. Chapter 1 gives an interesting counterpoint to our textbook's introduction.
Week 2

Chapters 7 and 8 from Information Retrieval: Data Structures and Algorithms, by W. Frakes and R. Baeza-Yates, were handed out in class to supplement the material in chapters 7 and 8 of our textbook. The link above is the book homepage, which has all the source code from the book. For example, you can download Frakes' DFA generator from there and use it in your project.

"An algorithm for suffix stripping", by M. F. Porter, is the original paper on the eponymous Porter Stemmer. You can implement the stemmer from his paper, if you like, or you can download some code (in C, Java, Perl, and BCPL!) from Porter's own page. I will make a version available that fixes several known idiosyncrasies. A small taste of the algorithm is sketched below.
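As promised above, here is a tiny Python sketch of the flavor of Porter's approach. It covers only the Step 1a plural rules from his paper; the full stemmer has many more steps and conditions, so use Porter's own code for real work.

```python
# Step 1a of the Porter stemmer only (the full algorithm has many
# more steps); a sketch of suffix stripping, not a usable stemmer.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

print(porter_step_1a("caresses"), porter_step_1a("ponies"), porter_step_1a("cats"))
```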
Week 3

Chapter 3 from Frakes and Baeza-Yates. This chapter, by Harman et al., discusses B-trees, inverted index construction, and a sophisticated inversion algorithm.

Ternary Search Trees (Dr. Dobb's Journal, April 1998), by Bentley and Sedgewick, is a very nice article on ternary digital search trees with code examples. It has a link to their homepage, where you can find the longer symposium paper, which will be of interest to graduate students. TSTs are a compromise between (n-ary) tries and binary search trees; a small sketch follows below.

"Prefix B-Trees", by Bayer and Unterauer.
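As a companion to the Bentley and Sedgewick article, here is a minimal Python sketch of TST insertion and lookup. Their article gives the real treatment (in C, with space and performance analysis); this just shows the three-way branching that makes a TST sit between a trie and a binary search tree.

```python
class TSTNode:
    """Node in a ternary search tree: one character, three children."""
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False

def tst_insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:                  # branch left on smaller character
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:                # branch right on larger character
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):           # match: advance to next character
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True           # word ends at this node
    return node

def tst_search(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_search(node.lo, word, i)
    if ch > node.ch:
        return tst_search(node.hi, word, i)
    if i + 1 < len(word):
        return tst_search(node.eq, word, i + 1)
    return node.is_word

root = None
for w in ["cat", "cap", "car", "dog"]:
    root = tst_insert(root, w)
print(tst_search(root, "cap"), tst_search(root, "ca"))  # True False
```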
Week 4

A section from Managing Gigabytes, Chapter 3, on Huffman coding and computing canonical Huffman codes.
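For intuition, here is a small Python sketch of the canonical code-assignment step, assuming the standard construction (symbols sorted by code length, then handed consecutive codes within each length). See the chapter for where the code lengths come from and why canonical codes make decoding fast.

```python
# Sketch of canonical Huffman code assignment: given code lengths
# (from an ordinary Huffman tree), assign codes so that codes of equal
# length are consecutive integers.  Standard construction, for intuition.
def canonical_codes(lengths):
    """lengths: dict mapping symbol -> code length in bits."""
    codes, code, prev_len = {}, 0, None
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if prev_len is not None:
            code = (code + 1) << (length - prev_len)  # next code, widened
        codes[sym] = format(code, "0{}b".format(length))
        prev_len = length
    return codes

print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```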
Week 5

"Ranking Algorithms" (first half), by Donna Harman, is chapter 14 in the Frakes and Baeza-Yates book. It gives a good discussion of the tradeoffs and choices among different term-weighting strategies.
Week 6

Please read Okapi at TREC-3, by Robertson et al. This presents the "BM25" weighting scheme, which is currently the best-performing "classical" probabilistic ranking algorithm. Their TREC-5 paper gives some later perspective on the parameter settings. These papers give concrete performance numbers for the "2-Poisson" paper handed out in class.
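For reference, here is the commonly cited simplified form of the BM25 term weight in Python. The parameter values k1 = 1.2 and b = 0.75 are conventional defaults from the later literature, not necessarily the TREC-3 settings; read the paper for the full formula and the query-side weighting.

```python
import math

# Commonly cited simplified BM25 term weight.  k1 and b defaults are
# conventional values from the literature, not the TREC-3 settings.
def bm25_weight(tf, df, N, dl, avdl, k1=1.2, b=0.75):
    """tf: term freq in doc; df: doc freq; N: total docs;
    dl: doc length; avdl: average doc length."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    K = k1 * ((1 - b) + b * dl / avdl)      # length-normalized saturation
    return idf * (tf * (k1 + 1)) / (tf + K)

# A term with tf=2, in 100 of 10,000 docs, doc 1.5x average length:
print(round(bm25_weight(2, 100, 10000, 300, 200), 2))  # ~5.53
```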