UMBC CMSC 491/691-I Spring '01
Last updated: 7 Mar 2001
Week | Topic | Readings | Assignment |
---|---|---|---|
1: 1/29 | Introduction to IR, course overview; user- vs. system-centered IR | Ch. 1 and papers | HW 1 out (due week 2) |
2: 2/5 | Text processing | Sec. 7.1-2 and papers | Project Phase I out |
3: 2/12 | Inverted indices | Sec. 8.1-2 and papers | |
4: 2/19 | Compression of indices and text | Rest of Ch. 7 and papers | |
5: 2/26 | IR Models: Boolean, Vector space | Ch. 2 and papers | |
6: 3/5 | IR Models: Vector Space (continued), Probabilistic | Ch. 2 and papers | Project Phase II out |
7: 3/12 | Probabilistic (cont.), Inference nets | | |
8: 3/19 | SPRING BREAK | | |
9: 3/26 | Evaluation, Test Collections | Ch. 3, TREC web site | HW 2 out, Project Phase III out |
10: 4/2 | Relevance feedback | Ch. 5, papers | |
11: 4/9 | Software Agents and IR | Papers | |
12: 4/16 | Information filtering; collaborative filtering | Papers | |
13: 4/23 | Web Search | Ch. 13, papers | |
14: 4/30 | User interfaces for IR | Ch. 10, papers | |
15: 5/7 | Presentations | | Project due |
16: 5/14 | Presentations | | |
The primary text for the course is Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. This is a new text that covers a lot of ground and is pretty up-to-date. To fill in the details, there will also be a number of papers assigned; they will be made available either from this web page (accessible from UMBC only) or handed out in class. These papers are either seminal reading on the topic, or cutting-edge results from recent conferences and journals. They are required reading for graduate students, and recommended reading for undergraduates.
Homework assignments will be few, and will mostly ask you to write a short program to demonstrate something discussed in class, run it on some sample data and look at the results. Homeworks require an afternoon or less to complete.
Homework 1 is to generate some corpus statistics. The solutions are here.
Homework 2 is to evaluate the retrieval effectiveness of two web search engines.
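If it helps to see what "retrieval effectiveness" can mean concretely, here is a toy precision-at-k computation in Python. This is a sketch for intuition only; the assignment handout defines the actual methodology and measures.

```python
# Toy precision@k: the fraction of the top-k results judged relevant.
# Illustration only; the HW 2 handout defines the real methodology.
def precision_at_k(ranked_ids, relevant_ids, k=10):
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / len(top)

# Five results, of which the docs with ids 1 and 5 are relevant:
print(precision_at_k([3, 1, 4, 2, 5], {1, 5}, k=5))  # 0.4
```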
In-depth information on the course project is here. A brief description follows.
For the course project you will design and implement your own information retrieval system. The project will have three phases. In Phase I, you will build the indexing component, which will take a large collection of text and produce a searchable, persistent data structure. In Phase II, you will add the searching component, according to one of the models discussed in class. In Phase III, you will add some advanced functionality of your choice. The project is due at week 15.
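To make the two components concrete, here is a minimal in-memory sketch in Python of the indexing and Boolean-search ideas. It is for intuition only, not the project specification: the real Phase I index must be persistent and handle a large collection, and Phase II may use a different retrieval model.

```python
# Minimal sketch of an inverted index plus Boolean AND search.
# Not the project spec: a real index must be persistent and scalable.
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_and(index, query):
    """Boolean AND: return doc ids containing every query term."""
    postings = [set(index.get(t, [])) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = ["the cat sat", "the dog sat", "a cat and a dog"]
idx = build_index(docs)
print(search_and(idx, "cat sat"))  # [0]
```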
Phases I and II are required. They will consist of a small set of milestones, which you should tackle in order. To encourage you not to put off the project until the last week of class (and who would do that?), there will be periodic benchmarks where you can test your system against everyone else's on some standard data sets. A detailed project specification will be given out at each phase.
Phase III is required for graduate students and "encouraged," but not strictly required, for undergraduates. (You'll see why once you get started: Phases I and II are a lot of work.)
You will be graded on what you complete, given that Phases I and II are required.
Project (each phase in equal proportion) | 70% |
Homework | 20% |
Class Participation | 10% |
The project is due on May 9th (the second session of the second-to-last week of class). That doesn't leave much time for lateness; if you're planning to take an incomplete, please see me before then, or I may not allow it.
Homework is generally due on the Monday after it is assigned. Late homework will not be accepted without prior arrangement. If you know you will miss a class, check the web page to make sure nothing is due. If you are unexpectedly absent, email me.

Week 1

The Seven Ages of Information Retrieval, by Michael Lesk (1995). A brief, systems-oriented history of IR.

As We May Think, by Vannevar Bush (1945). A visionary article that makes a lot of interesting predictions.

Information Retrieval, 2nd Edition, Chapter 1, by Keith van Rijsbergen (1979). Along with Salton's 1983 book, this was the textbook on IR for a long time. Dr. van Rijsbergen has made it accessible on the web since it's out of print. Chapter 1 gives an interesting counterpoint to our textbook's introduction.
Week 2

Chapters 7 and 8 from Information Retrieval: Data Structures and Algorithms, by W. Frakes and R. Baeza-Yates, were handed out in class to supplement the material in chapters 7 and 8 of our textbook. The link above is the book homepage, which has all the source code from the book. For example, you can download Frakes' DFA generator from there and use it in your project.

"An algorithm for suffix stripping", by M. F. Porter, is the original paper on the eponymous Porter Stemmer. You can implement the stemmer from his paper, if you like, or you can download some code (in C, Java, Perl, and BCPL!) from Porter's own page. I will make a version available that fixes several known idiosyncrasies. A small taste of the algorithm is sketched below.
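As promised above, here is a tiny Python sketch of the flavor of Porter's approach. It covers only the Step 1a plural rules from his paper; the full stemmer has many more steps and conditions, so use Porter's own code for real work.

```python
# Step 1a of the Porter stemmer only (the full algorithm has many
# more steps); a sketch of suffix stripping, not a usable stemmer.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

print(porter_step_1a("caresses"), porter_step_1a("ponies"), porter_step_1a("cats"))
```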
Week 3

Chapter 3 from Frakes and Baeza-Yates. This chapter, by Harman et al., discusses B-trees, inverted index construction, and a sophisticated inversion algorithm.

Ternary Search Trees (Dr. Dobb's Journal, April 1998), by Bentley and Sedgewick, is a very nice article on ternary digital search trees with code examples. It has a link to their homepage, where you can find the longer symposium paper, which will be of interest to graduate students. TSTs are a compromise between (n-ary) tries and binary search trees; a small sketch follows below.

"Prefix B-Trees", by Bayer and Unterauer.
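As a companion to the Bentley and Sedgewick article, here is a minimal Python sketch of TST insertion and lookup. Their article gives the real treatment (in C, with space and performance analysis); this just shows the three-way branching that makes a TST sit between a trie and a binary search tree.

```python
class TSTNode:
    """Node in a ternary search tree: one character, three children."""
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False

def tst_insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:                  # branch left on smaller character
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:                # branch right on larger character
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):           # match: advance to next character
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True           # word ends at this node
    return node

def tst_search(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_search(node.lo, word, i)
    if ch > node.ch:
        return tst_search(node.hi, word, i)
    if i + 1 < len(word):
        return tst_search(node.eq, word, i + 1)
    return node.is_word

root = None
for w in ["cat", "cap", "car", "dog"]:
    root = tst_insert(root, w)
print(tst_search(root, "cap"), tst_search(root, "ca"))  # True False
```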
Week 4

A section from Managing Gigabytes, Chapter 3, on Huffman coding and computing canonical Huffman codes.
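For intuition, here is a small Python sketch of the canonical code-assignment step, assuming the standard construction (symbols sorted by code length, then handed consecutive codes within each length). See the chapter for where the code lengths come from and why canonical codes make decoding fast.

```python
# Sketch of canonical Huffman code assignment: given code lengths
# (from an ordinary Huffman tree), assign codes so that codes of equal
# length are consecutive integers.  Standard construction, for intuition.
def canonical_codes(lengths):
    """lengths: dict mapping symbol -> code length in bits."""
    codes, code, prev_len = {}, 0, None
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        if prev_len is not None:
            code = (code + 1) << (length - prev_len)  # next code, widened
        codes[sym] = format(code, "0{}b".format(length))
        prev_len = length
    return codes

print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 3}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
```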
Week 5

"Ranking Algorithms" (first half), by Donna Harman, is chapter 14 in the Frakes and Baeza-Yates book. It gives a good discussion of the tradeoffs and choices among different term-weighting strategies.
Week 6

Please read Okapi at TREC-3, by Robertson et al. This presents the "BM25" weighting scheme, which is currently the best-performing "classical" probabilistic ranking algorithm. Their TREC-5 paper gives some later perspective on the parameter settings. These papers give concrete performance numbers for the "2-Poisson" paper handed out in class.
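For reference, here is the commonly cited simplified form of the BM25 term weight in Python. The parameter values k1 = 1.2 and b = 0.75 are conventional defaults from the later literature, not necessarily the TREC-3 settings; read the paper for the full formula and the query-side weighting.

```python
import math

# Commonly cited simplified BM25 term weight.  k1 and b defaults are
# conventional values from the literature, not the TREC-3 settings.
def bm25_weight(tf, df, N, dl, avdl, k1=1.2, b=0.75):
    """tf: term freq in doc; df: doc freq; N: total docs;
    dl: doc length; avdl: average doc length."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    K = k1 * ((1 - b) + b * dl / avdl)      # length-normalized saturation
    return idf * (tf * (k1 + 1)) / (tf + K)

# A term with tf=2, in 100 of 10,000 docs, doc 1.5x average length:
print(round(bm25_weight(2, 100, 10000, 300, 200), 2))  # ~5.53
```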