UMBC CMSC 491/691-I Fall 2002
Last updated: 27 October 2002
Week | Topic | Readings | Assignment |
---|---|---|---|
1: 8/29 | Introduction to IR, course overview | Ch. 1 and papers | |
2: 9/3 | User vs. System-centered IR; Text processing | Sec. 7.1-2 and papers | Homework 1; Project Phase I released |
3: 9/10 | Inverted indices | Sec. 8.1-2 and papers | Homework 2 |
4: 9/17 | Compression of indices and text | Rest of Ch. 7 and papers | Homework 3 |
5: 9/24 | IR Models: Boolean | Ch. 2 | |
6: 10/1 | IR Models: Vector Space | Ch. 2 and papers | Project Phase II out |
7: 10/8 | IR Models: Probabilistic Model | Ch. 2 and papers | |
8: 10/15 | Evaluation, Test Collections | Ch. 3, TREC web site | Homework 4; Project Phase III out |
9: 10/22 | Relevance feedback, Latent Semantic Indexing | Ch. 5, papers | Homework 5 |
10: 10/29 | Web Search | Ch. 13, papers | Phase III proposals due (10/31) |
11: 11/5 | Information Filtering; Collaborative Filtering | Papers | Homework 6 |
12: 11/12 | Distributed IR; Agent-based IR | Ch. 9, papers | |
13: 11/18 | User interfaces for IR | Ch. 10 | |
14: 11/26 | Presentations (11/28 - Thanksgiving) | | |
15: 12/3 | Presentations | | Project due |
16: 12/10 | Presentations (12/10 - last day of class; 12/12 - exam day, 6-8pm) | | |
The primary text for the course is Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. This is a fairly new text that covers a lot of ground. An additional recommended text, Managing Gigabytes, by Ian Witten, Alistair Moffat, and Tim Bell, focuses on the details of implementing a search system.
To fill in the details, a number of papers will also be assigned; they will be made available either on these web pages or as handouts in class. (Some papers from the web may require you to access them from UMBC, such as those from the ACM Digital Library or JASIS.) These papers are either seminal reading on the topic or cutting-edge results from recent conferences and journals. They are required reading for graduate students and recommended reading for undergraduates.
Component | Weight | Notes |
---|---|---|
Project | 70% | each phase in equal proportion |
Homework | 20% | HW 1-6 |
Class Participation | 10% | |
As you can see, the project forms most of the grade for this course. The project consists of multiple phases, each contributing to the final grade. Homework assignments will be smaller-scale explorations of the course material. Class participation means contributing to class discussions, asking intelligent questions, and (possibly) an in-class presentation either on material outside the lecture or on an interesting aspect of your project.
The project is due on 12/3. That doesn't leave much time for lateness; if you're planning an incomplete, please see me before this point or I may not allow it.
Homework is generally due on the Monday after it is assigned. Late homework will not be accepted without prior arrangement. If you know you will miss a class, check the web page to make sure nothing is due. If you are unexpectedly absent, email me.
Students are expected to adhere to the highest standards of academic integrity. It is vital that your work be your own. Cheating, fabrication, plagiarism, and other forms of academic dishonesty will not be tolerated. The consequences include a failing grade, notification of the university, and possible expulsion from the university.
Week 1
As We May Think, by Vannevar Bush (1945). A visionary article that made a lot of interesting predictions.
The Seven Ages of Information Retrieval, by Michael Lesk (1995). A brief, systems-oriented history of IR.
Information Retrieval, 2nd Edition, Chapter 1, by Keith van Rijsbergen (1979). Along with Salton and McGill's 1983 book "Introduction to Modern Information Retrieval", this was the standard IR textbook for many years. Dr. van Rijsbergen has made it available on the web since it is out of print. Chapter 1 gives an interesting counterpoint to our textbook's introduction.
Week 2
Chapters 7 and 8 from Information Retrieval: Data Structures and Algorithms, by W. Frakes and R. Baeza-Yates, supplement the material in Chapters 7 and 8 of our textbook. The link above is the book's home page. All the source code from the book is also available from Frakes' web pages; for example, you can download Frakes' DFA generator from there and use it in your project.
"An algorithm for suffix stripping", by M. F. Porter, is the original paper on the eponymous Porter stemmer. You can implement the stemmer from his paper, if you like, or you can download code (in C, Java, Perl, Python, C#, and Common Lisp!) from Porter's own page.
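The core idea of Porter's paper can be sketched in a few lines. This toy rule set is mine, purely illustrative; the real Porter stemmer applies five ordered phases with measure-based conditions, so download Porter's own code for project use:

```python
# Toy rule-based suffix stripping in the spirit of Porter's algorithm.
# NOT the real Porter stemmer -- it only shows the core move:
# try suffix rules longest-first and rewrite the word's ending.

TOY_RULES = [            # (suffix, replacement); illustrative rules only
    ("ational", "ate"),  # relational    -> relate
    ("ization", "ize"),  # normalization -> normalize
    ("ness", ""),        # darkness      -> dark
    ("sses", "ss"),      # caresses      -> caress
    ("ies", "i"),        # ponies        -> poni
    ("ing", ""),         # indexing      -> index
    ("s", ""),           # cats          -> cat
]

def toy_stem(word):
    """Apply the first matching rule; rules are ordered longest-first."""
    for suffix, replacement in TOY_RULES:
        # Require a stem of at least two letters so "sing" isn't gutted.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)] + replacement
    return word
```

Note that stems need not be words ("ponies" becomes "poni"); a stemmer only has to conflate related forms consistently.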
Week 3
Chapter 3 from Frakes and Baeza-Yates. This chapter, by Harman et al., discusses B-trees, inverted-index construction, and a sophisticated inversion algorithm.
Ternary Search Trees (Dr. Dobb's Journal, April 1998), by Bentley and Sedgewick, is a very nice article on ternary digital search trees, with code examples. It links to their home page, where you can find the longer symposium paper, which will be of interest to graduate students. TSTs are a compromise between (n-ary) tries and binary search trees.
Prefix B-Trees, by Bayer and Unterauer (ACM TODS 2/1, Mar 1977).
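To see the trie/BST compromise concretely, here is a minimal TST sketch: each node stores one character plus low/equal/high children, so branching costs a comparison rather than an n-way array. (Bentley and Sedgewick's article gives the C version; this Python transcription is mine and assumes non-empty keys.)

```python
# Minimal ternary search tree (TST): BST-style branching on one
# character per node, trie-style descent on the "equal" child.

class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "end")
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.end = False          # True if a key ends at this node

def insert(node, word, i=0):
    """Insert word[i:] below node; returns the (possibly new) subtree root."""
    ch = word[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.end = True
    return node

def contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return contains(node.lo, word, i)
    if ch > node.ch:
        return contains(node.hi, word, i)
    if i + 1 < len(word):
        return contains(node.eq, word, i + 1)
    return node.end

root = None
for w in ["cat", "cap", "can"]:
    root = insert(root, w)
```

A prefix of a stored key ("ca") is correctly reported absent because only terminal nodes have `end` set; the same structure extends naturally to prefix searches, which is why TSTs suit vocabulary lookup.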
Week 4
Managing Gigabytes, Chapter 3, covers Huffman coding and the computation of canonical Huffman codes for compressing text.
Compression of Inverted Indexes for Fast Query Evaluation, by Scholer et al. (SIGIR 2002; may only be accessible from UMBC), is a concise presentation of various index compression schemes, with particular attention to the effects of compression on retrieval speed.
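One of the schemes Scholer et al. compare, variable-byte coding of d-gaps, is simple enough to sketch here. The byte layout below (continuation bytes with the high bit clear, terminator byte with it set) is one common convention, not necessarily the exact one in the paper:

```python
# Variable-byte compression of an inverted list: store strictly
# increasing document ids as gaps, each gap in 7-bit chunks,
# most-significant chunk first; the high bit marks the final byte.

def vbyte_encode(docids):
    out = bytearray()
    prev = 0
    for d in docids:                  # docids must be strictly increasing
        gap = d - prev
        prev = d
        chunks = []                   # least-significant 7-bit chunk first
        while True:
            chunks.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        for c in reversed(chunks[1:]):
            out.append(c)             # continuation bytes: high bit clear
        out.append(chunks[0] | 0x80)  # last byte of the number: high bit set
    return bytes(out)

def vbyte_decode(data):
    docids, prev, n = [], 0, 0
    for b in data:
        if b & 0x80:                  # terminator: finish this gap
            n = (n << 7) | (b & 0x7F)
            prev += n
            docids.append(prev)
            n = 0
        else:                         # continuation: accumulate 7 more bits
            n = (n << 7) | b
    return docids
```

Small gaps (common document ids in long postings lists) cost one byte, which is the point: the compressed list is both smaller and, as the paper shows, often faster to scan than the uncompressed one.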
Week 6
"Ranking Algorithms" (first half), by Donna Harman, is Chapter 14 in the Frakes and Baeza-Yates book. It gives a good discussion of the tradeoffs and choices among different term-weighting strategies.
Exploring the Similarity Space, by Zobel and Moffat (SIGIR Forum, Spring 1998), is an extensive treatment of similarity functions, including term weights and document normalization schemes from a number of models and papers. They conducted many experiments on modern, large collections.
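For concreteness, here is one point in that similarity space: logarithmic tf, plain idf, and cosine normalization. The specific tf/idf formulas and the toy corpus are my illustrative choices; Zobel and Moffat enumerate dozens of variants:

```python
# Vector-space ranking sketch: tf-idf weights with cosine normalization.
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "cats and dogs",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)
idf = {t: math.log(N / df[t]) for t in df}

def weights(counts):
    """tf-idf vector with cosine (unit-length) normalization."""
    w = {t: (1 + math.log(c)) * idf.get(t, 0.0) for t, c in counts.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

doc_vecs = {d: weights(c) for d, c in tf.items()}

def rank(query):
    """Score every document by inner product with the query vector."""
    qv = weights(Counter(query.split()))
    scores = {d: sum(qv[t] * dv.get(t, 0.0) for t in qv)
              for d, dv in doc_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Swapping in a different tf, idf, or normalization formula changes only `weights`, which is exactly how the paper's experiments are organized.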
Week 7
Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval, by Robertson and Walker, discusses the theory behind the BM11 weighting scheme, the direct ancestor of the BM25 scheme used in their later TREC papers. BM25 is currently the best-performing "classical" probabilistic ranking algorithm.
Okapi at TREC-3, by Robertson et al., shows how well it works in practice and how to set the parameters. Their TREC-5 paper gives some later perspective on the parameter settings. These two papers give concrete performance numbers for the "2-Poisson" paper.
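The BM25 formula itself is compact enough to sketch. The parameter values k1 = 1.2 and b = 0.75 are the commonly cited defaults rather than numbers from these papers, the corpus is a toy, and the "+1" inside the log is a small variant I use to keep idf nonnegative on a three-document collection (Robertson's formulation omits it):

```python
# BM25 scoring sketch: idf times a saturating, length-normalized tf.
import math
from collections import Counter

docs = {
    "d1": "information retrieval systems",
    "d2": "probabilistic models of information retrieval",
    "d3": "databases and query languages",
}

N = len(docs)
tf = {d: Counter(t.split()) for d, t in docs.items()}
doclen = {d: sum(c.values()) for d, c in tf.items()}
avgdl = sum(doclen.values()) / N
df = Counter(t for c in tf.values() for t in c)

def bm25(query, d, k1=1.2, b=0.75):
    score = 0.0
    for t in query.split():
        if t not in tf[d]:
            continue
        # Robertson-Sparck Jones idf; "+1" keeps it positive on tiny corpora.
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        f = tf[d][t]
        # Saturating tf, normalized by document length relative to average.
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doclen[d] / avgdl))
    return score
```

The two knobs are the ones the TREC papers tune: k1 controls how quickly repeated occurrences of a term stop helping, and b controls how strongly long documents are penalized (b = 1 recovers BM11-style full length normalization, b = 0 none).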
Week 9 |
Improving Retrieval Performance by Relevance Feedback, by Salton
and Buckley. (JASIS 41/4, June 1990; may only be accessible from UMBC)
Indexing by Latent Semantic Analysis, by Deerwester et al. (JASIS 41/6, Sep 1990; may only be accessible from UMBC) |
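The classical feedback method at the heart of the Salton and Buckley survey is Rocchio's: move the query vector toward the centroid of the judged-relevant documents and away from the non-relevant ones. The alpha/beta/gamma values below are common illustrative defaults, not the paper's recommendation:

```python
# Rocchio relevance feedback sketch. Vectors are dicts: term -> weight.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:      # pull toward the relevant centroid
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:   # push away from the non-relevant centroid
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)   # negative weights are conventionally clipped
    return new_q
```

Note the side effect that makes feedback powerful: terms the user never typed (here, terms appearing only in relevant documents) enter the new query with positive weight, i.e. automatic query expansion.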
Week 10
There are no required readings aside from the textbook this week. However, in preparing the lecture I made use of several papers which you might find useful for details on algorithms, architectures, or implementations:
"Graph Structure in the Web", by Broder et al., is one of the best recent papers on web characterization.
"The Connectivity Server" describes the first version of the architecture they used to discover the bow-tie model.
Jon Kleinberg's home page has links to his papers on the HITS ("hubs and authorities") ranking algorithm.
"The Anatomy of a Large-Scale Hypertextual Web Search Engine" describes the precommercial Google system, including a small section on the PageRank algorithm.
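The PageRank computation described in the Brin and Page paper reduces to a simple iteration: repeatedly redistribute each page's rank along its out-links, mixed with a uniform "random jump". The damping factor d = 0.85 is the paper's value; the tiny link graph and the uniform treatment of dangling pages are my illustrative choices:

```python
# Power-iteration sketch of PageRank over an adjacency-list link graph.

def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Everyone starts with the random-jump share.
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:          # split p's rank among its out-links
                    new[q] += d * share
            else:                       # dangling page: spread rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
```

In this graph "c" ends up ranked highest: it is the only page linked from two others, and rank is a fixed point of the redistribution, not a simple in-link count.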
Week 11 |
Information Filtering and Information Retrieval: Two
sides of the same coin? by Nicholas Belkin and Bruce Croft. (CACM
35/12, Dec 1992) Social Information Filtering: Algorithms for Auotmating 'Word of Mouth', by Upendra Shardnand and Pattie Maes (CHI'95) The Science of the Sleeper", by Malcolm Gladwell (The New Yorker, 4 Oct 1999) |
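The "word of mouth" idea from the Shardanand and Maes paper can be sketched as user-based collaborative filtering: find users whose ratings resemble yours and predict from their opinions. The cosine similarity and the toy ratings matrix below are my illustrative choices (the paper itself evaluates several similarity measures):

```python
# User-based collaborative filtering sketch: predict an unseen rating as
# a similarity-weighted average of other users' ratings for that item.
import math

ratings = {
    "ann":   {"jazz": 5, "rock": 1, "folk": 4},
    "bob":   {"jazz": 4, "rock": 2, "folk": 5},
    "carol": {"jazz": 1, "rock": 5},
}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def predict(user, item):
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine(ratings[user], ratings[other])
        num += s * r[item]
        den += abs(s)
    return num / den if den else 0.0
```

Notice that content plays no role: the system never inspects what "folk" is, only who liked it, which is precisely what distinguishes social filtering from the content-based filtering of the Belkin and Croft paper.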
Week 12 |
Distributed
Information Retrieval by Jamie Callan (2000), is Chapter 5 of Bruce
Croft's Advances in Information Retrieval and has an excellent
set of references. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. by Gravano et al. (Stanford University Tech Note STAN-CS-TN-95-21, May 1995). Methodologies for Distributed Information Retrieval, by de Krester et al. (ICDCS 1998) |
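The collection-selection problem these papers address can be sketched in the style of GlOSS: the broker holds only per-term document frequencies for each collection and estimates, under a term-independence assumption, how many documents in each would match the query. The estimator and toy statistics below are my illustration of the idea, not the paper's exact formulation:

```python
# GlOSS-style server selection sketch: rank collections by an estimated
# number of matching documents, computed from df summaries alone.

stats = {
    "news":   {"size": 1000, "df": {"election": 300, "budget": 120}},
    "sports": {"size": 800,  "df": {"election": 10,  "score": 400}},
}

def estimate_matches(query_terms, coll):
    """Estimated count of docs containing ALL query terms, assuming term
    occurrences are independent across documents."""
    est = float(coll["size"])
    for t in query_terms:
        est *= coll["df"].get(t, 0) / coll["size"]
    return est

def rank_collections(query_terms):
    return sorted(stats,
                  key=lambda c: estimate_matches(query_terms, stats[c]),
                  reverse=True)
```

The appeal is the tiny summary: the broker never sees postings lists or documents, only a df table per collection, so the ranking decision costs almost nothing compared with actually querying every server.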