UMBC CMSC 491/691-I Fall 2002
Last updated: 27 October 2002
Week | Topic | Readings | Assignment |
---|---|---|---|
1: 8/29 | Introduction to IR, course overview | Ch. 1 and papers | |
2: 9/3 | User vs. System-centered IR; Text processing | Sec. 7.1-2 and papers | Homework 1; Project Phase I released |
3: 9/10 | Inverted indices | Sec. 8.1-2 and papers | Homework 2 |
4: 9/17 | Compression of indices and text | Rest of Ch. 7 and papers | Homework 3 |
5: 9/24 | IR Models: Boolean | Ch. 2 | |
6: 10/1 | IR Models: Vector Space | Ch. 2 and papers | Project Phase II out |
7: 10/8 | IR Models: Probabilistic Model | Ch. 2 and papers | |
8: 10/15 | Evaluation, Test Collections | Ch. 3, TREC web site | Homework 4; Project Phase III out |
9: 10/22 | Relevance feedback, Latent Semantic Indexing | Ch. 5, papers | Homework 5 |
10: 10/29 | Web Search | Ch. 13, papers | Phase III proposals due (10/31) |
11: 11/5 | Information Filtering; Collaborative Filtering | Papers | Homework 6 |
12: 11/12 | Distributed IR; Agent-based IR | Ch. 9, papers | |
13: 11/18 | User interfaces for IR | Ch. 10 | |
14: 11/26 | Presentations (11/28 - Thanksgiving) | | |
15: 12/3 | Presentations | | Project due |
16: 12/10 | Presentations (12/10 - last day of class; 12/12 - exam day, 6-8pm) | | |
The primary text for the course is Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. This is a fairly new text that covers a lot of ground. An additional recommended text, Managing Gigabytes, by Ian Witten, Alistair Moffat, and Tim Bell, focuses on the details of implementing a search system.
To fill in the details, a number of papers will also be assigned; they will be made available either on these web pages or as handouts in class. (Some papers from the web may require you to access them from UMBC, such as those from the ACM Digital Library or JASIS.) These papers are either seminal reading on the topic or cutting-edge results from recent conferences and journals. They are required reading for graduate students and recommended reading for undergraduates.
Component | Weight | Notes |
---|---|---|
Project | 70% | each phase in equal proportion |
Homework | 20% | HW 1-6 |
Class Participation | 10% | |
As you can see, the project forms most of the grade for this course. The project consists of multiple phases, each contributing to the final grade. Homework assignments will be smaller-scale explorations of the course material. Class participation means contributing to class discussions, asking intelligent questions, and (possibly) an in-class presentation either on material outside the lecture or on an interesting aspect of your project.
The project is due on 12/3. That doesn't leave much time for lateness; if you're planning an incomplete, please see me before this point or I may not allow it.
Homework is generally due on the Monday after it is assigned. Late homework will not be accepted without prior arrangement. If you know you will miss a class, check the web page to make sure nothing is due. If you are unexpectedly absent, email me.
Students are expected to adhere to the highest standards of academic integrity. It is vital that your work be your own. Cheating, fabrication, plagiarism, and other forms of academic dishonesty will not be tolerated. The consequences include a failing grade, notification of the university, and possible expulsion from the university.
Week 1
As We May Think, by Vannevar Bush (1945). A visionary article that made a lot of interesting predictions.
The Seven Ages of Information Retrieval, by Michael Lesk (1995). A brief, systems-oriented history of IR.
Information Retrieval, 2nd Edition, Chapter 1, by Keith van Rijsbergen (1979). Along with Salton and McGill's 1983 book "Introduction to Modern Information Retrieval", this was the standard IR textbook for many years. Dr. van Rijsbergen has made it available on the web since it is out of print. Chapter 1 gives an interesting counterpoint to our textbook's introduction.
Week 2
Chapters 7 and 8 from Information Retrieval: Data Structures and Algorithms, by W. Frakes and R. Baeza-Yates, supplement the material in Chapters 7 and 8 of our textbook. The link above is the book's home page. All the source code from the book is also available from Frakes' web pages; for example, you can download Frakes' DFA generator from there and use it in your project.
"An algorithm for suffix stripping", by M. F. Porter, is the original paper on the eponymous Porter stemmer. You can implement the stemmer from his paper, if you like, or you can download code (in C, Java, Perl, Python, C#, and Common Lisp!) from Porter's own page.
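The core idea of Porter's paper can be sketched in a few lines. This toy rule set is mine, purely illustrative; the real Porter stemmer applies five ordered phases with measure-based conditions, so download Porter's own code for project use:

```python
# Toy rule-based suffix stripping in the spirit of Porter's algorithm.
# NOT the real Porter stemmer -- it only shows the core move:
# try suffix rules longest-first and rewrite the word's ending.

TOY_RULES = [            # (suffix, replacement); illustrative rules only
    ("ational", "ate"),  # relational    -> relate
    ("ization", "ize"),  # normalization -> normalize
    ("ness", ""),        # darkness      -> dark
    ("sses", "ss"),      # caresses      -> caress
    ("ies", "i"),        # ponies        -> poni
    ("ing", ""),         # indexing      -> index
    ("s", ""),           # cats          -> cat
]

def toy_stem(word):
    """Apply the first matching rule; rules are ordered longest-first."""
    for suffix, replacement in TOY_RULES:
        # Require a stem of at least two letters so "sing" isn't gutted.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)] + replacement
    return word
```

Note that stems need not be words ("ponies" becomes "poni"); a stemmer only has to conflate related forms consistently.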
Week 3
Chapter 3 from Frakes and Baeza-Yates. This chapter, by Harman et al., discusses B-trees, inverted-index construction, and a sophisticated inversion algorithm.
Ternary Search Trees (Dr. Dobb's Journal, April 1998), by Bentley and Sedgewick, is a very nice article on ternary digital search trees, with code examples. It links to their home page, where you can find the longer symposium paper, which will be of interest to graduate students. TSTs are a compromise between (n-ary) tries and binary search trees.
Prefix B-Trees, by Bayer and Unterauer (ACM TODS 2/1, Mar 1977).
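To see the trie/BST compromise concretely, here is a minimal TST sketch: each node stores one character plus low/equal/high children, so branching costs a comparison rather than an n-way array. (Bentley and Sedgewick's article gives the C version; this Python transcription is mine and assumes non-empty keys.)

```python
# Minimal ternary search tree (TST): BST-style branching on one
# character per node, trie-style descent on the "equal" child.

class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "end")
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.end = False          # True if a key ends at this node

def insert(node, word, i=0):
    """Insert word[i:] below node; returns the (possibly new) subtree root."""
    ch = word[i]
    if node is None:
        node = Node(ch)
    if ch < node.ch:
        node.lo = insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = insert(node.eq, word, i + 1)
    else:
        node.end = True
    return node

def contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return contains(node.lo, word, i)
    if ch > node.ch:
        return contains(node.hi, word, i)
    if i + 1 < len(word):
        return contains(node.eq, word, i + 1)
    return node.end

root = None
for w in ["cat", "cap", "can"]:
    root = insert(root, w)
```

A prefix of a stored key ("ca") is correctly reported absent because only terminal nodes have `end` set; the same structure extends naturally to prefix searches, which is why TSTs suit vocabulary lookup.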
Week 4
Managing Gigabytes, Chapter 3, covers Huffman coding and the computation of canonical Huffman codes for compressing text.
Compression of Inverted Indexes for Fast Query Evaluation, by Scholer et al. (SIGIR 2002; may only be accessible from UMBC), is a concise presentation of various index compression schemes, with particular attention to the effects of compression on retrieval speed.
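One of the schemes Scholer et al. compare, variable-byte coding of d-gaps, is simple enough to sketch here. The byte layout below (continuation bytes with the high bit clear, terminator byte with it set) is one common convention, not necessarily the exact one in the paper:

```python
# Variable-byte compression of an inverted list: store strictly
# increasing document ids as gaps, each gap in 7-bit chunks,
# most-significant chunk first; the high bit marks the final byte.

def vbyte_encode(docids):
    out = bytearray()
    prev = 0
    for d in docids:                  # docids must be strictly increasing
        gap = d - prev
        prev = d
        chunks = []                   # least-significant 7-bit chunk first
        while True:
            chunks.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        for c in reversed(chunks[1:]):
            out.append(c)             # continuation bytes: high bit clear
        out.append(chunks[0] | 0x80)  # last byte of the number: high bit set
    return bytes(out)

def vbyte_decode(data):
    docids, prev, n = [], 0, 0
    for b in data:
        if b & 0x80:                  # terminator: finish this gap
            n = (n << 7) | (b & 0x7F)
            prev += n
            docids.append(prev)
            n = 0
        else:                         # continuation: accumulate 7 more bits
            n = (n << 7) | b
    return docids
```

Small gaps (common document ids in long postings lists) cost one byte, which is the point: the compressed list is both smaller and, as the paper shows, often faster to scan than the uncompressed one.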
Week 6
"Ranking Algorithms" (first half), by Donna Harman, is Chapter 14 in the Frakes and Baeza-Yates book. It gives a good discussion of the tradeoffs and choices among different term-weighting strategies.
Exploring the Similarity Space, by Zobel and Moffat (SIGIR Forum, Spring 1998), is an extensive treatment of similarity functions, including term weights and document normalization schemes from a number of models and papers. They conducted many experiments on modern, large collections.
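For concreteness, here is one point in that similarity space: logarithmic tf, plain idf, and cosine normalization. The specific tf/idf formulas and the toy corpus are my illustrative choices; Zobel and Moffat enumerate dozens of variants:

```python
# Vector-space ranking sketch: tf-idf weights with cosine normalization.
import math
from collections import Counter

docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "cats and dogs",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(t for counts in tf.values() for t in counts)
idf = {t: math.log(N / df[t]) for t in df}

def weights(counts):
    """tf-idf vector with cosine (unit-length) normalization."""
    w = {t: (1 + math.log(c)) * idf.get(t, 0.0) for t, c in counts.items()}
    norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
    return {t: x / norm for t, x in w.items()}

doc_vecs = {d: weights(c) for d, c in tf.items()}

def rank(query):
    """Score every document by inner product with the query vector."""
    qv = weights(Counter(query.split()))
    scores = {d: sum(qv[t] * dv.get(t, 0.0) for t in qv)
              for d, dv in doc_vecs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Swapping in a different tf, idf, or normalization formula changes only `weights`, which is exactly how the paper's experiments are organized.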
Week 7
Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval, by Robertson and Walker, discusses the theory behind the BM11 weighting scheme, the direct ancestor of the BM25 scheme used in their later TREC papers. BM25 is currently the best-performing "classical" probabilistic ranking algorithm.
Okapi at TREC-3, by Robertson et al., shows how well it works in practice and how to set the parameters. Their TREC-5 paper gives some later perspective on the parameter settings. These two papers give concrete performance numbers for the "2-Poisson" paper.
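The BM25 formula itself is compact enough to sketch. The parameter values k1 = 1.2 and b = 0.75 are the commonly cited defaults rather than numbers from these papers, the corpus is a toy, and the "+1" inside the log is a small variant I use to keep idf nonnegative on a three-document collection (Robertson's formulation omits it):

```python
# BM25 scoring sketch: idf times a saturating, length-normalized tf.
import math
from collections import Counter

docs = {
    "d1": "information retrieval systems",
    "d2": "probabilistic models of information retrieval",
    "d3": "databases and query languages",
}

N = len(docs)
tf = {d: Counter(t.split()) for d, t in docs.items()}
doclen = {d: sum(c.values()) for d, c in tf.items()}
avgdl = sum(doclen.values()) / N
df = Counter(t for c in tf.values() for t in c)

def bm25(query, d, k1=1.2, b=0.75):
    score = 0.0
    for t in query.split():
        if t not in tf[d]:
            continue
        # Robertson-Sparck Jones idf; "+1" keeps it positive on tiny corpora.
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
        f = tf[d][t]
        # Saturating tf, normalized by document length relative to average.
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doclen[d] / avgdl))
    return score
```

The two knobs are the ones the TREC papers tune: k1 controls how quickly repeated occurrences of a term stop helping, and b controls how strongly long documents are penalized (b = 1 recovers BM11-style full length normalization, b = 0 none).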
Week 9 |
Improving Retrieval Performance by Relevance Feedback, by Salton
and Buckley. (JASIS 41/4, June 1990; may only be accessible from UMBC)
Indexing by Latent Semantic Analysis, by Deerwester et al. (JASIS 41/6, Sep 1990; may only be accessible from UMBC) |
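The classical feedback method at the heart of the Salton and Buckley survey is Rocchio's: move the query vector toward the centroid of the judged-relevant documents and away from the non-relevant ones. The alpha/beta/gamma values below are common illustrative defaults, not the paper's recommendation:

```python
# Rocchio relevance feedback sketch. Vectors are dicts: term -> weight.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    terms = set(query)
    for d in relevant + nonrelevant:
        terms |= set(d)
    new_q = {}
    for t in terms:
        w = alpha * query.get(t, 0.0)
        if relevant:      # pull toward the relevant centroid
            w += beta * sum(d.get(t, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:   # push away from the non-relevant centroid
            w -= gamma * sum(d.get(t, 0.0) for d in nonrelevant) / len(nonrelevant)
        new_q[t] = max(w, 0.0)   # negative weights are conventionally clipped
    return new_q
```

Note the side effect that makes feedback powerful: terms the user never typed (here, terms appearing only in relevant documents) enter the new query with positive weight, i.e. automatic query expansion.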
Week 10
There are no required readings aside from the textbook this week. However, in preparing the lecture I made use of several papers which you might find useful for details on algorithms, architectures, or implementations:
"Graph Structure in the Web", by Broder et al., is one of the best recent papers on web characterization.
"The Connectivity Server" describes the first version of the architecture they used to discover the bow-tie model.
Jon Kleinberg's home page has links to his papers on the HITS ("hubs and authorities") ranking algorithm.
"The Anatomy of a Large-Scale Hypertextual Web Search Engine" describes the precommercial Google system, including a small section on the PageRank algorithm.
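The PageRank computation described in the Brin and Page paper reduces to a simple iteration: repeatedly redistribute each page's rank along its out-links, mixed with a uniform "random jump". The damping factor d = 0.85 is the paper's value; the tiny link graph and the uniform treatment of dangling pages are my illustrative choices:

```python
# Power-iteration sketch of PageRank over an adjacency-list link graph.

def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # Everyone starts with the random-jump share.
        new = {p: (1 - d) / n for p in pages}
        for p, outs in links.items():
            if outs:
                share = rank[p] / len(outs)
                for q in outs:          # split p's rank among its out-links
                    new[q] += d * share
            else:                       # dangling page: spread rank uniformly
                for q in pages:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
```

In this graph "c" ends up ranked highest: it is the only page linked from two others, and rank is a fixed point of the redistribution, not a simple in-link count.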
Week 11 |
Information Filtering and Information Retrieval: Two
sides of the same coin? by Nicholas Belkin and Bruce Croft. (CACM
35/12, Dec 1992) Social Information Filtering: Algorithms for Auotmating 'Word of Mouth', by Upendra Shardnand and Pattie Maes (CHI'95) The Science of the Sleeper", by Malcolm Gladwell (The New Yorker, 4 Oct 1999) |
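The "word of mouth" idea from the Shardanand and Maes paper can be sketched as user-based collaborative filtering: find users whose ratings resemble yours and predict from their opinions. The cosine similarity and the toy ratings matrix below are my illustrative choices (the paper itself evaluates several similarity measures):

```python
# User-based collaborative filtering sketch: predict an unseen rating as
# a similarity-weighted average of other users' ratings for that item.
import math

ratings = {
    "ann":   {"jazz": 5, "rock": 1, "folk": 4},
    "bob":   {"jazz": 4, "rock": 2, "folk": 5},
    "carol": {"jazz": 1, "rock": 5},
}

def cosine(u, v):
    """Cosine similarity over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def predict(user, item):
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or item not in r:
            continue
        s = cosine(ratings[user], ratings[other])
        num += s * r[item]
        den += abs(s)
    return num / den if den else 0.0
```

Notice that content plays no role: the system never inspects what "folk" is, only who liked it, which is precisely what distinguishes social filtering from the content-based filtering of the Belkin and Croft paper.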
Week 12 |
Distributed
Information Retrieval by Jamie Callan (2000), is Chapter 5 of Bruce
Croft's Advances in Information Retrieval and has an excellent
set of references. Generalizing GlOSS to Vector-Space Databases and Broker Hierarchies. by Gravano et al. (Stanford University Tech Note STAN-CS-TN-95-21, May 1995). Methodologies for Distributed Information Retrieval, by de Krester et al. (ICDCS 1998) |
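The collection-selection problem these papers address can be sketched in the style of GlOSS: the broker holds only per-term document frequencies for each collection and estimates, under a term-independence assumption, how many documents in each would match the query. The estimator and toy statistics below are my illustration of the idea, not the paper's exact formulation:

```python
# GlOSS-style server selection sketch: rank collections by an estimated
# number of matching documents, computed from df summaries alone.

stats = {
    "news":   {"size": 1000, "df": {"election": 300, "budget": 120}},
    "sports": {"size": 800,  "df": {"election": 10,  "score": 400}},
}

def estimate_matches(query_terms, coll):
    """Estimated count of docs containing ALL query terms, assuming term
    occurrences are independent across documents."""
    est = float(coll["size"])
    for t in query_terms:
        est *= coll["df"].get(t, 0) / coll["size"]
    return est

def rank_collections(query_terms):
    return sorted(stats,
                  key=lambda c: estimate_matches(query_terms, stats[c]),
                  reverse=True)
```

The appeal is the tiny summary: the broker never sees postings lists or documents, only a df table per collection, so the ranking decision costs almost nothing compared with actually querying every server.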