Copyright © 2004
Guang Huang

Information Retrieval System Design (CMSC 698 Course Project)


Summary
Description
...System Configuration, Environments & Resources
Main Ideas
...Using Sub Data Collection
...Data Structure of posting list


Summary

The system includes:
1) Basic information retrieval components: an indexing module and a searching module.
2) Dynamic data collection functionality
• Adding data files to the IR system
• Removing data files from the IR system
• Coalescing the IR system (under debugging)
3) Other functionality
• GUI for the system
• Dynamic selection of TF/IDF during the query stage
• Selection of query groups during the query stage

Description

System Configuration, Environments & Resources
Data Collections: Subset of Reuters Corpus Volume 1 (RCV1)
1) All the documents in RCV1 have been coded for topic, region, and industry sector. The topic codes represent the subject area(s) of each document. They are organized into four hierarchical groups, one under each of the 4 top-level nodes.
2) Each group is indexed as its own sub data collection.
The Reuters corpus is organized into 5 sub data collections:
--- Corporate/Industrial(CCAT)
--- Economics (ECAT)
--- Government/Social (GCAT)
--- Markets (MCAT)
--- Others (Others)
Stemming Algorithm (Java Version)
1) Porter stemming algorithm with some modifications
2) Modified interface (methods)
3) Modified for better performance in Java
Stop Lists: Stop list from Van Rijsbergen's textbook with some modifications
XML/SGML analysis package
1) The SGML processing package from http://www.gkrueger.com/java/, with some modifications.
2) Error handling: if a file contains an SGML error, this package stops processing.
Compression
1) None
Retrieval model: Vector Space Model (VSM)
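
As a rough illustration of vector space scoring with TF/IDF weights, here is a minimal Java sketch. The page does not state which weighting formula the system uses, so the classic tf * ln(N / df) weight and cosine similarity shown here are assumptions, and the class and method names (VsmSketch, tfIdf, cosine) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of vector space scoring; not the system's actual code. */
class VsmSketch {
    /** Assumed weight: tf * ln(N / df); 0 when the term does not occur. */
    static double tfIdf(int tf, int df, int numDocs) {
        if (tf == 0 || df == 0) {
            return 0.0;
        }
        return tf * Math.log((double) numDocs / df);
    }

    /** Cosine of the angle between two sparse term-weight vectors. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;   // only shared terms contribute
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;                        // an empty vector matches nothing
        }
        return dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        int numDocs = 100;                     // made-up collection size
        Map<String, Double> query = new HashMap<>();
        query.put("merger", tfIdf(1, 10, numDocs));
        Map<String, Double> doc = new HashMap<>();
        doc.put("merger", tfIdf(3, 10, numDocs));
        doc.put("profit", tfIdf(2, 40, numDocs));
        System.out.println("score = " + cosine(query, doc));
    }
}
```

Selecting TF or IDF at query time could then amount to swapping the weight function used to build the two vectors, while the cosine step stays the same.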

Main Ideas

Using Sub Data Collection
Each sub collection holds the documents of one specific topic group. Each sub collection has its own inverted index file, keyword index file, and document info file.
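
The idea of one index per sub collection, with the query restricted to selected groups, can be sketched in Java as follows. This is an in-memory stand-in only: the real system keeps its indexes in files, and the class and method names here (SubCollectionSearch, index, search) as well as the document ids are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one small inverted index per sub collection. */
class SubCollectionSearch {
    // group code (e.g. "CCAT") -> term -> ids of documents containing the term
    private final Map<String, Map<String, List<String>>> indexes = new LinkedHashMap<>();

    /** Adds a document's terms to the index of its own sub collection only. */
    void index(String group, String docId, String... terms) {
        Map<String, List<String>> idx = indexes.computeIfAbsent(group, g -> new HashMap<>());
        for (String term : terms) {
            List<String> postings = idx.computeIfAbsent(term, t -> new ArrayList<>());
            if (!postings.contains(docId)) {
                postings.add(docId);
            }
        }
    }

    /** Looks the term up only in the selected groups; hits are labeled "group:docId". */
    List<String> search(String term, Collection<String> groups) {
        List<String> hits = new ArrayList<>();
        for (String group : groups) {
            Map<String, List<String>> idx = indexes.get(group);
            if (idx == null) {
                continue;                      // unknown or empty sub collection
            }
            for (String docId : idx.getOrDefault(term, Collections.<String>emptyList())) {
                hits.add(group + ":" + docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        SubCollectionSearch ir = new SubCollectionSearch();
        ir.index("CCAT", "d1", "profit", "merger");
        ir.index("MCAT", "d2", "profit");
        ir.index("GCAT", "d3", "election");
        List<String> selected = new ArrayList<>();
        selected.add("CCAT");
        selected.add("MCAT");
        System.out.println(ir.search("profit", selected));
    }
}
```

Because each group has its own index, a query that selects only some groups never touches the index data of the others.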
Data Structure of posting list (for the update of the document)
Redundant space + overflow list
(Always keep redundant space equal to 2% - 5% of the posting list length)
1) Add a new document
a) Parse the document to get the triples (word, docid, freq).
b) Insert the pair (docid, freq) into the posting list whose keyword is “word”:
i. If the redundant space is large enough to hold the pair, put it in the posting list.
ii. If the redundant space is too small, allocate overflow space (a further 5% of the posting list length) for this posting list and insert the pair there.
iii. For a new keyword, just allocate a posting list and append it to the inverted index file.
iv. Append the new overflow list to the inverted index file and correct the pointer.
2) Delete a document
a) Parse the document to get the pairs (word, docid).
b) Delete the pair (docid, freq) from the posting list whose keyword is “word”.
c) The posting list does not need to shrink.
3) Modify a document (can be implemented with the delete and add operations)
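
The add and delete steps above can be sketched in Java for a single posting list. This is an in-memory approximation under stated assumptions, not the system's file-based code: blocks are arrays rather than regions of the inverted index file, deletion marks a slot with a tombstone (the page only says the list does not shrink), docids are assumed non-negative, and all names (SlackPostingList, Block, spareSlotsFor) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a main block with spare slots, chained to overflow blocks. */
class SlackPostingList {
    /** Fixed-capacity run of (docId, freq) pairs; unused slots are the spare space. */
    private static final class Block {
        final int[] docIds;
        final int[] freqs;
        int used;
        Block(int capacity) {
            docIds = new int[capacity];
            freqs = new int[capacity];
        }
    }

    private static final int DELETED = -1;   // tombstone; assumes real docids are >= 0
    private static final int PERCENT = 5;    // spare and overflow size, per the 5% above

    private final List<Block> blocks = new ArrayList<>();
    private int live;                         // postings not yet deleted

    SlackPostingList(int expectedPostings) {
        blocks.add(new Block(expectedPostings + spareSlotsFor(expectedPostings)));
    }

    /** Integer ceiling of PERCENT% of n, but never less than one slot. */
    private static int spareSlotsFor(int n) {
        return Math.max(1, (n * PERCENT + 99) / 100);
    }

    /** Steps b.i and b.ii: use spare space if any, else chain a new overflow block. */
    void add(int docId, int freq) {
        Block last = blocks.get(blocks.size() - 1);
        if (last.used == last.docIds.length) {
            last = new Block(spareSlotsFor(capacity()));
            blocks.add(last);                 // step b.iv: link the overflow block
        }
        last.docIds[last.used] = docId;
        last.freqs[last.used] = freq;
        last.used++;
        live++;
    }

    /** Step 2: mark the pair as deleted; capacity is left unchanged (no shrink). */
    boolean delete(int docId) {
        for (Block b : blocks) {
            for (int i = 0; i < b.used; i++) {
                if (b.docIds[i] == docId) {
                    b.docIds[i] = DELETED;
                    live--;
                    return true;
                }
            }
        }
        return false;
    }

    /** Returns the stored frequency, or 0 if the document is not in the list. */
    int freqOf(int docId) {
        for (Block b : blocks) {
            for (int i = 0; i < b.used; i++) {
                if (b.docIds[i] == docId) {
                    return b.freqs[i];
                }
            }
        }
        return 0;
    }

    int capacity() {
        int total = 0;
        for (Block b : blocks) {
            total += b.docIds.length;
        }
        return total;
    }

    int size() { return live; }

    int blockCount() { return blocks.size(); }

    public static void main(String[] args) {
        SlackPostingList list = new SlackPostingList(20);
        for (int doc = 1; doc <= 22; doc++) {
            list.add(doc, 1);                 // the 22nd add spills into an overflow block
        }
        list.delete(5);
        System.out.println(list.size() + " postings in " + list.blockCount() + " blocks");
    }
}
```

Modifying a document then reduces to delete followed by add, as in step 3. The tombstoned slots stay in place until some later compaction, which is presumably what the coalesce operation is for.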