Home Page --- Guang Huang

Information Retrieval System Design (Course Projects, CMSC 698)

Summary
Description
...System Configuration, Environments & Resources
Main Ideas
...Using Sub Data Collection
...Data Structure of posting list

Download the code from HERE

Summary

The system includes:
1) Basic information retrieval components: an indexing module and a searching module.
2) Dynamic data collection IR functionality
• Adding data files to the IR system
• Removing data files from the IR system
• Coalescing the IR system (under debugging)
3) Other functionality
• GUI for the system
• TF/IDF weighting selected dynamically at query time
• Selection of query groups at query time

Description

System Configuration, Environments & Resources
• Data collection: a subset of Reuters Corpus Volume 1 (RCV1)
1) All documents in RCV1 are coded by topic, region, and industry sector. The topic codes represent the subject area(s) of each document. They are organized into four hierarchical groups, one under each of 4 top-level nodes.
2) Each group is indexed as a sub data collection.
The Reuters corpus is organized into 5 sub data collections:
--- Corporate/Industrial (CCAT)
--- Economics (ECAT)
--- Government/Social (GCAT)
--- Markets (MCAT)
--- Others
• Stemming algorithm (Java version)
1) Porter stemming algorithm with some modifications
2) Modified interface (methods)
3) Modified for better performance in Java
• Stop list: the stop list from Van Rijsbergen's textbook, with some modifications
• XML/SGML analysis package
1) The SGML processing package from http://www.gkrueger.com/java/ with some modifications.
2) Error handling: if a file contains an SGML error, this package stops processing it.
• Compression
1) None
• Retrieval model: VSM (Vector Space Model)
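
As a rough illustration of how a Vector Space Model ranks documents, the sketch below weights terms and compares a query with documents by cosine similarity. The page does not state which TF/IDF variant the system uses, so the weighting tf * ln(N/df) and all class and method names here are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Minimal VSM sketch; the weighting scheme and all names are assumptions, not the project's code. */
class VectorSpace {
    /** Counts raw term frequencies in a list of (already stemmed, stop-word-free) tokens. */
    static Map<String, Integer> termFreq(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String t : tokens) tf.merge(t, 1, Integer::sum);
        return tf;
    }

    /** Counts, for each term, the number of documents that contain it. */
    static Map<String, Integer> docFreq(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> d : docs)
            for (String t : d.keySet()) df.merge(t, 1, Integer::sum);
        return df;
    }

    /** Assumed weighting: tf * ln(N / df). Terms unknown to the collection are dropped. */
    static Map<String, Double> weigh(Map<String, Integer> tf, Map<String, Integer> df, int n) {
        Map<String, Double> w = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer d = df.get(e.getKey());
            if (d != null) w.put(e.getKey(), e.getValue() * Math.log((double) n / d));
        }
        return w;
    }

    /** Cosine similarity of two sparse vectors; returns 0 if either vector has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (double v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0.0 : dot / Math.sqrt(na * nb);
    }
}
```

To rank, weigh the query and each document with the same df table and sort the documents by descending cosine score. A document that shares no terms with the query scores 0.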

Main Ideas

Using Sub Data Collection
Each sub collection groups the documents of one specific topic. Each sub collection has its own inverted index file, keyword index file, and document info file.
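
The per-collection layout could be modeled roughly as follows. This is only a sketch: the file extensions (.inv, .key, .doc) and all class and method names are hypothetical, and the page does not describe how the real system names or opens its files.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the sub data collections; file names and API are hypothetical. */
class SubCollections {
    /** The three files each sub collection owns, as listed in the text. */
    static final class IndexFiles {
        final String invertedIndex;
        final String keywordIndex;
        final String docInfo;

        IndexFiles(String name) {
            invertedIndex = name + ".inv"; // posting lists
            keywordIndex = name + ".key";  // keyword -> posting list location
            docInfo = name + ".doc";       // per-document information
        }
    }

    /** The five sub collections named on this page. */
    static final String[] NAMES = {"CCAT", "ECAT", "GCAT", "MCAT", "Others"};

    /** Builds the file set of every sub collection, keyed by collection name. */
    static Map<String, IndexFiles> all() {
        Map<String, IndexFiles> map = new LinkedHashMap<>();
        for (String name : NAMES) map.put(name, new IndexFiles(name));
        return map;
    }

    /** Returns the file sets of the groups selected for a query, skipping unknown names. */
    static List<IndexFiles> select(Map<String, IndexFiles> all, String... groups) {
        List<IndexFiles> selected = new ArrayList<>();
        for (String g : groups) {
            IndexFiles files = all.get(g);
            if (files != null) selected.add(files);
        }
        return selected;
    }
}
```

Keeping the files separate per topic means a query restricted to some groups only has to open those groups' indexes.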
Data Structure of posting list (for document updates)
Redundant space + overflow list
(Always keep 2% - 5% of the posting list length as redundant space)
1) Add a new document
a) Parse the document and get the triples (word, docid, freq).
b) Insert the pair (docid, freq) into the posting list (keyword is “word”)
i. If the redundant space is large enough to hold the pair, put it in the posting list.
ii. If the redundant space is too small, allocate overflow space (5% more of the posting list) for this posting list, and insert the pair into the overflow space.
iii. For a new keyword, only a new posting list needs to be allocated and appended to the inverted index file.
iv. Append any newly allocated overflow list to the inverted index file and correct the pointer.
2) Delete a document
a) Parse the document and get the pairs (word, docid).
b) Delete the pair (docid, freq) from the posting list (keyword is “word”)
c) The posting list does not need to shrink.
3) Modify a document (can be implemented by a delete operation followed by an add operation)
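
The add, delete, and modify steps above can be sketched in memory as shown below. This is a simplified illustration, not the project's code: it keeps the blocks in a Java list instead of an on-disk inverted index file with pointers, it assumes 5% slack, and it assumes that slots freed by a delete may be reused by a later add. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** In-memory sketch of one posting list with redundant space and overflow blocks. */
class PostingList {
    static final double SLACK = 0.05;          // assumed: upper end of the 2% - 5% range
    static final double OVERFLOW_RATIO = 0.05; // "5% more of the posting list"

    // Block 0 is the main list; later blocks are overflow lists.
    // Each slot holds a {docId, freq} pair, or null when the slot is free.
    private final List<int[][]> blocks = new ArrayList<>();

    PostingList(int[][] initial) {
        int slack = Math.max(1, (int) Math.ceil(initial.length * SLACK));
        int[][] main = new int[initial.length + slack][];
        System.arraycopy(initial, 0, main, 0, initial.length);
        blocks.add(main);
    }

    /** Step 1b: use a free slot if there is one, otherwise allocate an overflow block. */
    void add(int docId, int freq) {
        int[] pair = {docId, freq};
        for (int[][] block : blocks) {
            for (int i = 0; i < block.length; i++) {
                if (block[i] == null) { block[i] = pair; return; }
            }
        }
        int size = Math.max(1, (int) Math.ceil(capacity() * OVERFLOW_RATIO));
        int[][] overflow = new int[size][];
        overflow[0] = pair;
        blocks.add(overflow); // on disk: append to the index file and fix the pointer
    }

    /** Step 2: free the slot but never shrink the list. Returns false if docId is absent. */
    boolean delete(int docId) {
        for (int[][] block : blocks) {
            for (int i = 0; i < block.length; i++) {
                if (block[i] != null && block[i][0] == docId) { block[i] = null; return true; }
            }
        }
        return false;
    }

    /** Step 3: modify = delete followed by add. */
    void update(int docId, int freq) {
        delete(docId);
        add(docId, freq);
    }

    /** Returns the stored frequency for docId, or -1 if it has no posting. */
    int freqOf(int docId) {
        for (int[][] block : blocks)
            for (int[] p : block)
                if (p != null && p[0] == docId) return p[1];
        return -1;
    }

    /** Total number of slots, used and free, across all blocks. */
    int capacity() {
        int c = 0;
        for (int[][] block : blocks) c += block.length;
        return c;
    }

    /** Number of postings currently stored. */
    int size() {
        int n = 0;
        for (int[][] block : blocks)
            for (int[] p : block)
                if (p != null) n++;
        return n;
    }

    int blockCount() { return blocks.size(); }
}
```

Note that reusing freed slots means postings are no longer ordered by docid, so a real index that relies on sorted posting lists would need a different policy or a periodic coalesce step.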
