Copyright © 2004
Guang Huang

Information Retrieval System Design


Summary
Description
...System Configuration, Environments & Resources
...Architecture
......Indexing Component
......Search Component
...Parallel Searching --- Distributed Information Retrieval System
Implementation
...Lexical analyzer
...Inverter
...Searching Component
Graphical User Interface
Compression
Vector Space Model (TF/IDF)

Download the source code from HERE

Summary

The system consists of the two basic information retrieval modules: an indexing component and a searching component. Its advanced features are a graphical user interface, parallel search, compression, and distributed data storage.

Description

System Configuration, Environments & Resources
Data Collections: Subset of Reuters Corpus Volume 1 (RCV1)
Stemming Algorithm (Java Version)
1) Porter Stemming Algorithm with some modifications
2) Modification of the interface (methods)
3) Modification for performance in Java
Stop Lists: Stop list from Van Rijsbergen's textbook with some modifications
XML/SGML analysis package
1) The SGML processing package from http://www.gkrueger.com/java/ with some modifications.
2) Error handling: if a file contains an SGML error, the package stops processing.
Compression
1) Variable compression is used for the posting file. It compresses the document gaps and the frequencies.
VSM (Vector Space Model)

Architecture
The Information Retrieval System has two basic component modules: indexing and searching. Their relationship is as follows: the indexing component generates the inverted index files and other utility files that the searching component uses.

[Figure: Architecture]

Indexing Component
The indexing component generates the inverted index file and some associated files.
It includes:
• Lexical analyzer: performs lexical analysis of the data collections.
• Inverter: generates the inverted index file.
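
The inverter step above can be illustrated with a minimal in-memory sketch. It is not the project's code: the class name SimpleInverter, its methods, and the simple split-based tokenization are assumptions made for illustration, and the real system writes index and posting files rather than keeping everything in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory inverter: maps each term to a list of (docId, frequency). */
final class SimpleInverter {
    /** One posting: a document ID and the term's frequency in that document. */
    static final class Posting {
        final int docId;
        int frequency;
        Posting(int docId) { this.docId = docId; this.frequency = 1; }
    }

    // A TreeMap keeps the dictionary sorted by term.
    private final Map<String, List<Posting>> index = new TreeMap<>();

    /** Adds one document. Documents are assumed to arrive in increasing docId order. */
    void addDocument(int docId, String text) {
        // Simplified tokenization: lowercase, split on runs of non-alphanumerics.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;
            List<Posting> postings = index.computeIfAbsent(token, k -> new ArrayList<>());
            Posting last = postings.isEmpty() ? null : postings.get(postings.size() - 1);
            if (last != null && last.docId == docId) {
                last.frequency++;                 // term seen again in the same document
            } else {
                postings.add(new Posting(docId)); // first occurrence in this document
            }
        }
    }

    /** Returns the postings for a term, or an empty list if the term is unknown. */
    List<Posting> postings(String term) {
        return index.getOrDefault(term, new ArrayList<>());
    }
}
```

Because documents are added in docId order, each postings list stays sorted, which is what makes gap compression of the posting file possible later.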

[Figure: Indexing Component]


Lexical analyzer

Inverter

Searching Component
Information Retrieval Model: Vector Space Model (the Boolean Model is also implemented, but without sorting the result documents by relevance)
Input: the search words, separated by spaces
Output: the sorted files that satisfy the Vector Space Model expression
Comments:
o With this design, the retrieval model can be changed without modifying much of the code.
o In the Vector Space Model, every query term can be searched in parallel.
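
The second comment can be sketched as follows: each query term is looked up on its own task, and the partial scores are then merged and sorted. This is only an illustration under stated assumptions. The class ParallelTermSearch, the in-memory map standing in for the index files, and the additive scoring are hypothetical, not the project's Searcher.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical searcher that looks up every query term in parallel. */
final class ParallelTermSearch {
    // Assumed index shape: term -> (docId -> weight of the term in that document).
    private final Map<String, Map<Integer, Double>> index;

    ParallelTermSearch(Map<String, Map<Integer, Double>> index) {
        this.index = index;
    }

    /** Returns docIds sorted by descending summed weight (ties broken by docId). */
    List<Integer> search(String query) {
        String[] terms = query.trim().toLowerCase(Locale.ROOT).split("\\s+");
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, terms.length));
        try {
            // One lookup task per search word.
            List<Future<Map<Integer, Double>>> partials = new ArrayList<>();
            for (String term : terms) {
                Callable<Map<Integer, Double>> lookup =
                    () -> index.getOrDefault(term, Collections.<Integer, Double>emptyMap());
                partials.add(pool.submit(lookup));
            }
            // Merge the per-term results into one score per document.
            Map<Integer, Double> scores = new HashMap<>();
            for (Future<Map<Integer, Double>> partial : partials) {
                partial.get().forEach((doc, w) -> scores.merge(doc, w, Double::sum));
            }
            List<Integer> ranked = new ArrayList<>(scores.keySet());
            ranked.sort((a, b) -> {
                int byScore = Double.compare(scores.get(b), scores.get(a));
                return byScore != 0 ? byScore : Integer.compare(a, b);
            });
            return ranked;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("term lookup failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With an in-memory map the parallelism gains little; it pays off when each lookup reads a posting list from disk or from a remote index server.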

Parallel Searching --- Distributed Information Retrieval System
In this scheme, the data collections, inverted index files, and associated files are distributed across multiple servers. There are different models for this scheme. In this project, I do not consider partitioning the data collection.
The distributed information retrieval system:
Centralized data collections: the data collections are stored on one server.
Multiple copies of the inverted index files and associated files.

[Figure: Distributed IR]

Implementation

Lexical analyzer



[Figure: Lexical Analyzer]

StemmingI: stemming interface; every specific stemming algorithm can derive from this interface
StopListI: stop list interface; every specific stop list algorithm can derive from this interface
TokenizerI: tokenizer interface; every specific tokenizer can derive from this interface
DefaultStemming: default stemming algorithm; performs no stemming
DefaultStopList: default stop list algorithm; uses no stop list
DefaultTokenizer: implements the tokenizer using StreamTokenizer
RijsbergenStopList: uses the Rijsbergen stop list
RegExpTokenizer: implements the tokenizer using regular-expression patterns
BasicLexicalAnalysis: basic lexical analysis; every specific lexical analyzer can derive from this class. The class only implements the getNextToke
LexicalAnalysisForProject: the lexical analysis used for this project
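
How these pieces plug together can be sketched roughly as below. The interface names come from the list above, but every method signature, the helper classes RegexSplitTokenizer, SetStopList, NoOpStemming, and LexicalPipeline, and the order of the steps (stop-word removal before stemming) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Interface names follow the class list; the method signatures are assumed.
interface TokenizerI { List<String> tokenize(String text); }
interface StopListI { boolean isStopWord(String token); }
interface StemmingI { String stem(String token); }

/** Hypothetical stand-in for a pattern-based tokenizer: splits on non-letters. */
final class RegexSplitTokenizer implements TokenizerI {
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }
}

/** Hypothetical stop list backed by a fixed set of words. */
final class SetStopList implements StopListI {
    private final Set<String> words;
    SetStopList(String... words) { this.words = new HashSet<>(Arrays.asList(words)); }
    public boolean isStopWord(String token) { return words.contains(token); }
}

/** Behaves like the described default: no stemming, the token is returned as-is. */
final class NoOpStemming implements StemmingI {
    public String stem(String token) { return token; }
}

/** Chains the three steps: tokenize, drop stop words, stem what remains. */
final class LexicalPipeline {
    private final TokenizerI tokenizer;
    private final StopListI stopList;
    private final StemmingI stemmer;

    LexicalPipeline(TokenizerI tokenizer, StopListI stopList, StemmingI stemmer) {
        this.tokenizer = tokenizer;
        this.stopList = stopList;
        this.stemmer = stemmer;
    }

    List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenizer.tokenize(text)) {
            if (!stopList.isStopWord(token)) terms.add(stemmer.stem(token));
        }
        return terms;
    }
}
```

Because the pipeline depends only on the three interfaces, a Porter stemmer or the Rijsbergen stop list can be swapped in without touching the analyzer itself.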

Inverter

[Figure: Inverter]

InvertedListHelperI: the interface used to invert the basic inverted list file into the indexed inverted list file and the posting file, plus other utility functions
InvertedListHelperFactory: manages the InvertedListHelper instances
InvertedIndexMaker: a simple inverter wrapper that produces the indexed inverted list file

Searching Component


[Figure: Search Component]

Searcher: search component
SearchUI: the GUI of the search functionality.

Graphical User Interface

Normal Search Dialog:
Remote Data Collection Server Setting:

Table Columns View Setting:

Menu:

Document Show Dialog:

Compression

[Figure: Compression]

This project uses the Variable Compression scheme.
CompressI: the basic compression interface
CompressResult: the result of a compression
Delta: delta compression
Gamma: gamma compression
Unary: unary compression
Variable: variable compression
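
As an illustration of compressing document gaps, here is a minimal sketch that assumes "Variable Compression" means variable-byte encoding. That reading, the class name VariableByteCodec, and the byte layout (low 7 bits first, high bit set when more bytes follow) are assumptions. The project's Variable class may use a different convention.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical variable-byte codec for a sorted list of document IDs. */
final class VariableByteCodec {
    /** Encodes strictly increasing docIds as variable-byte gaps. */
    static byte[] encodeGaps(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int docId : sortedDocIds) {
            writeVInt(out, docId - previous); // store the gap, not the ID itself
            previous = docId;
        }
        return out.toByteArray();
    }

    /** Decodes the gaps and rebuilds the original docIds by summing them. */
    static int[] decodeGaps(byte[] data) {
        List<Integer> ids = new ArrayList<>();
        int value = 0;
        int shift = 0;
        int previous = 0;
        for (byte b : data) {
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {   // high bit clear: last byte of this number
                previous += value;
                ids.add(previous);
                value = 0;
                shift = 0;
            } else {
                shift += 7;          // high bit set: more bytes follow
            }
        }
        int[] result = new int[ids.size()];
        for (int i = 0; i < result.length; i++) result[i] = ids.get(i);
        return result;
    }

    // Writes a non-negative int, 7 bits per byte, least significant group first.
    private static void writeVInt(ByteArrayOutputStream out, int n) {
        while ((n & ~0x7F) != 0) {
            out.write((n & 0x7F) | 0x80);
            n >>>= 7;
        }
        out.write(n);
    }
}
```

Small gaps fit in one byte, so a posting list of frequent terms shrinks considerably compared with storing each document ID as a 4-byte integer. The same writeVInt step could be applied to the frequencies.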

Vector Space Model (TF/IDF)


[Figure: TF/IDF]


IDFI: the interface for calculating the IDF
TFI: the interface for calculating the TF
VSMFactory: the class that manages the IDF & TF. Every class that needs an instance of IDFI or TFI must obtain it from this class.
DefaultIDF: log (1 + N / Ti)
DefaultTF: 1 + log (1 + Fi)
• In this scheme, we can swap in a different TF or IDF without affecting other modules.
• For a large data collection (like the whole RCV1), a different algorithm may be used to select the top relevant documents, because the Cosine algorithm (taught in class) needs more space as the data collection grows. In this scheme, we can create a file that indexes, for every document, all of its keywords. With variable compression, it is a little smaller than the posting file.
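
The two default formulas and the factory idea can be sketched as below. The class and interface names follow the list above, but the method signatures and the helper TermWeight are assumptions. The sketch also assumes that N is the number of documents in the collection, Ti the number of documents containing term i, Fi the frequency of term i in a document, and that log is the natural logarithm.

```java
// Interface names follow the list above; the method signatures are assumed.
interface TFI { double tf(int termFrequency); }
interface IDFI { double idf(int totalDocs, int docsWithTerm); }

/** DefaultTF as stated: 1 + log(1 + Fi). */
final class DefaultTF implements TFI {
    public double tf(int termFrequency) {
        return 1.0 + Math.log(1.0 + termFrequency);
    }
}

/** DefaultIDF as stated: log(1 + N / Ti). */
final class DefaultIDF implements IDFI {
    public double idf(int totalDocs, int docsWithTerm) {
        return Math.log(1.0 + (double) totalDocs / docsWithTerm);
    }
}

/** Single access point for the current TF and IDF implementations. */
final class VSMFactory {
    private static TFI tf = new DefaultTF();
    private static IDFI idf = new DefaultIDF();

    private VSMFactory() { }

    static TFI getTF() { return tf; }
    static IDFI getIDF() { return idf; }
    static void setTF(TFI replacement) { tf = replacement; }
    static void setIDF(IDFI replacement) { idf = replacement; }
}

/** Hypothetical helper: the weight of one term in one document. */
final class TermWeight {
    static double weight(int termFrequency, int totalDocs, int docsWithTerm) {
        return VSMFactory.getTF().tf(termFrequency)
             * VSMFactory.getIDF().idf(totalDocs, docsWithTerm);
    }
}
```

Callers such as TermWeight never name DefaultTF or DefaultIDF directly, so calling VSMFactory.setTF with another implementation changes the weighting everywhere without edits to other modules.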