Information Retrieval System Design
Summary
Description
...System Configuration, Environments & Resources
...Architecture
......Indexing Component
......Search Component
...Parallel Searching --- Distributed Information Retrieval System
Implementation
...Lexical analyzer
...Inverter
...Searching Component
Graphical User Interface
Compression
Vector Space Model (TF/IDF)
Download the codes from HERE
Summary
The system includes the basic components of an information retrieval
system: an indexing component module and a searching component module.
As advanced functionality, it also includes a graphical user interface,
parallel search, compression, and distributed data storage.
Description
System Configuration, Environments & Resources
• Data Collection: Subset of Reuters Corpus Volume 1 (RCV1)
• Stemming Algorithm (Java version)
1) Porter Stemming Algorithm with some modifications
2) Modified interface (methods)
3) Modified for better performance in Java
• Stop List: Stop list from Van Rijsbergen's textbook with some
modifications
• XML/SGML analysis package
1) The SGML processing package from http://www.gkrueger.com/java/
with some modifications.
2) Error handling: if a file contains an SGML error, the package stops
processing.
• Compression
1) Uses Variable compression for the posting file. Both the document
gaps and the frequencies are compressed.
• VSM (Vector Space Model)
Architecture
The Information Retrieval System has 2 basic component modules:
Indexing and Searching.
They are related as follows: the indexing component generates the
inverted index files and other utility files that are used by the
searching component.
Indexing Component
The indexing component generates the inverted index file and some
associated files.
It includes:
• Lexical analyzer: Performs lexical analysis of the data collections
• Inverter: Generates the inverted index file.
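
The two steps above can be sketched in a few lines of Java. This is only an illustration under simplifying assumptions: the class name TinyInverter and its methods are hypothetical, the index is kept in memory rather than written to index and posting files, and no stemming or stop list is applied.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the project's Inverter.
final class TinyInverter {
    // term -> (document ID -> term frequency), both kept in sorted order
    private final Map<String, TreeMap<Integer, Integer>> index = new TreeMap<>();

    // Lexical analysis: lower-case the text and split on anything
    // that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Inversion: record one more occurrence of each term in this document.
    void add(int docId, String text) {
        for (String term : tokenize(text)) {
            index.computeIfAbsent(term, k -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum);
        }
    }

    // Posting list of a term; empty if the term never occurred.
    Map<Integer, Integer> postings(String term) {
        return index.getOrDefault(term, new TreeMap<>());
    }
}
```

After adding two documents, postings("oil") returns the IDs of the documents that contain "oil" together with the number of occurrences in each.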

Searching Component
• Information Retrieval Model: Vector Space Model (the Boolean Model
is also implemented, except that it does not sort the result documents
by relevance)
• Input: the search words, separated by spaces
• Output: the sorted documents that satisfy the Vector Space Model
expression
• Comments:
o In this design, the model can be changed without modifying much of
the code.
o In the Vector Space Model, every item can be searched in
parallel.
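
As a rough sketch of how such a search can work, the Java class below takes space-separated search words, accumulates a score per document, and returns the document IDs sorted by score. The class TinySearcher, the in-memory posting map, and the choice to skip document length normalization are assumptions made for brevity; the weights use the TF and IDF formulas listed later on this page, with natural logarithms assumed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not the project's Searcher.
final class TinySearcher {
    // term -> (document ID -> term frequency); lists are assumed non-empty
    private final Map<String, Map<Integer, Integer>> postings;
    private final int docCount;

    TinySearcher(Map<String, Map<Integer, Integer>> postings, int docCount) {
        this.postings = postings;
        this.docCount = docCount;
    }

    // Returns document IDs ordered by descending score (ties: ascending ID).
    List<Integer> search(String query) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : query.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            Map<Integer, Integer> list = postings.get(term);
            if (list == null) {
                continue; // term not in the index
            }
            double idf = Math.log(1.0 + (double) docCount / list.size());
            for (Map.Entry<Integer, Integer> e : list.entrySet()) {
                double tf = 1.0 + Math.log(1.0 + e.getValue());
                scores.merge(e.getKey(), tf * idf, Double::sum);
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }
}
```

Because each word contributes to the scores independently, the per-word loop is the natural place to split the work if the search is parallelized.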
Parallel Searching --- Distributed Information Retrieval System
In this scheme, the data collections, inverted index files, and
associated files are distributed across multiple servers. There are
different models for this scheme. In this project, I do not consider
partitioning of the data collection.
The distributed information retrieval system:
• Centralized data collections: the data collections are stored on one
server.
• Multiple copies of the inverted index files and associated files.
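
One way to use several index copies is to send each search word to a different copy and merge the partial scores. The sketch below is an assumption about how this could look, not the project's code: the class ParallelTermSearch is hypothetical, and each index replica is modeled as a plain function from a word to per-document scores instead of a real remote server.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration only; replicas stand in for remote index servers.
final class ParallelTermSearch {
    // replicas must contain at least one entry.
    static Map<Integer, Double> search(
            List<String> terms,
            List<Function<String, Map<Integer, Double>>> replicas) {
        ExecutorService pool = Executors.newFixedThreadPool(replicas.size());
        try {
            // Assign the words to the replicas round-robin and query in parallel.
            List<Future<Map<Integer, Double>>> partials = new ArrayList<>();
            for (int i = 0; i < terms.size(); i++) {
                String term = terms.get(i);
                Function<String, Map<Integer, Double>> replica =
                        replicas.get(i % replicas.size());
                Callable<Map<Integer, Double>> task = () -> replica.apply(term);
                partials.add(pool.submit(task));
            }
            // Merge the partial results by adding up the scores per document.
            Map<Integer, Double> total = new HashMap<>();
            for (Future<Map<Integer, Double>> partial : partials) {
                partial.get().forEach((doc, score) -> total.merge(doc, score, Double::sum));
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel search failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Since every replica holds a full copy of the index, any replica can answer for any word; only the merge step needs to see all partial results.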
Implementation
Lexical analyzer

• StemmingI: stemming interface; every specific stemming algorithm
can derive from this interface
• StopListI: stop list interface; every specific stop list algorithm
can derive from this interface
• TokenizerI: tokenizer interface; every specific tokenizer can derive
from this interface
• DefaultStemming: default stemming algorithm; performs no stemming
• DefaultStopList: default stop list algorithm; uses no stop list
• DefaultTokenizer: implements the tokenizer using StreamTokenizer
• RijsbergenStopList: uses the Van Rijsbergen stop list
• RegExpTokenizer: implements the tokenizer using the pattern method
• BasicLexicalAnalysis: basic lexical analysis. Every specific lexical
analyzer can derive from this class. The class only implements the
getNextToke
• LexicalAnalysisForProject: lexical analysis for the project
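
The value of the three interfaces is that the tokenizer, stop list, and stemmer can be swapped independently. The sketch below illustrates that idea; the interface names come from the list above, but every method signature, the class LexicalSketch, and its helpers are assumptions and may differ from the downloadable code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Interface names follow the list above; all method signatures are assumed.
interface TokenizerI { List<String> tokenize(String text); }
interface StopListI { boolean isStopWord(String token); }
interface StemmingI { String stem(String token); }

// Hypothetical illustration only.
final class LexicalSketch {
    // Counterparts of DefaultStemming and DefaultStopList: they do nothing.
    static final StemmingI NO_STEMMING = token -> token;
    static final StopListI NO_STOP_LIST = token -> false;

    // A regular-expression tokenizer: split on non-alphanumeric characters.
    static final TokenizerI REGEX_TOKENIZER = text -> {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    };

    // Builds a stop list from explicit words.
    static StopListI stopListOf(String... words) {
        Set<String> set = new HashSet<>(Arrays.asList(words));
        return set::contains;
    }

    // Tokenize, drop stop words, then stem what is left.
    static List<String> analyze(String text, TokenizerI tokenizer,
                                StopListI stopList, StemmingI stemmer) {
        List<String> terms = new ArrayList<>();
        for (String token : tokenizer.tokenize(text)) {
            if (!stopList.isStopWord(token)) {
                terms.add(stemmer.stem(token));
            }
        }
        return terms;
    }
}
```

Passing stopListOf("the") instead of NO_STOP_LIST changes the output without touching the tokenizer or the stemmer.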
Inverter

• InvertedListHelperI: the interface used for converting the basic
inverted list file into the indexed inverted list file and posting
file, plus other utility functions
• InvertedListHelperFactory: manages the InvertedListHelper
• InvertedIndexMaker: a simple Inverter wrapper for inverting the
indexed list file
Searching Component
• Searcher: search component
• SearchUI: the GUI of the search functionality.
Graphical User Interface
• Normal Search Dialog:
• Remote Data Collection Server Setting:

• Table Columns View Setting:

• Menu:

• Document Show Dialog:
Compression

This project uses the Variable compression scheme.
• CompressI: the basic compression interface
• CompressResult: the result of the compression
• Delta: delta compression
• Gamma: gamma compression
• Unary: unary compression
• Variable: Variable compression
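
The page does not spell out the Variable scheme. Assuming it refers to the common variable-byte code (7 data bits per byte, with the high bit marking the last byte of a number), gap compression of a sorted document ID list might look like the sketch below. The class VByteSketch and its method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; assumes "Variable" means variable-byte coding.
final class VByteSketch {
    // Writes one non-negative number, most significant 7-bit group first.
    // The high bit is set only on the last byte of the number.
    static void encodeNumber(int n, ByteArrayOutputStream out) {
        byte[] tmp = new byte[5]; // 5 groups of 7 bits cover a 31-bit value
        int i = 4;
        tmp[i] = (byte) ((n & 0x7F) | 0x80);
        n >>>= 7;
        while (n != 0) {
            tmp[--i] = (byte) (n & 0x7F);
            n >>>= 7;
        }
        out.write(tmp, i, 5 - i);
    }

    // Stores each document ID as the gap from the previous ID.
    static byte[] encodeDocIds(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int id : sortedDocIds) {
            encodeNumber(id - previous, out);
            previous = id;
        }
        return out.toByteArray();
    }

    // Reverses encodeDocIds: rebuilds the IDs by summing the decoded gaps.
    static List<Integer> decodeDocIds(byte[] bytes) {
        List<Integer> ids = new ArrayList<>();
        int n = 0;
        int previous = 0;
        for (byte b : bytes) {
            n = (n << 7) | (b & 0x7F);
            if ((b & 0x80) != 0) { // last byte of this gap
                previous += n;
                ids.add(previous);
                n = 0;
            }
        }
        return ids;
    }
}
```

With this code, the IDs 824, 829, 215406 become the gaps 824, 5, 214577 and take 6 bytes instead of the 12 bytes needed for three 4-byte integers. Term frequencies can be written with the same encodeNumber method.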
Vector Space Model (TF/IDF)

• IDFI: the interface for calculating the IDF
• TFI: the interface for calculating the TF
• VSMFactory: the class that manages the IDF & TF. Every class that
wants an instance of IDFI or TFI must get it from this class.
• DefaultIDF: log(1 + N / Ti)
• DefaultTF: 1 + log(1 + Fi)
• In this scheme, the TF & IDF can be changed without affecting other
modules.
• For a large data collection (like the whole RCV1), a different
algorithm may be needed to select the top relevant documents, because
in the Cosine algorithm (taught in class) the space required grows as
the data collection grows. In this scheme, we can create a file that
indexes, for every document, all of its keywords. Using the Variable
compression, it is a little smaller than the posting file.
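
The factory arrangement described above can be sketched as follows. The class and interface names match the list, but the method signatures, the configure method, and the reading of the symbols are assumptions: N is taken as the number of documents, Ti as the number of documents containing term i, Fi as the frequency of term i in a document, and log as the natural logarithm.

```java
// Names follow the list above; all method signatures are assumed.
interface IDFI { double idf(int totalDocs, int docsWithTerm); }
interface TFI { double tf(int frequency); }

final class DefaultIDF implements IDFI {
    // log(1 + N / Ti), natural logarithm assumed
    public double idf(int totalDocs, int docsWithTerm) {
        return Math.log(1.0 + (double) totalDocs / docsWithTerm);
    }
}

final class DefaultTF implements TFI {
    // 1 + log(1 + Fi), natural logarithm assumed
    public double tf(int frequency) {
        return 1.0 + Math.log(1.0 + frequency);
    }
}

// Single access point for the weighting functions, so that other
// modules never name a concrete TF or IDF class.
final class VSMFactory {
    private static IDFI idf = new DefaultIDF();
    private static TFI tf = new DefaultTF();

    private VSMFactory() {}

    static IDFI getIDF() { return idf; }
    static TFI getTF() { return tf; }

    // Hypothetical hook for swapping in other weighting functions.
    static void configure(IDFI newIdf, TFI newTf) {
        idf = newIdf;
        tf = newTf;
    }
}
```

A caller computes a term weight as VSMFactory.getTF().tf(f) * VSMFactory.getIDF().idf(n, t), and a call to configure changes the weighting everywhere without edits to the callers.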