| CMSC 698: Information Retrieval System Design
 
 Summary
 Description
 ...System Configuration, Environment & Resources
 Main Ideas
 ...Using Sub Data Collection
 ...Data Structure of Posting List
 
 Download the code from HERE
 Summary
 The system includes:
 1) Basic information retrieval system component modules: an indexing component module and a searching component module.
 2) Dynamic data collection IR functionalities
 • Adding data files to the IR system
 • Removing data files from the IR system
 • Coalescing the IR system (under debugging)
 3) Other functionalities
 • GUI for the system
 • Dynamic selection of TF/IDF during the query stage
 • Selection of query groups during the query stage
 Description
 System Configuration, Environment & Resources
 • Data Collections: Subset of Reuters Corpus Volume 1 (RCV1)
 1) All the documents in RCV1 have been coded for topic, region, and industry sector. The topic codes represent the subject area(s) of each document. They are organized into four hierarchical groups, each under its own top-level node.
 2) Each group is indexed as its own sub data collection. The Reuters Corpus is organized into 5 sub data collections (the four topic groups plus Others):
 --- Corporate/Industrial (CCAT)
 --- Economics (ECAT)
 --- Government/Social (GCAT)
 --- Markets (MCAT)
 --- Others (Others)
 • Stemming Algorithm (Java Version)
 1) Porter Stemming Algorithm with some modifications
 2) Modified interface (methods)
 3) Performance modifications for the Java language
 • Stop Lists: Stop list from Van Rijsbergen's textbook with some modifications
 • XML/SGML analysis package
 1) The SGML processing package from http://www.gkrueger.com/java/ with some modifications.
 2) Error handling: if a file contains an SGML error, the package stops processing.
 • Compression
 1) None
 • VSM (Vector Space Model)
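 The page names the Vector Space Model and TF/IDF weighting but does not give the formulas. The sketch below shows one common variant, assuming raw term frequency multiplied by ln(N / df) and cosine similarity for ranking. The class VsmSketch and all of its method names are hypothetical and are not taken from the project's code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: TF-IDF weighting and cosine similarity for a Vector Space Model. */
final class VsmSketch {

    /** idf = ln(N / df). Assumes 0 < docFreq <= totalDocs. */
    static double idf(int totalDocs, int docFreq) {
        return Math.log((double) totalDocs / docFreq);
    }

    /** Turns raw term frequencies into TF-IDF weights; terms with no known df are skipped. */
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int totalDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            Integer df = docFreqs.get(e.getKey());
            if (df != null && df > 0) {
                weights.put(e.getKey(), e.getValue() * idf(totalDocs, df));
            }
        }
        return weights;
    }

    /** Cosine of the angle between two sparse vectors; 0 if either vector has zero length. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    /** Euclidean length of a sparse vector. */
    private static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```

 A query would be weighted with the same weigh method and compared against each candidate document with cosine. Switching TF/IDF on or off at query time, as the summary mentions, could then amount to choosing between raw frequencies and the weighted map.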
 Main Ideas
 Using Sub Data Collection
 Each sub collection is the group of documents for a specific topic. Each sub collection has its own inverted index file, keyword index file, and document info file.
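 As a rough illustration of this layout, the sketch below models one sub collection as a descriptor holding its three files, plus a small registry that lets a query target only the selected groups. The class names, the directory layout, and the file names (inverted.idx, keyword.idx, docinfo.dat) are assumptions made for the example. So is the idea that the keyword index file maps each keyword to the position of its posting list.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical descriptor for one sub collection and its three files. */
final class SubCollection {
    final String code;            // group code, e.g. "CCAT"
    final Path invertedIndexFile; // posting lists
    final Path keywordIndexFile;  // assumed: keyword -> position of its posting list
    final Path docInfoFile;       // per-document information

    SubCollection(Path root, String code) {
        this.code = code;
        Path dir = root.resolve(code);
        // File names are illustrative assumptions, not the project's actual names.
        this.invertedIndexFile = dir.resolve("inverted.idx");
        this.keywordIndexFile = dir.resolve("keyword.idx");
        this.docInfoFile = dir.resolve("docinfo.dat");
    }
}

/** Hypothetical registry used to pick the sub collections a query should search. */
final class SubCollectionRegistry {
    private final Map<String, SubCollection> byCode = new LinkedHashMap<>();

    SubCollectionRegistry(Path root, List<String> codes) {
        for (String code : codes) {
            byCode.put(code, new SubCollection(root, code));
        }
    }

    /** Returns the sub collections whose codes were selected, in registration order. */
    List<SubCollection> select(Set<String> selectedCodes) {
        List<SubCollection> result = new ArrayList<>();
        for (SubCollection sc : byCode.values()) {
            if (selectedCodes.contains(sc.code)) {
                result.add(sc);
            }
        }
        return result;
    }
}
```

 Keeping the files separate per group means that adding or removing documents touches only one group's index, and a query limited to some groups never has to open the others.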
 Data Structure of Posting List (for document updates)
 Redundant space + overflow list
 (Always keep redundant space equal to 2% - 5% of the posting list length)
 1) Add a new document
 a) Parse the document and get the triples (word, docid, freq).
 b) Insert each (docid, freq) pair into the posting list whose keyword is “word”:
 i. If the redundant space is large enough to hold the pair, put it in the posting list.
 ii. If the redundant space is too small, allocate overflow space (5% of the posting list length) for this posting list and insert the pair into the overflow space.
 iii. For a new keyword, only allocate a new posting list and append it to the inverted index file.
 iv. Append any new overflow list to the inverted index file and update the pointer that links to it.
 2) Delete a document
 a) Parse the document and get the pairs (word, docid).
 b) Delete each (docid, freq) pair from the posting list whose keyword is “word”.
 c) The posting list does not need to be shrunk.
 3) Modify a document (can be implemented with the delete and add operations)
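 The add, delete, and modify steps above can be sketched in memory as follows. This is a simplified model and not the project's file format. The assumptions are: blocks are plain arrays, the next field stands in for the on-disk pointer, slack is fixed at 5% with at least one spare slot, and postings are not kept sorted by docid. All class and method names are hypothetical.

```java
/** Hypothetical in-memory model of one keyword's posting list with slack and overflow blocks. */
final class PostingList {

    /** Fixed-capacity block of (docId, freq) pairs; 'next' models the on-disk pointer. */
    private static final class Block {
        final int[] docIds;
        final int[] freqs;
        int used;
        Block next;

        Block(int capacity) {
            docIds = new int[capacity];
            freqs = new int[capacity];
        }
    }

    private static final double SLACK = 0.05; // upper end of the 2% - 5% range (assumption)
    private final Block head;
    private Block tail;
    private int size;       // live postings across all blocks
    private int blocks = 1; // main block plus overflow blocks
    private int capacity;   // total slots across all blocks

    /** Reserves room for the expected postings plus slack. */
    PostingList(int expectedPostings) {
        capacity = expectedPostings + slackFor(expectedPostings);
        head = tail = new Block(capacity);
    }

    /** Slack for a list of the given length; at least one slot (assumption). */
    private static int slackFor(int length) {
        return Math.max(1, (int) Math.ceil(length * SLACK));
    }

    /** Add: reuse free space if there is any, otherwise chain a new overflow block. */
    void add(int docId, int freq) {
        Block b = head;
        while (b != null && b.used == b.docIds.length) {
            b = b.next;
        }
        if (b == null) {
            b = new Block(slackFor(size)); // overflow sized at 5% of the current list
            tail.next = b;                 // link the new block ("correct the pointer")
            tail = b;
            blocks++;
            capacity += b.docIds.length;
        }
        b.docIds[b.used] = docId;
        b.freqs[b.used] = freq;
        b.used++;
        size++;
    }

    /** Delete: remove the pair in place; capacity is kept, so the freed slot becomes slack. */
    boolean delete(int docId) {
        for (Block b = head; b != null; b = b.next) {
            for (int i = 0; i < b.used; i++) {
                if (b.docIds[i] == docId) {
                    b.used--;
                    b.docIds[i] = b.docIds[b.used]; // move the block's last pair into the hole
                    b.freqs[i] = b.freqs[b.used];
                    size--;
                    return true;
                }
            }
        }
        return false;
    }

    /** Modify: delete followed by add. */
    void modify(int docId, int newFreq) {
        delete(docId);
        add(docId, newFreq);
    }

    /** Returns the stored frequency, or -1 if the document is not in this list. */
    int freqOf(int docId) {
        for (Block b = head; b != null; b = b.next) {
            for (int i = 0; i < b.used; i++) {
                if (b.docIds[i] == docId) {
                    return b.freqs[i];
                }
            }
        }
        return -1;
    }

    int size() { return size; }
    int blockCount() { return blocks; }
    int capacity() { return capacity; }
}
```

 Because delete never shrinks a block, a later add can land in the freed slot without allocating anything. This is one reason skipping the shrink step is cheap when documents are frequently modified.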
    |