System Summary and Timing

Organization Name: CLARITECH Corporation
List of Run IDs: CLARTF, CLARTN

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to Build Data Structures
- Length (in words) of the stopword list: N/A
- Controlled Vocabulary?: N/A
- Stemming Algorithm: N/A
- Morphological Analysis: A comprehensive inflectional morphology is used to produce word roots. Participles are retained in surface forms, although normalization is possible. No derivational morphology is used.
- Term Weighting:
  1) TF*IDF over phrases is used for retrieval.
  2) An importance coefficient is applied to TF*IDF for query terms.
  3) A combination of statistics, linguistic-structure analysis, and heuristics is used for thesaurus extraction; the statistical measures include frequency and distribution.
- Phrase Discovery?: Yes.
- Kind of Phrase: Simplex noun phrases, excluding post-nominal appositive, prepositional, and participial phrases and relative clauses.
- Method Used (statistical, syntactic, other): A deterministic, rule-based parser nominates linguistic-constituent structure; a filter retains only simplex noun phrases for indexing purposes.
- Syntactic Parsing?: Yes (see above).
- Word Sense Disambiguation?: The parser grammar includes heuristics for syntactic-category disambiguation.
- Heuristic Associations (including short definition)?: No.
- Spelling Checking (with manual correction)?: No.
- Spelling Correction?: No.
- Proper Noun Identification Algorithm?: Words not identified in the lexicon (about 100,000 root forms of English) are assumed to be "candidate proper nouns". The grammar accommodates structure that includes proper names; this technique does not require case-sensitive clues (e.g., capitalization) to be effective.
- Tokenizer?: None.
- Manually-Indexed Terms?: No.
- Other Techniques for building Data Structures:
  1) Thesaurus discovery, which we use for query-vector augmentation, involves identifying the core characteristic terminology of a document set. The process ranks all terms according to scores along several parameters and then selects the subset of terminology that optimizes those scores.
  2) Document Windows: documents are decomposed into overlapping windows of 50 terms each. These windows are used in thesaurus extraction and as a component of the document-similarity calculation. (A sketch of this decomposition appears below.)
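The windowing step is simple enough to illustrate directly. The following is a minimal sketch, not the CLARIT implementation: the 50-term window length is taken from the description above, but the 25-term stride (50% overlap) is an assumption, since the amount of overlap is not stated in this summary.

```python
def document_windows(tokens, size=50, stride=25):
    """Yield overlapping windows of `size` terms from a token list.

    size=50 follows the description above; stride=25 (50% overlap)
    is an assumption, as the actual overlap used is not stated.
    """
    if len(tokens) <= size:
        yield tokens                      # a short document is one window
        return
    for start in range(0, len(tokens) - size + 1, stride):
        yield tokens[start:start + size]
    if (len(tokens) - size) % stride != 0:
        yield tokens[-size:]              # cover the tail of the document

# Example: a 120-term document yields windows starting at terms 0, 25, 50,
# plus a final tail window starting at term 70.
windows = list(document_windows([f"t{i}" for i in range(120)]))
```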
Statistics on Data Structures built from TREC Text

- Inverted index: In the version of the system configured for experimentation, we do not build an inverted index for the whole corpus; we simply index the formal query structures, expanded with statistical information from the full corpus. This allows us to change parameters quickly at all levels of the system.
  - Run ID: CLARTF
  - Total Storage (in MB): 5.1
  - Total Computer Time to Build (in hours): 10.5 (It takes approximately 8.75 minutes to invert the full set of 50 query vectors and merge in the global corpus statistics. The figure of 10.5 hours represents the time required to collect term-distribution statistics for the entire corpus, which needs to be performed only once for all of the runs.)
  - Automatic Process? (If not, number of manual hours): Yes.
  - Use of Term Positions?: No.
  - Only Single Terms Used?: All terms used.
  - Run ID: CLARTN
  - Total Storage (in MB): 5.1
  - Total Computer Time to Build (in hours): 10.5 (For the CLARTN run, processing the query vectors took approximately 9.25 minutes. For more information on index creation, see above.)
  - Automatic Process? (If not, number of manual hours): Yes.
  - Use of Term Positions?: No.
  - Only Single Terms Used?: All terms used.
- Clusters: None.
- N-grams, Suffix arrays, Signature Files: None.
- Knowledge Bases: None.
- Use of Manual Labor
- Special Routing Structures: N/A
- Other Data Structures built from TREC text
  - Run ID: CLARTF
  - Type of Structure: Automatic Feedback Thesaurus
  - Total Storage (in MB): 0.14
  - Total Computer Time to Build (in hours): 4.51 (This includes 4.33 hours to perform an initial retrieval pass and 0.19 hours to construct thesauri from those results. Note that the initial retrieval pass has to be executed only once for both the feedback and distractor thesauri.)
- Other Data Structures built from TREC text
  - Run ID: CLARTF and CLARTN
  - Type of Structure: Automatic Distractor Thesaurus
  - Total Storage (in MB): 2.7
  - Total Computer Time to Build (in hours): 6.97 (This includes 4.33 hours to perform the initial retrieval pass and 2.64 hours for extraction of the thesauri. See Automatic Feedback Thesaurus, above. Since the initial query vectors for CLARTN and CLARTF are identical, the two runs are able to use the same distractor thesauri.)
  - Brief Description of Method: A very large, first-order thesaurus is constructed from a large sample of relatively high-scoring but probably non-relevant document windows. (The discovered terms are regarded as "found distractors.") For TREC-4, we used a sample of 500 windows from ranks 37,500 through 39,000 of the initial retrieval over the corpus. The thesaurus was extracted at the 70-percent threshold.
- Other Data Structures built from TREC text
  - Run ID: CLARTN
  - Type of Structure: Automatic Feedback Thesaurus
  - Total Storage (in MB): 0.17
  - Total Computer Time to Build (in hours): 4.66 (This includes 4.33 hours to perform an initial retrieval pass and 0.33 hours to construct thesauri from those results. Note that the initial retrieval pass from the CLARTF run is re-used here.)
  - Automatic Process? (If not, number of manual hours): Yes.
  - Brief Description of Method: A thesaurus is constructed as described for the CLARTF run, but no required-terms filter is applied. The 50 top-scoring document windows are submitted directly to thesaurus extraction.
- Other Data Structures built from TREC text
  - Run ID: CLARTN and CLARTF
  - Type of Structure: Sampled Distractor Terms
  - Total Storage (in MB): 0.15
  - Total Computer Time to Build (in hours): 0.01
  - Automatic Process? (If not, number of manual hours): Yes.
  - Brief Description of Method: The 2,000 most frequent words in the entire retrieval corpus were taken as a representative sample for purposes of initial ad-hoc querying. (A sketch of this selection appears below.)
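The sampled distractor terms amount to a simple frequency cut. The sketch below assumes the corpus is available as an iterable of token lists; the figure of 2,000 is from the description above, but the tokenization and any normalization are assumptions.

```python
from collections import Counter

def sampled_distractor_terms(documents, n=2000):
    """Return the n most frequent words over the whole corpus.

    `documents` is assumed to be an iterable of token lists; how terms
    were tokenized and normalized is not specified in this summary.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]
```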
Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Type of File (thesaurus, knowledge base, lexicon, etc.): Lexicon
  - Total Storage (in MB): 1.75
  - Number of Concepts Represented: 90,057 root forms
  - Type of Representation: Root/syntactic-category pairs
  - Total Computer Time to Build (in hours): N/A
  - Total Manual Time to Build (in hours): The CLARIT lexicon was manually constructed during the early phases of the CLARIT research project (1988-89), using word lists extracted from on-line sources.
  - Total Manual Time to Modify for TREC (if already built): No modification was required.
- Use of Manual Labor
  - Mostly Manually Built using Special Interface: Yes.
- Externally-built Auxiliary File

Query Construction

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: Title and description
- Average Time to Build Query (in Minutes): 15 min./query
- Type of Query Builder
  - Domain Expert: No.
  - Computer System Expert: Yes.
- Tools Used to Build Query
  - Word Frequency List?: No.
  - Knowledge Base Browser?: No.
  - Other Lexical Tools?: No.
  - CLARIT Retrieval System: Candidate query vectors are suggested by automatic query construction using CLARIT parsing; these vectors are then manually refined by the user. Initial query performance is evaluated on other databases, including data from TREC disk 1.
- Method Used in Query Construction
  - Term Weighting?: Topic terms are weighted using TF*IDF * Importance, where the importance coefficient for query source terminology is assigned manually.

Searching

Search Times
- Computer Time to Search (Average per Query, in CPU seconds): The querying program processed all 50 topics (either routing or ad hoc) in parallel on five machines, without the use of any inverted index or postings file. In addition, this simple prototype architecture required two passes: one over full documents and one over document windows. Processing took approximately 6 hours.

Machine Searching Methods
- Vector Space Model?: Yes, using a cosine distance measure. The vector space does not fix the document-vector length component of the cosine formula; rather, the length of a document vector is allowed to vary with the terms present in the query. Only the terms in the query vector are considered 'active' for a given distance calculation, and all other terms in the document are ignored. Since query vectors may include zero-coefficient distractor terms, each query can specify its own list of 'active terminology', allowing query-specific control of the document representation. (The sketch following the Factors in Ranking list illustrates this behavior.)
- Other: Document windows used for feedback in the CLARTF run are subjected to a regular-expression search for required terms, as described above.

Factors in Ranking
- Term Frequency?: TF_AUG = 0.5 + 0.5 * (TF / MAX_TF)
- Inverse Document Frequency?: IDF = log2(number of documents in the corpus / number of documents containing the term) + 1
- Other Term Weights?: An importance coefficient is manually assigned to initial query terms (see above).
- Semantic Closeness?: No.
- Position in Document?: No.
- Syntactic Clues?: No.
- Proximity of Terms?: No.
- Information Theoretic Weights?: No.
- Document Length?: Only as implicitly captured in the cosine distance measure.
- Percentage of Query Terms which Match?: Only as implicitly captured in the cosine distance measure.
- N-gram Frequency?: No.
- Word Specificity?: No.
- Word Sense Frequency?: No.
- Cluster Distance?: No.
- Other: The final score for a document is computed as the weighted average of the score of the highest-scoring window in the document and the score of the complete document vector: score = (20 * document_score + max_window_score) / 21. (The sketch below combines these ranking factors.)
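The ranking factors above can be combined into a single illustrative calculation. The following is a minimal sketch under stated assumptions, not the CLARIT implementation: the TF_AUG, IDF, importance, and final-score formulas are taken from this section, while the data structures, function names, and the exact form of the query-side TF are our own. The handling of zero-importance distractor terms (contributing to document-vector length but not to the dot product) follows the prose description of 'active terminology' above.

```python
import math

def tf_aug(tf, max_tf):
    # Term Frequency factor: TF_AUG = 0.5 + 0.5 * TF / MAX_TF
    return 0.5 + 0.5 * tf / max_tf

def idf(n_docs, doc_freq):
    # Inverse Document Frequency: IDF = log2(N / df) + 1
    return math.log2(n_docs / doc_freq) + 1

def query_restricted_cosine(query, doc_tf, n_docs, doc_freq):
    """Cosine similarity over the query's 'active' terms only.

    `query` maps term -> (tf, importance). Zero-importance entries play
    the role of distractor terms: they add nothing to the dot product
    but do extend the document-vector length, giving the query control
    over the document representation as described above.
    """
    max_qtf = max(tf for tf, _ in query.values())
    max_dtf = max(doc_tf.values()) if doc_tf else 1
    dot = q_len = d_len = 0.0
    for term, (qtf, importance) in query.items():
        w = idf(n_docs, doc_freq.get(term, 1))
        qw = tf_aug(qtf, max_qtf) * w * importance
        dw = tf_aug(doc_tf[term], max_dtf) * w if term in doc_tf else 0.0
        dot += qw * dw
        q_len += qw * qw
        d_len += dw * dw      # terms absent from the query are ignored
    if q_len == 0.0 or d_len == 0.0:
        return 0.0
    return dot / (math.sqrt(q_len) * math.sqrt(d_len))

def final_score(document_score, max_window_score):
    # score = (20 * document_score + max_window_score) / 21
    return (20.0 * document_score + max_window_score) / 21.0
```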
Machine Information
- Machine Type for TREC Experiment: 4 x DEC 3000/400 and 1 x DEC 3000/600 (Alpha/AXP), running DEC OSF/1
- Was the Machine Dedicated or Shared: Dedicated
- Amount of Hard Disk Storage (in MB): 15,000
- Amount of RAM (in MB): 3 @ 128 MB, 2 @ 64 MB
- Clock Rate of CPU (in MHz): 4 @ 133.33 MHz, 1 @ 175 MHz

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: The system used for TREC-4 processing was configured for flexibility rather than speed. Essentially, it is designed to store almost all intermediate results and to allow new modules to be added dynamically.
- Given appropriate resources
  - Could your system run faster?: Yes, by pinning down a set of configuration options and using the normal CLARIT retrieval modules.
  - By how much (estimate)?: The standard CLARIT system indexes at a rate of 100 megabytes/hour on a 100-MHz processor; queries of 30-50 terms over 1-2 gigabytes of data typically require less than 5 CPU seconds.
- Features the System is Missing that would be Beneficial: The CLARIT TREC-4 system did not take advantage of several processing options that might have improved results, including feedback-term re-weighting, tokenization, sub-lexicon discovery over training sets, and EQ-class discovery for thesaural terms.