System Summary and Timing
  Organization Name: IBM T. J. Watson Research Center, Human Language Technologies
  List of Run IDs: ibms96a, ibms96b

  Construction of Indices, Knowledge Bases, and other Data Structures

    Methods Used to build Data Structures

    - Length (in words) of the stopword list: 940  
    - Controlled Vocabulary?: No 
    - Stemming Algorithm: Yes              
      - Morphological Analysis: based on POS tagger  
    - Term Weighting: Yes  
    - Phrase Discovery?: Yes
      - Kind of Phrase: bigrams
      - Method Used (statistical, syntactic, other): all bigrams within a fixed window whose words are in the query vocabulary
    - Syntactic Parsing?: No
    - Word Sense Disambiguation?: No
    - Heuristic Associations (including short definition)?: No
    - Spelling Checking (with manual correction)?: No
    - Spelling Correction?: No
    - Proper Noun Identification Algorithm?: No
    - Tokenizer?: Yes
      - Patterns which are tokenized: words with punctuation attached
    - Manually-Indexed Terms?: No

    Statistics on Data Structures built from TREC Text

    - Inverted index             
      - Run ID: ibms96a, ibms96b 
      - Total Storage (in MB): 697  
      - Total Computer Time to Build (in hours): 2 
      - Automatic Process? (If not, number of manual hours): Yes 
      - Use of Term Positions?: No 
      - Only Single Terms Used?:  Yes 
    - Clusters             
    - N-grams, Suffix arrays, Signature Files             
      - Run ID:  ibms96a, ibms96b 
      - Total Storage (in MB):  9 
      - Total Computer Time to Build (in hours): 10 
      - Automatic Process? (If not, number of manual hours): Yes 
      - Brief Description of Method: Bigrams within 6-word window 
    - Knowledge Bases             
      - Use of Manual Labor                  
        - Mostly Manually Built using Special Interface:                  
        - Mostly Machine Built with Manual Correction:                  
    - Special Routing Structures             
    - Other Data Structures built from TREC text             
      - Run ID: ibms96b  
      - Type of Structure: histograms  
      - Total Storage (in MB): 3,000 
      - Total Computer Time to Build (in hours):  24 
      - Automatic Process? (If not, number of manual hours): Yes  
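
    The bigram structure described above (word pairs within a 6-word window, restricted to the query vocabulary) might be collected roughly as follows. This is only a minimal sketch: the function name, the use of ordered pairs, and the raw-count representation are assumptions made for illustration, not details taken from the system itself.

```python
from collections import Counter


def window_bigrams(tokens, vocabulary, window=6):
    """Count ordered word pairs that co-occur within `window` tokens.

    Only pairs whose two words are both in `vocabulary` are kept.
    Treating pairs as ordered is an assumption of this sketch.
    """
    counts = Counter()
    for i, first in enumerate(tokens):
        if first not in vocabulary:
            continue
        # Look ahead at the remaining window - 1 positions.
        for second in tokens[i + 1 : i + window]:
            if second in vocabulary:
                counts[(first, second)] += 1
    return counts


if __name__ == "__main__":
    text = "the stock market fell as interest rates rose sharply".split()
    print(window_bigrams(text, {"stock", "market", "rates"}))
```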

  Query construction

    Automatically Built Queries (Ad-Hoc)

    - Topic Fields Used:  Description 
    - Average Computer Time to Build Query (in cpu seconds): 3 
    - Method used in Query Construction           
      - Term Weighting (weights based on terms in topics)?: Yes  
      - Phrase Extraction from Topics?: Yes
      - Syntactic Parsing of Topics?: No 
      - Word Sense Disambiguation?: No 
      - Proper Noun Identification Algorithm?: No 
      - Tokenizer?: Yes
        - Patterns which are Tokenized: words with punctuation attached
      - Heuristic Associations to Add Terms?: No 
      - Expansion of Queries using Previously-Constructed Data Structure?: Yes
        - Structure Used: unigrams from top 10 docs of initial ranking
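
    The expansion step listed above (unigrams from the top 10 documents of an initial ranking) can be pictured with the sketch below. Selecting new terms by raw frequency, the number of terms added, and all names are assumptions for illustration; the form does not say how the unigrams were chosen or weighted.

```python
from collections import Counter


def expand_query(query_terms, ranked_docs, top_docs=10, num_new_terms=5,
                 stopwords=frozenset()):
    """Add frequent unigrams from the top-ranked documents to the query.

    ranked_docs is a list of token lists, best match first.
    Frequency-based selection is an assumption of this sketch.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(
            t for t in doc if t not in stopwords and t not in query_terms
        )
    new_terms = [term for term, _ in counts.most_common(num_new_terms)]
    return list(query_terms) + new_terms


if __name__ == "__main__":
    docs = [["bank", "merger", "bank"], ["bank", "merger", "loan"]]
    print(expand_query(["bank"], docs, num_new_terms=2))
```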

  Searching

    Search Times

      - Run ID: ibms96a  
      - Computer Time to Search (Average per Query, in CPU seconds): 230 
      - Run ID: ibms96b  
      - Computer Time to Search (Average per Query, in CPU seconds): 900 

    Machine Searching Methods

      - Vector Space Model?: Yes  
      - Probabilistic Model?: Yes 
      - N-gram Matching?: Yes  

    Factors in Ranking

      - Term Frequency?: Yes  
      - Inverse Document Frequency?: Yes  
      - Document Length?: Yes  
      - Percentage of Query Terms which match?: No
      - N-gram Frequency?: Yes 
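
    To show how term frequency, inverse document frequency, and document length can combine in one score, the sketch below uses an Okapi BM25-style formula. BM25 is swapped in purely for illustration: the form does not state the actual ranking formula, and the parameter values k1 and b, along with every name here, are assumptions. N-gram frequency is not covered.

```python
import math


def bm25_score(query_terms, doc_tokens, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with a BM25-style formula.

    doc_freq maps a term to the number of documents containing it.
    """
    score = 0.0
    doc_len = len(doc_tokens)
    for term in set(query_terms):
        tf = doc_tokens.count(term)  # term frequency in this document
        if tf == 0:
            continue
        df = doc_freq.get(term, 0)
        # Rare terms get a larger inverse-document-frequency weight.
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Longer-than-average documents are penalized through this term.
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```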

    Machine Information

    - Machine Type for TREC Experiment: RS6000  
    - Was the Machine Dedicated or Shared: Shared 
    - Amount of Hard Disk Storage (in MB): 100,000  
    - Amount of RAM (in MB): 200  
    - Clock Rate of CPU (in MHz): 130  

    System Comparisons

    - Amount of "Software Engineering" which went into the Development of the System: This is a customized system requiring a large amount of software engineering.
    - Given appropriate resources  
      - Could your system run faster?: Yes  
      - By how much (estimate)?: 2  
    - Features the System is Missing that would be beneficial: Word-sense disambiguation, syntactic parsing