System Summary and Timing

Organization Name: GSI-Erli
List of Run ID's: erliR1, erliA1

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 502
- Controlled Vocabulary?: no
- Stemming Algorithm: no
- Morphological Analysis: no
- Term Weighting: no
- Phrase Discovery?:
- Tokenizer?:

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: erliR1
  - Total Storage (in MB): 239
  - Total Computer Time to Build (in hours): 6:24 (elapsed)
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: yes
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: erliA1
  - Total Storage (in MB): 1,187
  - Total Computer Time to Build (in hours): 37:37 (elapsed)
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: yes
  - Only Single Terms Used?: yes
- Clusters
  - Automatic Process? (If not, number of manual hours):
- N-grams, Suffix Arrays, Signature Files
  - Automatic Process? (If not, number of manual hours):
- Knowledge Bases
  - Automatic Process? (If not, number of manual hours):
  - Use of Manual Labor
    - Mostly Manually Built using Special Interface:
- Special Routing Structures
  - Automatic Process? (If not, number of manual hours):
- Other Data Structures built from TREC text
  - Automatic Process? (If not, number of manual hours):

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): domain-independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): lexicon
  - Total Storage (in MB): 4
  - Number of Concepts Represented: 45,000 lemmas
  - Type of Representation: morphosyntactic information
  - Total Manual Time to Build (in hours): thousands
  - Total Manual Time to Modify for TREC (if already built): 12 hours
  - Use of Manual Labor
    - Mostly Manually Built using Special Interface: yes
    - Mostly Machine Built with Manual Correction:
- Externally-built Auxiliary File
  - Type of File (Treebank, WordNet, etc.): part of WordNet
  - Total Storage (in MB): 4
  - Number of Concepts Represented: 36,000
  - Type of Representation: semantic net

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: the 'topics' used for automatic query building are manually written plain English sentences (cf. the 'Manually Constructed Queries' sections below)
- Average Computer Time to Build Query (in CPU seconds): 5.35 (elapsed; CPU times not available)
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: yes
  - Syntactic Parsing of Topics?: yes
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?: yes
    - Structure Used: lexicon and semantic net
  - Automatic Addition of Boolean Connectors or Proximity Operators?: yes

Automatically Built Queries (Routing)
- Topic Fields Used: the 'topics' used for automatic query building are manually written plain English sentences (cf. the 'Manually Constructed Queries' sections below)
- Average Computer Time to Build Query (in CPU seconds): 7.12 (elapsed; CPU times not available)
- Method used in Query Construction
  - Terms Selected From
    - Topics: yes
  - Term Weighting with Weights Based on terms in
    - Topics: yes
  - Phrase Extraction from
    - Topics: yes
  - Syntactic Parsing of
    - Topics: yes
  - Word Sense Disambiguation using:
  - Proper Noun Identification Algorithm from:
  - Tokenizer:
  - Heuristic Associations to Add Terms from:
  - Expansion of Queries using Previously-Constructed Data Structure: yes
    - Structure Used: lexicon and semantic net
  - Automatic Addition of Boolean Connectors or Proximity Operators using information from
    - Topics: yes (result of the syntactic parse)

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: mainly title and description
- Average Time to Build Query (in minutes): 5
- Type of Query Builder
  - Computer System Expert: yes
- Tools used to Build Query
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Method used in Query Construction
  - Addition of Terms not Included in Topic?:
  - Other: construction of a plain English sentence close to a realistic (i.e. short) natural language query

Manually Constructed Queries (Routing)
- Topic Fields Used: mainly title and description
- Average Time to Build Query (in minutes): 10
- Type of Query Builder
  - Computer System Expert: yes
- Tools used to Build Query
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Data Used for Building Query from:
- Method used in Query Construction
  - Addition of Terms not Included in Topic?:
  - Other: construction of a plain English sentence close to a realistic (i.e. short) natural language query

Searching

Search Times
- Run ID: erliR1
- Computer Time to Search (Average per Query, in CPU seconds): 64 (elapsed; CPU times not available)
- Component Times: 11% for query generation, 89% for search

Search Times
- Run ID: erliA1
- Computer Time to Search (Average per Query, in CPU seconds): 312 (elapsed; CPU times not available)
- Component Times: 2% for query generation, 98% for search

Machine Searching Methods
- Boolean Matching?: yes

Factors in Ranking
- Other Term Weights?: yes; phrases built via syntactic analysis are weighted higher than single words, and words are weighted according to their polysemy count (a sketch of this weighting heuristic appears at the end of this summary)

Machine Information
- Machine Type for TREC Experiment: Sun SPARCstation 4
- Was the Machine Dedicated or Shared: shared
- Amount of Hard Disk Storage (in MB): 4,000 + 4,000 (two disks)
- Amount of RAM (in MB): 32
- Clock Rate of CPU (in MHz): 70

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: 1 man-year (starting from previously built general-purpose NLP resources)
- Given Appropriate Resources
  - Could your system run faster?: most probably (this depends on the search engine used; cf. 'Significant Areas of System' below). Improvements can be obtained in query generation time, but that is a small part of the process; more powerful hardware would also help.
- Features the System is Missing that would be Beneficial: use of semantic closeness to refine weights improves precision (tests were made after the TREC deadline)

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by the above questions: the system is purely query-side and acts as a linguistic add-on to various search engines (which are in charge of full-text indexing and search). It performs a linguistic analysis of the query, then automatically builds an expanded boolean query with word forms, operators, and weights; this query is then passed to the search engine (a sketch of this pipeline appears at the end of this summary). Topic (a trademark of Verity Inc.) was the search engine used for the TREC experiment.
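The following is a minimal sketch of the query-side pipeline described under 'Significant Areas of System': a plain English sentence is analysed, each content term is expanded with alternative word forms from the lexicon and related terms from the semantic net, and a weighted boolean query is emitted for an external search engine. The data structures, the example expansion tables, and the generic AND/OR output notation are illustrative assumptions only; the actual system passed its queries to Verity's Topic engine, whose query syntax is not reproduced here.

```python
# Illustrative sketch only: expand content terms with word forms (lexicon) and
# related terms (semantic net), then emit a weighted boolean query string.
from dataclasses import dataclass

# Hypothetical stand-ins for the 45,000-lemma lexicon and the semantic net.
WORD_FORMS = {"take": ["take", "takes", "took", "taken"],
              "hostage": ["hostage", "hostages"]}
RELATED = {"hostage": ["captive", "abduction"]}

@dataclass
class QueryTerm:
    text: str
    weight: float

def expand(term: str) -> list[str]:
    """Union of inflected word forms and semantically related terms."""
    return sorted(set(WORD_FORMS.get(term, [term]) + RELATED.get(term, [])))

def build_boolean_query(content_terms: list[QueryTerm]) -> str:
    """OR the alternatives of each term, AND the groups together, and attach
    each group's weight in a generic [w] notation (not Topic syntax)."""
    groups = []
    for qt in content_terms:
        alternatives = " OR ".join(expand(qt.text))
        groups.append(f"({alternatives})[{qt.weight:.2f}]")
    return " AND ".join(groups)

# Content terms as they might come out of the syntactic analysis of a short
# query about hostage taking (hand-picked for the example).
print(build_boolean_query([QueryTerm("take", 0.5), QueryTerm("hostage", 1.0)]))
```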
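The form only states the weighting heuristic used in ranking (phrases found by syntactic analysis outweigh single words; word weights depend on polysemy count), not its formula. The sketch below is one plausible reading, assuming the weight decreases with the number of senses recorded in the lexicon; the constants, the 1/polysemy scaling, and the example counts are assumptions, not the system's actual values.

```python
# Illustrative weighting sketch: phrases get a fixed higher weight, single
# words get a weight that decreases with their polysemy count.
PHRASE_BONUS = 2.0  # assumed constant: a parsed phrase counts double a word

def term_weight(term: str, polysemy: dict[str, int], is_phrase: bool) -> float:
    """Return a query-term weight in (0, PHRASE_BONUS]."""
    if is_phrase:
        # A phrase produced by the parser is less ambiguous than its parts,
        # so it receives the fixed higher weight.
        return PHRASE_BONUS
    # Fewer senses in the lexicon -> more discriminative -> higher weight.
    senses = max(1, polysemy.get(term, 1))
    return 1.0 / senses

# Example with hypothetical polysemy counts.
polysemy = {"bank": 5, "interest": 4, "rate": 3, "interest rate": 1}
for term, is_phrase in [("interest rate", True), ("bank", False), ("rate", False)]:
    print(term, round(term_weight(term, polysemy, is_phrase), 2))
```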