System Summary and Timing

Organization Name: GSI-Erli
List of Run ID's: erliR1, erliA1

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 502
- Controlled Vocabulary?: no
- Stemming Algorithm: no
- Morphological Analysis: no
- Term Weighting: no
- Phrase Discovery?:
- Tokenizer?:

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: erliR1
  - Total Storage (in MB): 239
  - Total Computer Time to Build (in hours): 6:24 (elapsed)
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: yes
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: erliA1
  - Total Storage (in MB): 1,187
  - Total Computer Time to Build (in hours): 37:37 (elapsed)
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: yes
  - Only Single Terms Used?: yes
- Clusters
  - Automatic Process? (If not, number of manual hours):
- N-grams, Suffix Arrays, Signature Files
  - Automatic Process? (If not, number of manual hours):
- Knowledge Bases
  - Automatic Process? (If not, number of manual hours):
  - Use of Manual Labor
    - Mostly Manually Built using Special Interface:
- Special Routing Structures
  - Automatic Process? (If not, number of manual hours):
- Other Data Structures built from TREC text
  - Automatic Process? (If not, number of manual hours):

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): domain-independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): lexicon
  - Total Storage (in MB): 4
  - Number of Concepts Represented: 45,000 lemmas
  - Type of Representation: morphosyntactic information
  - Total Manual Time to Build (in hours): thousands
  - Total Manual Time to Modify for TREC (if already built): 12 hours
  - Use of Manual Labor
    - Mostly Manually Built using Special Interface: yes
    - Mostly Machine Built with Manual Correction:
- Externally-built Auxiliary File
  - Type of File (Treebank, WordNet, etc.): part of WordNet
  - Total Storage (in MB): 4
  - Number of Concepts Represented: 36,000
  - Type of Representation: semantic net

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: the 'topics' used for automatic query building are manually written plain English sentences (cf. the 'Manually Constructed Queries' sections below)
- Average Computer Time to Build Query (in CPU seconds): 5.35 (elapsed; CPU times not available)
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: yes
  - Syntactic Parsing of Topics?: yes
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?: yes
    - Structure Used: lexicon and semantic net
  - Automatic Addition of Boolean Connectors or Proximity Operators?: yes

Automatically Built Queries (Routing)
- Topic Fields Used: the 'topics' used for automatic query building are manually written plain English sentences (cf. the 'Manually Constructed Queries' sections below)
- Average Computer Time to Build Query (in CPU seconds): 7.12 (elapsed; CPU times not available)
- Method used in Query Construction
  - Terms Selected From
    - Topics: yes
  - Term Weighting with Weights Based on terms in
    - Topics: yes
  - Phrase Extraction from
    - Topics: yes
  - Syntactic Parsing of
    - Topics: yes
  - Word Sense Disambiguation using:
  - Proper Noun Identification Algorithm from:
  - Tokenizer:
  - Heuristic Associations to Add Terms from:
  - Expansion of Queries using Previously-Constructed Data Structure: yes
    - Structure Used: lexicon and semantic net
  - Automatic Addition of Boolean Connectors or Proximity Operators using information from
    - Topics: yes (result of the syntactic parse)

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: mainly title and description
- Average Time to Build Query (in minutes): 5
- Type of Query Builder
  - Computer System Expert: yes
- Tools used to Build Query
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Method used in Query Construction
  - Addition of Terms not Included in Topic?:
  - Other: construction of a plain English sentence close to a realistic (i.e. short) natural language query

Manually Constructed Queries (Routing)
- Topic Fields Used: mainly title and description
- Average Time to Build Query (in minutes): 10
- Type of Query Builder
  - Computer System Expert: yes
- Tools used to Build Query
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Data Used for Building Query from:
- Method used in Query Construction
  - Addition of Terms not Included in Topic?:
  - Other: construction of a plain English sentence close to a realistic (i.e. short) natural language query

Searching

Search Times
- Run ID: erliR1
- Computer Time to Search (Average per Query, in CPU seconds): 64 (elapsed; CPU times not available)
- Component Times: 11% for query generation, 89% for search

Search Times
- Run ID: erliA1
- Computer Time to Search (Average per Query, in CPU seconds): 312 (elapsed; CPU times not available)
- Component Times: 2% for query generation, 98% for search

Machine Searching Methods
- Boolean Matching?: yes

Factors in Ranking
- Other Term Weights?: yes; phrases built via syntactic analysis are weighted higher than single words, and words are weighted according to their polysemy count (a sketch of this weighting heuristic appears at the end of this summary)

Machine Information
- Machine Type for TREC Experiment: Sun SPARCstation 4
- Was the Machine Dedicated or Shared: shared
- Amount of Hard Disk Storage (in MB): 4,000 + 4,000 (two disks)
- Amount of RAM (in MB): 32
- Clock Rate of CPU (in MHz): 70

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: 1 man-year (starting from previously built general-purpose NLP resources)
- Given Appropriate Resources
  - Could your system run faster?: most probably (this depends on the search engine used; cf. 'Significant Areas of System' below). Improvements can be obtained in query generation time, but that is a small part of the process; more powerful hardware would also help.
- Features the System is Missing that would be Beneficial: use of semantic closeness to refine weights improves precision (tests were made after the TREC deadline)

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by the above questions: the system is purely query-side and acts as a linguistic add-on to various search engines (which are in charge of full-text indexing and search). It performs a linguistic analysis of the query, then automatically builds an expanded boolean query with word forms, operators, and weights; this query is then passed to the search engine (a sketch of this pipeline appears at the end of this summary). Topic (a trademark of Verity Inc.) was the search engine used for the TREC experiment.
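The following is a minimal sketch of the query-side pipeline described under 'Significant Areas of System': a plain English sentence is analysed, each content term is expanded with alternative word forms from the lexicon and related terms from the semantic net, and a weighted boolean query is emitted for an external search engine. The data structures, the example expansion tables, and the generic AND/OR output notation are illustrative assumptions only; the actual system passed its queries to Verity's Topic engine, whose query syntax is not reproduced here.

```python
# Illustrative sketch only: expand content terms with word forms (lexicon) and
# related terms (semantic net), then emit a weighted boolean query string.
from dataclasses import dataclass

# Hypothetical stand-ins for the 45,000-lemma lexicon and the semantic net.
WORD_FORMS = {"take": ["take", "takes", "took", "taken"],
              "hostage": ["hostage", "hostages"]}
RELATED = {"hostage": ["captive", "abduction"]}

@dataclass
class QueryTerm:
    text: str
    weight: float

def expand(term: str) -> list[str]:
    """Union of inflected word forms and semantically related terms."""
    return sorted(set(WORD_FORMS.get(term, [term]) + RELATED.get(term, [])))

def build_boolean_query(content_terms: list[QueryTerm]) -> str:
    """OR the alternatives of each term, AND the groups together, and attach
    each group's weight in a generic [w] notation (not Topic syntax)."""
    groups = []
    for qt in content_terms:
        alternatives = " OR ".join(expand(qt.text))
        groups.append(f"({alternatives})[{qt.weight:.2f}]")
    return " AND ".join(groups)

# Content terms as they might come out of the syntactic analysis of a short
# query about hostage taking (hand-picked for the example).
print(build_boolean_query([QueryTerm("take", 0.5), QueryTerm("hostage", 1.0)]))
```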
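The form only states the weighting heuristic used in ranking (phrases found by syntactic analysis outweigh single words; word weights depend on polysemy count), not its formula. The sketch below is one plausible reading, assuming the weight decreases with the number of senses recorded in the lexicon; the constants, the 1/polysemy scaling, and the example counts are assumptions, not the system's actual values.

```python
# Illustrative weighting sketch: phrases get a fixed higher weight, single
# words get a weight that decreases with their polysemy count.
PHRASE_BONUS = 2.0  # assumed constant: a parsed phrase counts double a word

def term_weight(term: str, polysemy: dict[str, int], is_phrase: bool) -> float:
    """Return a query-term weight in (0, PHRASE_BONUS]."""
    if is_phrase:
        # A phrase produced by the parser is less ambiguous than its parts,
        # so it receives the fixed higher weight.
        return PHRASE_BONUS
    # Fewer senses in the lexicon -> more discriminative -> higher weight.
    senses = max(1, polysemy.get(term, 1))
    return 1.0 / senses

# Example with hypothetical polysemy counts.
polysemy = {"bank": 5, "interest": 4, "rate": 3, "interest rate": 1}
for term, is_phrase in [("interest rate", True), ("bank", False), ("rate", False)]:
    print(term, round(term_weight(term, polysemy, is_phrase), 2))
```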