System Summary and Timing

Organization Name: IBM T. J. Watson Research Center, Human Language Technologies
List of Run IDs: ibms96a, ibms96b

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 940
- Controlled Vocabulary?: No
- Stemming Algorithm: Yes
- Morphological Analysis: based on POS tagger
- Term Weighting: Yes
- Phrase Discovery?: yes
  - Kind of Phrase: bigrams
  - Method Used (statistical, syntactic, other): all bigrams within fixed window and with words in query vocabulary
- Syntactic Parsing?: No
- Word Sense Disambiguation?: No
- Heuristic Associations (including short definition)?: No
- Spelling Checking (with manual correction)?: No
- Spelling Correction?: No
- Proper Noun Identification Algorithm?: No
- Tokenizer?: yes
  - Patterns which are tokenized: words with punctuation attached
- Manually-Indexed Terms?: No

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: ibms96a, ibms96b
  - Total Storage (in MB): 697
  - Total Computer Time to Build (in hours): 2
  - Automatic Process? (If not, number of manual hours): Yes
  - Use of Term Positions?: No
  - Only Single Terms Used?: Yes
- Clusters
- N-grams, Suffix arrays, Signature Files
  - Run ID: ibms96a, ibms96b
  - Total Storage (in MB): 9
  - Total Computer Time to Build (in hours): 10
  - Automatic Process? (If not, number of manual hours): Yes
  - Brief Description of Method: Bigrams within 6-word window
- Knowledge Bases
- Use of Manual Labor
  - Mostly Manually Built using Special Interface:
  - Mostly Machine Built with Manual Correction:
- Special Routing Structures
- Other Data Structures built from TREC text
  - Run ID: ibms96b
  - Type of Structure: histograms
  - Total Storage (in MB): 3,000
  - Total Computer Time to Build (in hours): 24
  - Automatic Process? (If not, number of manual hours): Yes

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: Description
- Average Computer Time to Build Query (in cpu seconds): 3
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: Yes
  - Phrase Extraction from Topics?: yes
  - Syntactic Parsing of Topics?: No
  - Word Sense Disambiguation?: No
  - Proper Noun Identification Algorithm?: No
  - Tokenizer?: yes
    - Patterns which are Tokenized: words with punctuation attached
  - Heuristic Associations to Add Terms?: No
  - Expansion of Queries using Previously-Constructed Data Structure?: Yes
    - Structure Used: unigrams from top 10 docs of initial ranking

Searching

Search Times
- Run ID: ibms96a
  - Computer Time to Search (Average per Query, in CPU seconds): 230
- Run ID: ibms96b
  - Computer Time to Search (Average per Query, in CPU seconds): 900

Machine Searching Methods
- Vector Space Model?: Yes
- Probabilistic Model?: Yes
- N-gram Matching?: Yes

Factors in Ranking
- Term Frequency?: Yes
- Inverse Document Frequency?: Yes
- Document Length?: Yes
- Percentage of Query Terms which match?: no
- N-gram Frequency?: Yes

Machine Information
- Machine Type for TREC Experiment: RS6000
- Was the Machine Dedicated or Shared: Shared
- Amount of Hard Disk Storage (in MB): 100,000
- Amount of RAM (in MB): 200
- Clock Rate of CPU (in MHz): 130

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: This is a customized system requiring a large amount of software engineering.
- Given appropriate resources
  - Could your system run faster?: Yes
  - By how much (estimate)?: 2
- Features the System is Missing that would be beneficial: Word-sense disambiguation, syntactic parsing
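
Illustrative sketch (not the IBM implementation): the form describes two mechanisms only briefly, bigrams collected within a 6-word window from words in the query vocabulary, and query expansion with unigrams from the top 10 documents of an initial ranking. The Python below shows one plausible reading of each. The function names, the treatment of a bigram as an ordered pair, the interpretation of the window as the first word plus the next five, and the choice of expansion terms by raw frequency are all assumptions made for illustration; the source states none of them.

```python
from collections import Counter


def window_bigrams(tokens, query_vocab, window=6):
    """Count ordered word pairs that co-occur within `window` words.

    Assumption: the window includes the first word, so the second word
    is taken from the next `window - 1` positions. Both words must be
    in the query vocabulary.
    """
    counts = Counter()
    for i, first in enumerate(tokens):
        if first not in query_vocab:
            continue
        for second in tokens[i + 1 : i + window]:
            if second in query_vocab:
                counts[(first, second)] += 1
    return counts


def expansion_terms(ranked_docs, query_terms, top_docs=10, n_terms=5):
    """Pick expansion unigrams from the top-ranked documents.

    Assumption: candidates are scored by raw frequency across the top
    `top_docs` documents; `n_terms` is a hypothetical cutoff.
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(t for t in doc if t not in query_terms)
    return [term for term, _ in counts.most_common(n_terms)]
```

For example, `window_bigrams("oil x x x x price".split(), {"oil", "price"})` counts the pair `("oil", "price")` once, because "price" falls in the sixth position of the window that starts at "oil".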