System Summary and Timing

Organization Name: Rank Xerox Research Centre (RXRC)
List of Run ID's: base.xerox, simple.xerox, join.xerox, join-short.xerox, cmwe.xerox, cjoin.xerox (NLP); xerox-spS, xerox-spP, xerox-spT, xerox-spD (SPANISH/SP)

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 619 (NLP), by part of speech (SP)
- Controlled Vocabulary?: no
- Stemming Algorithm: morphology
- Morphological Analysis: yes, inflectional (English/Spanish)
- Term Weighting: sqrt(tf)*idf/sqrt(doc-length)
- Phrase Discovery?: yes, some runs
  - Kind of Phrase: adjacent pairs, syntactic pairs (NLP), adjacent noun pairs (SP)
  - Method Used (statistical, syntactic, other): syntactic, statistical
- Syntactic Parsing?: yes, some runs
- Word Sense Disambiguation?: no
- Heuristic Associations (including short definition)?: no
- Spelling Checking (with manual correction)?: no
- Spelling Correction?: no
- Proper Noun Identification Algorithm?: no
- Tokenizer?: standard SMART tokenizer
- Manually-Indexed Terms?: no
- Other Techniques for building Data Structures: no

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: base.xerox
  - Total Storage (in MB): 90
  - Total Computer Time to Build (in hours): 0.13 (real)
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: simple.xerox
  - Total Storage (in MB): 125
  - Total Computer Time to Build (in hours): 0.3 (real)
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no, stems and phrases
- Inverted index
  - Run ID: join.xerox
  - Total Storage (in MB): 126
  - Total Computer Time to Build (in hours): 0.3
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no, stems and syntactic pairs
- Inverted index
  - Run ID: join-short.xerox
  - Total Storage (in MB): 126
  - Total Computer Time to Build (in hours): 0.3
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no, stems, phrases and syntactic pairs
- Inverted index
  - Run ID: cmwe.xerox
  - Total Storage (in MB): 126
  - Total Computer Time to Build (in hours): 0.3
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no, stems and syntactic pairs
- Inverted index
  - Run ID: cjoin.xerox
  - Total Storage (in MB): 126
  - Total Computer Time to Build (in hours): 0.3
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no, stems, phrases and syntactic pairs
- Inverted index
  - Run ID: xerox-spS
  - Total Storage (in MB): ???
  - Total Computer Time to Build (in hours): ???
  - Automatic Process? (If not, number of manual hours): ???
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: xerox-spP/spT/spD
  - Total Storage (in MB): 191
  - Total Computer Time to Build (in hours): 0.5
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no, stems and phrases
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: title, desc, narr (not in join-short.xerox, xerox-spD)
- Average Computer Time to Build Query (in cpu seconds): 9
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: yes, some runs
  - Syntactic Parsing of Topics?: yes, some runs
  - Word Sense Disambiguation?: no
  - Proper Noun Identification Algorithm?: no
  - Tokenizer?: SMART's standard tokenizer
  - Heuristic Associations to Add Terms?: no
  - Expansion of Queries using Previously-Constructed Data Structure?: ??? (tagging???)
  - Automatic Addition of Boolean Connectors or Proximity Operators?: no
  - Other: no

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: title, desc, narr
- Average Time to Build Query (in Minutes): 10
- Type of Query Builder
  - Domain Expert: no
  - Computer System Expert: yes
- Tools used to Build Query
  - Word Frequency List?: no
  - Knowledge Base Browser?: no
  - Other Lexical Tools?: no
- Method used in Query Construction
  - Term Weighting?: yes
  - Boolean Connectors (AND, OR, NOT)?: no
  - Proximity Operators?: no
  - Addition of Terms not Included in Topic?: no
  - Other: no

Searching

Search Times
- Run ID: averaged over all runs
- Computer Time to Search (Average per Query, in CPU seconds): 2

Machine Searching Methods
- Vector Space Model?: yes

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Other Term Weights?: no
- Semantic Closeness?: no
- Position in Document?: no
- Syntactic Clues?: no
- Proximity of Terms?: no
- Information Theoretic Weights?: no
- Document Length?: yes
- Percentage of Query Terms which match?: 1.09-1.80 (NLP)
- N-gram Frequency?: no
- Word Specificity?: no
- Word Sense Frequency?: no
- Cluster Distance?: no
- Other: no

Machine Information
- Machine Type for TREC Experiment: SPARC Ultra I
- Was the Machine Dedicated or Shared: shared
- Amount of Hard Disk Storage (in MB): 9000
- Amount of RAM (in MB): 132
- Clock Rate of CPU (in MHz): 167

System Comparisons
- Given appropriate resources
  - Could your system run faster?: yes
- Features the System is Missing that would be beneficial: Boolean matching, proximity information

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: ???
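
Illustration of the term weighting

The summary gives the term weight as sqrt(tf)*idf/sqrt(doc-length) and states that searching uses a vector space model. The following is a minimal sketch of how such a weight could be computed and used for ranking; it is not the SMART implementation used in these runs. Several details are assumptions not stated in the form: idf is taken as log(N/df), doc-length is taken as the number of tokens in the document, and query terms are weighted by their raw frequency in the query. The names build_weights and rank are hypothetical.

```python
import math
from collections import Counter


def build_weights(docs):
    """Compute sqrt(tf) * idf / sqrt(doc_length) for every term of every document.

    docs maps a document id to its list of index terms (already tokenized
    and normalized). Assumption: idf = log(N / df), and doc_length is the
    token count of the document.
    """
    n_docs = len(docs)
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for counts in term_counts.values():
        df.update(counts.keys())

    idf = {term: math.log(n_docs / n) for term, n in df.items()}

    weights = {}
    for doc_id, counts in term_counts.items():
        doc_length = sum(counts.values())
        weights[doc_id] = {
            term: math.sqrt(tf) * idf[term] / math.sqrt(doc_length)
            for term, tf in counts.items()
        }
    return weights


def rank(query_tokens, weights):
    """Rank documents by inner product with the query.

    Assumption: each query term is weighted by its raw query frequency.
    """
    query = Counter(query_tokens)
    scores = {
        doc_id: sum(q_tf * doc_w.get(term, 0.0) for term, q_tf in query.items())
        for doc_id, doc_w in weights.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_docs = {
        "d1": ["xerox", "parser", "parser"],
        "d2": ["xerox", "index"],
        "d3": ["index", "stem"],
    }
    toy_weights = build_weights(toy_docs)
    for doc_id, score in rank(["parser"], toy_weights):
        print(doc_id, round(score, 4))
```

The square roots dampen both repeated occurrences of a term and the advantage of long documents, which matches the "Term Frequency", "Inverse Document Frequency" and "Document Length" factors marked "yes" under Factors in Ranking.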