System Summary and Timing

Organization Name: CLARITECH Corporation
List of Run IDs: CLARTF, CLARTN

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to Build Data Structures
- Length (in words) of the stopword list: N/A
- Controlled Vocabulary?: N/A
- Stemming Algorithm: N/A
- Morphological Analysis: A comprehensive inflectional morphology is used to produce word roots. Participles are retained in surface forms, although normalization is possible. No derivational morphology is used.
- Term Weighting:
  1) TF*IDF over phrases is used for retrieval.
  2) An importance coefficient is applied to TF*IDF for query terms.
  3) A combination of statistics, linguistic-structure analysis, and heuristics is used for thesaurus extraction; the statistical measures include frequency and distribution.
- Phrase Discovery?: Yes.
- Kind of Phrase: Simplex noun phrases, excluding post-nominal appositive, prepositional, and participial phrases and relative clauses.
- Method Used (statistical, syntactic, other): A deterministic, rule-based parser nominates linguistic-constituent structure; a filter retains only simplex noun phrases for indexing purposes.
- Syntactic Parsing?: Yes (see above).
- Word Sense Disambiguation?: The parser grammar includes heuristics for syntactic-category disambiguation.
- Heuristic Associations (including short definition)?: No.
- Spelling Checking (with manual correction)?: No.
- Spelling Correction?: No.
- Proper Noun Identification Algorithm?: Words not identified in the lexicon (about 100,000 root forms of English) are assumed to be "candidate proper nouns". The grammar accommodates structure that includes proper names; this technique does not require case-sensitive clues (e.g., capitalization) to be effective.
- Tokenizer?: None.
- Manually-Indexed Terms?: No.
- Other Techniques for building Data Structures:
  1) Thesaurus discovery, which we use for query-vector augmentation, involves identifying the core characteristic terminology of a document set. The process ranks all terms according to scores along several parameters and then selects the subset of terminology that optimizes those scores.
  2) Document Windows: documents are decomposed into overlapping windows of 50 terms each. These windows are used in thesaurus extraction and as a component of the document-similarity calculation. (A sketch of this decomposition appears below.)
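The windowing step is simple enough to illustrate directly. The following is a minimal sketch, not the CLARIT implementation: the 50-term window length is taken from the description above, but the 25-term stride (50% overlap) is an assumption, since the amount of overlap is not stated in this summary.

```python
def document_windows(tokens, size=50, stride=25):
    """Yield overlapping windows of `size` terms from a token list.

    size=50 follows the description above; stride=25 (50% overlap)
    is an assumption, as the actual overlap used is not stated.
    """
    if len(tokens) <= size:
        yield tokens                      # a short document is one window
        return
    for start in range(0, len(tokens) - size + 1, stride):
        yield tokens[start:start + size]
    if (len(tokens) - size) % stride != 0:
        yield tokens[-size:]              # cover the tail of the document

# Example: a 120-term document yields windows starting at terms 0, 25, 50,
# plus a final tail window starting at term 70.
windows = list(document_windows([f"t{i}" for i in range(120)]))
```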
Statistics on Data Structures built from TREC Text

- Inverted index: In the version of the system configured for experimentation, we do not build an inverted index for the whole corpus; we simply index the formal query structures, expanded with statistical information from the full corpus. This allows us to change parameters quickly at all levels of the system.
  - Run ID: CLARTF
  - Total Storage (in MB): 5.1
  - Total Computer Time to Build (in hours): 10.5 (It takes approximately 8.75 minutes to invert the full set of 50 query vectors and merge in the global corpus statistics. The figure of 10.5 hours represents the time required to collect term-distribution statistics for the entire corpus, which needs to be performed only once for all of the runs.)
  - Automatic Process? (If not, number of manual hours): Yes.
  - Use of Term Positions?: No.
  - Only Single Terms Used?: All terms used.
  - Run ID: CLARTN
  - Total Storage (in MB): 5.1
  - Total Computer Time to Build (in hours): 10.5 (For the CLARTN run, processing the query vectors took approximately 9.25 minutes. For more information on index creation, see above.)
  - Automatic Process? (If not, number of manual hours): Yes.
  - Use of Term Positions?: No.
  - Only Single Terms Used?: All terms used.
- Clusters: None.
- N-grams, Suffix arrays, Signature Files: None.
- Knowledge Bases: None.
- Use of Manual Labor
- Special Routing Structures: N/A
- Other Data Structures built from TREC text
  - Run ID: CLARTF
  - Type of Structure: Automatic Feedback Thesaurus
  - Total Storage (in MB): 0.14
  - Total Computer Time to Build (in hours): 4.51 (This includes 4.33 hours to perform an initial retrieval pass and 0.19 hours to construct thesauri from those results. Note that the initial retrieval pass has to be executed only once for both the feedback and distractor thesauri.)
- Other Data Structures built from TREC text
  - Run ID: CLARTF and CLARTN
  - Type of Structure: Automatic Distractor Thesaurus
  - Total Storage (in MB): 2.7
  - Total Computer Time to Build (in hours): 6.97 (This includes 4.33 hours to perform the initial retrieval pass and 2.64 hours for extraction of the thesauri. See Automatic Feedback Thesaurus, above. Since the initial query vectors for CLARTN and CLARTF are identical, the two runs are able to use the same distractor thesauri.)
  - Brief Description of Method: A very large, first-order thesaurus is constructed from a large sample of relatively high-scoring but probably non-relevant document windows. (The discovered terms are regarded as "found distractors.") For TREC-4, we used a sample of 500 windows from ranks 37,500 through 39,000 of the initial retrieval over the corpus. The thesaurus was extracted at the 70-percent threshold.
- Other Data Structures built from TREC text
  - Run ID: CLARTN
  - Type of Structure: Automatic Feedback Thesaurus
  - Total Storage (in MB): 0.17
  - Total Computer Time to Build (in hours): 4.66 (This includes 4.33 hours to perform an initial retrieval pass and 0.33 hours to construct thesauri from those results. Note that the initial retrieval pass from the CLARTF run is re-used here.)
  - Automatic Process? (If not, number of manual hours): Yes.
  - Brief Description of Method: A thesaurus is constructed as described for the CLARTF run, but no required-terms filter is applied. The 50 top-scoring document windows are submitted directly to thesaurus extraction.
- Other Data Structures built from TREC text
  - Run ID: CLARTN and CLARTF
  - Type of Structure: Sampled Distractor Terms
  - Total Storage (in MB): 0.15
  - Total Computer Time to Build (in hours): 0.01
  - Automatic Process? (If not, number of manual hours): Yes.
  - Brief Description of Method: The 2,000 most frequent words in the entire retrieval corpus were taken as a representative sample for purposes of initial ad-hoc querying. (A sketch of this selection appears below.)
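The sampled distractor terms amount to a simple frequency cut. The sketch below assumes the corpus is available as an iterable of token lists; the figure of 2,000 is from the description above, but the tokenization and any normalization are assumptions.

```python
from collections import Counter

def sampled_distractor_terms(documents, n=2000):
    """Return the n most frequent words over the whole corpus.

    `documents` is assumed to be an iterable of token lists; how terms
    were tokenized and normalized is not specified in this summary.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(n)]
```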
Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Type of File (thesaurus, knowledge base, lexicon, etc.): Lexicon
  - Total Storage (in MB): 1.75
  - Number of Concepts Represented: 90,057 root forms
  - Type of Representation: Root/syntactic-category pairs
  - Total Computer Time to Build (in hours): N/A
  - Total Manual Time to Build (in hours): The CLARIT lexicon was manually constructed during the early phases of the CLARIT research project (1988-89), using word lists extracted from on-line sources.
  - Total Manual Time to Modify for TREC (if already built): No modification was required.
- Use of Manual Labor
  - Mostly Manually Built using Special Interface: Yes.
- Externally-built Auxiliary File

Query Construction

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: Title and description
- Average Time to Build Query (in Minutes): 15 min./query
- Type of Query Builder
  - Domain Expert: No.
  - Computer System Expert: Yes.
- Tools Used to Build Query
  - Word Frequency List?: No.
  - Knowledge Base Browser?: No.
  - Other Lexical Tools?: No.
  - CLARIT Retrieval System: Candidate query vectors are suggested by automatic query construction using CLARIT parsing; these vectors are then manually refined by the user. Initial query performance is evaluated on other databases, including data from TREC disk 1.
- Method Used in Query Construction
  - Term Weighting?: Topic terms are weighted using TF*IDF * Importance, where the importance coefficient for query source terminology is assigned manually.

Searching

Search Times
- Computer Time to Search (Average per Query, in CPU seconds): The querying program processed all 50 topics (either routing or ad hoc) in parallel on five machines, without the use of any inverted index or postings file. In addition, this simple prototype architecture required two passes: one over full documents and one over document windows. Processing took approximately 6 hours.

Machine Searching Methods
- Vector Space Model?: Yes, using a cosine distance measure. The vector space does not fix the document-vector length component of the cosine formula; rather, the length of a document vector is allowed to vary with the terms present in the query. Only the terms in the query vector are considered 'active' for a given distance calculation, and all other terms in the document are ignored. Since query vectors may include zero-coefficient distractor terms, each query can specify its own list of 'active terminology', allowing query-specific control of the document representation. (The sketch following the Factors in Ranking list illustrates this behavior.)
- Other: Document windows used for feedback in the CLARTF run are subjected to a regular-expression search for required terms, as described above.

Factors in Ranking
- Term Frequency?: TF_AUG = 0.5 + 0.5 * (TF / MAX_TF)
- Inverse Document Frequency?: IDF = log2(number of documents in the corpus / number of documents containing the term) + 1
- Other Term Weights?: An importance coefficient is manually assigned to initial query terms (see above).
- Semantic Closeness?: No.
- Position in Document?: No.
- Syntactic Clues?: No.
- Proximity of Terms?: No.
- Information Theoretic Weights?: No.
- Document Length?: Only as implicitly captured in the cosine distance measure.
- Percentage of Query Terms which Match?: Only as implicitly captured in the cosine distance measure.
- N-gram Frequency?: No.
- Word Specificity?: No.
- Word Sense Frequency?: No.
- Cluster Distance?: No.
- Other: The final score for a document is computed as the weighted average of the score of the highest-scoring window in the document and the score of the complete document vector: score = (20 * document_score + max_window_score) / 21. (The sketch below combines these ranking factors.)
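The ranking factors above can be combined into a single illustrative calculation. The following is a minimal sketch under stated assumptions, not the CLARIT implementation: the TF_AUG, IDF, importance, and final-score formulas are taken from this section, while the data structures, function names, and the exact form of the query-side TF are our own. The handling of zero-importance distractor terms (contributing to document-vector length but not to the dot product) follows the prose description of 'active terminology' above.

```python
import math

def tf_aug(tf, max_tf):
    # Term Frequency factor: TF_AUG = 0.5 + 0.5 * TF / MAX_TF
    return 0.5 + 0.5 * tf / max_tf

def idf(n_docs, doc_freq):
    # Inverse Document Frequency: IDF = log2(N / df) + 1
    return math.log2(n_docs / doc_freq) + 1

def query_restricted_cosine(query, doc_tf, n_docs, doc_freq):
    """Cosine similarity over the query's 'active' terms only.

    `query` maps term -> (tf, importance). Zero-importance entries play
    the role of distractor terms: they add nothing to the dot product
    but do extend the document-vector length, giving the query control
    over the document representation as described above.
    """
    max_qtf = max(tf for tf, _ in query.values())
    max_dtf = max(doc_tf.values()) if doc_tf else 1
    dot = q_len = d_len = 0.0
    for term, (qtf, importance) in query.items():
        w = idf(n_docs, doc_freq.get(term, 1))
        qw = tf_aug(qtf, max_qtf) * w * importance
        dw = tf_aug(doc_tf[term], max_dtf) * w if term in doc_tf else 0.0
        dot += qw * dw
        q_len += qw * qw
        d_len += dw * dw      # terms absent from the query are ignored
    if q_len == 0.0 or d_len == 0.0:
        return 0.0
    return dot / (math.sqrt(q_len) * math.sqrt(d_len))

def final_score(document_score, max_window_score):
    # score = (20 * document_score + max_window_score) / 21
    return (20.0 * document_score + max_window_score) / 21.0
```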
Machine Information
- Machine Type for TREC Experiment: 4 x DEC 3000/400 and 1 x DEC 3000/600 (Alpha/AXP), running DEC OSF/1
- Was the Machine Dedicated or Shared: Dedicated
- Amount of Hard Disk Storage (in MB): 15,000
- Amount of RAM (in MB): 3 @ 128 MB, 2 @ 64 MB
- Clock Rate of CPU (in MHz): 4 @ 133.33 MHz, 1 @ 175 MHz

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: The system used for TREC-4 processing was configured for flexibility rather than speed. Essentially, it is designed to store almost all intermediate results and to allow new modules to be added dynamically.
- Given appropriate resources
  - Could your system run faster?: Yes, by pinning down a set of configuration options and using the normal CLARIT retrieval modules.
  - By how much (estimate)?: The standard CLARIT system indexes at a rate of 100 megabytes/hour on a 100-MHz processor; queries of 30-50 terms over 1-2 gigabytes of data typically require less than 5 CPU seconds.
- Features the System is Missing that would be Beneficial: The CLARIT TREC-4 system did not take advantage of several processing options that might have improved results, including feedback-term re-weighting, tokenization, sub-lexicon discovery over training sets, and EQ-class discovery for thesaural terms.