NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

Appendix B: System Features
National Institute of Standards and Technology
D. K. Harman

I. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES - METHODS USED

No pre-indexing.

Automatic query generation only.

A comprehensive inflectional morphology is used to produce word roots. Participles are retained in surface forms (although normalization is possible).

NE morphology is used.

1) IDF-TF over phrases for retrieval. 2) A combination of statistics, including frequency and distribution, for thesaurus discovery.

Simplex noun phrases -- not including prepositional phrases or relative clauses.

A deterministic, rule-based parser that nominates noun phrases based on testing for phrase-boundary conditions. The parser grammar includes heuristics for syntactic category disambiguation.

Words not identified in the lexicon (about 100,000 root forms of English) are assumed to be "candidate" proper nouns. This technique does not appeal to [...] information, etc.

1) Thesaurus Discovery -- which we use for query vector augmentation -- involves the identification of core characteristic terminology over a document set. It [...] according to several parameters, including frequency and distribution, and then selects the subset of terminology that optimizes for these scores. 2) Documents are broken into smaller, paragraph-size units called "subdocuments." The subdocuments are the units from which statistics are drawn and over[...] is measured.

Yes, "ltc": log (term freq in doc) * term idf weight * document length normalization.

The LSI/SVD analysis of the term by document matrix.

LSI takes a term-document matrix, transforms it by a user-specified weighting scheme (SMART's "[...] experiments), and then calculates the best k-dimensional approximation to this matrix using singular value decomposition (SVD). The number of dimensions, k, [...] for TREC-2.
All retrieval is done in this 200-dimensional LSI space rather than using raw term overlap.

A table of 528 manual and 13787 automatic 2-word phrases. When these are identified in adjacent positions in documents or queries, they are used as additional [...]

Kelly & Stone's Algorithm with minor modifications, using LDOCE as the accompanying lexicon for table look-ups.

Stage 1: based on the text structure (discourse structure) of texts. Stage 2: implicitly with the conceptual graph matching/scoring scheme.

Part-of-speech tagging, phrase and clause bracketing, and special handlers in the RCD module.

Stage 1: for all words in text. Stage 2: only when RIT codes are assigned.

1) assignment of subject field codes to individual words
2) SFC vector construction for documents
3) proper noun knowledge base construction
4) inverted index construction
5) assignment of RIT codes
6) conversion of text into conceptual graph representation
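
The term weighting entry above (log term frequency, idf weight, document length normalization) can be illustrated with a minimal sketch. It is not any participant's actual code. It assumes the common convention of 1 + ln(tf) for the log component, ln(N / df) for the idf component, and cosine normalization for the length component. The function name `ltc_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def ltc_weights(doc_tokens, doc_freq, n_docs):
    """Return {term: weight} as log tf * idf, cosine-normalized (assumed convention)."""
    tf = Counter(doc_tokens)
    raw = {}
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # no collection statistics for this term: skip it
        log_tf = 1.0 + math.log(freq)   # log term frequency component
        idf = math.log(n_docs / df)     # inverse document frequency component
        raw[term] = log_tf * idf
    # Document length normalization: divide by the Euclidean norm of the vector.
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {term: 0.0 for term in raw}
    return {term: w / norm for term, w in raw.items()}

# Toy collection (hypothetical data, for illustration only).
docs = [
    ["text", "retrieval", "conference"],
    ["text", "text", "indexing"],
    ["latent", "semantic", "indexing"],
]
doc_freq = Counter(term for d in docs for term in set(d))
weights = [ltc_weights(d, doc_freq, len(docs)) for d in docs]
```

In this sketch a term that occurs in fewer documents receives a larger weight than an equally frequent term that occurs in more documents, and every non-empty weighted vector has unit length, so a dot product between two such vectors is their cosine similarity.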
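
The two-word phrase entry above (table phrases recognized when their words occur in adjacent positions) can be sketched as follows. The source entry is cut off before it says what the phrases are used as, so treating each match as an extra term appended to the token list is an assumption. The function name `add_phrase_terms`, the underscore joining, and the sample phrase table are all hypothetical.

```python
def add_phrase_terms(tokens, phrase_table):
    """Return the tokens plus one joined term per adjacent pair found in phrase_table."""
    terms = list(tokens)
    # Walk over every pair of adjacent positions in the token sequence.
    for first, second in zip(tokens, tokens[1:]):
        if (first, second) in phrase_table:
            # Assumption: a matched phrase becomes an additional term.
            terms.append(first + "_" + second)
    return terms

# Hypothetical phrase table and text, for illustration only.
phrase_table = {("information", "retrieval"), ("noun", "phrase")}
tokens = ["information", "retrieval", "uses", "noun", "phrase", "indexing"]
terms = add_phrase_terms(tokens, phrase_table)
```

The original single-word terms are kept, so a phrase match adds evidence rather than replacing the words it is built from. Pairs that appear in the table but are not adjacent, or appear in the reverse order, are not matched.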