NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
N. Fuhr, C. Buckley
Edited by Donna K. Harman, National Institute of Standards and Technology

Determining the coefficients for the factors in the single term run took about 1.7 hours (steps 2 + 3 above). Construction of the inverted file (steps 4 + 5) containing the factor weighted terms took about 6.4 hours from scratch. Note that this is only about 1.0 hours longer than construction of a normal (for SMART) tf*idf weighted inverted file.

Query indexing and weighting used the normal SMART procedures and took about 1.5 seconds. The fields Topic, Nationality, Narrative, Concepts, Factors and Description were used to index the query, with no distinction made between fields. Retrieval plus ranking took 383 seconds.

A.4 Phrase automatic ad-hoc run

The phrases used were two-term SMART adjacency phrases. A phrase was a pair of adjacent non-stopwords, with both term components stemmed, that occurred at least 25 times in the D1 document set. The term components were put into alphabetical order, so the text phrases "information retrieval" and "retrieving information" both mapped to the same phrase concept. The phrases were treated as a separate ctype within an indexed vector, and had their own dictionary and inverted file, separate from those of the single terms.

Determination of phrases took 5.8 hours, finding 4,700,000 phrases occurring in D1 at least once. Of those phrases, 158,000 occurred at least 25 times. These phrases were then put into a dictionary and used as a controlled vocabulary for phrases when indexing D1 (step 1) and D1 ∪ D2 (step 4). The single term indexing remained exactly as it was in the single term run (terms occurring in phrases were not removed from the vector).

The phrase ad-hoc run used 8 factors (described in Section 2.2):

* constant
* is_single · tf · logidf · imaxtf
* is_single · tf · imaxtf
* is_single · logidf
* lognumterms · imaxtf
* is_phrase · tf · logidf · imaxtf
* is_phrase · tf · imaxtf
* is_phrase · logidf

Determining the coefficients for the factors in the phrase run took about 2.4 hours (steps 2 + 3 above). Construction of the inverted file (steps 4 + 5) containing the factor weighted terms took about 12.6 hours from scratch. Note that this is only about 1.7 hours longer than construction of a normal (for SMART) tf*idf weighted inverted file with phrases.

Query indexing and weighting used the normal SMART procedures and took about 2.7 seconds. The fields Topic, Nationality, Narrative, Concepts, Factors and Description were used to index the query, with no distinction made between fields. Retrieval plus ranking took 374 CPU seconds. (This was less than for the single term run, but for no apparent reason. Perhaps the machine was less loaded.)

A.5 Automatic routing run

One automatic routing run was done. It was totally unconnected to the factor weighting approach described above. It was very easy to implement and run, so we decided we might as well submit it. The actual weighting function had to be programmed, but even so, less than 3 person-days in total were spent on routing.
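As an illustration of the adjacency-phrase construction described in A.4, the following is a minimal sketch and not the SMART implementation. The stopword list, the suffix-stripping stem function, and the names adjacency_phrases and phrase_vocabulary are hypothetical stand-ins introduced here. The sketch also assumes that "adjacent" means adjacent in the original token stream with neither word a stopword, and it ignores sentence boundaries.

```python
import re
from collections import Counter

# Toy stopword list and suffix list: hypothetical stand-ins for SMART's
# own stopword list and stemmer, which the text does not specify.
STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}
SUFFIXES = ("ing", "al", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def adjacency_phrases(text):
    """Yield order-normalized, stemmed pairs of adjacent non-stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for left, right in zip(tokens, tokens[1:]):
        # Assumption: a pair is dropped if either member is a stopword.
        if left in STOPWORDS or right in STOPWORDS:
            continue
        # Alphabetical ordering maps "information retrieval" and
        # "retrieving information" to the same phrase concept.
        yield tuple(sorted((stem(left), stem(right))))


def phrase_vocabulary(documents, min_count=25):
    """Count phrase occurrences over all documents; keep the frequent ones."""
    counts = Counter()
    for doc in documents:
        counts.update(adjacency_phrases(doc))
    return {phrase for phrase, n in counts.items() if n >= min_count}
```

Under these assumptions, the set returned by phrase_vocabulary plays the role of the controlled phrase vocabulary: only pairs that reach the occurrence threshold (25 in the run described above) would be indexed as phrases.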