SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models
chapter
N. Fuhr
C. Buckley
National Institute of Standards and Technology
Donna K. Harman
Determining the coefficients for the factors in the single term run took about 1.7 hours (steps 2 + 3
above). Construction of the inverted file (steps 4+5) containing the factor weighted terms took about
6.4 hours from scratch. Note that this is only about 1.0 hours longer than construction of a normal
(for SMART) tf · idf weighted inverted file.
Query indexing and weighting used the normal SMART procedures and took about 1.5 seconds. The
fields Topic, Nationality, Narrative, Concepts, Factors, Description were used to index the query, with
no distinction made between fields.
Retrieval plus ranking took 383 seconds.
A.4 Phrase automatic ad-hoc run
The phrases being used were two-term SMART adjacency phrases. Phrases were adjacent
non-stopwords, with term components stemmed, that occurred at least 25 times in the D1 document set.
The term components were put into alphabetical order, thus the text phrases "information retrieval"
and "retrieving information" both mapped to the same phrase concept. The phrases were treated as
a separate ctype within an indexed vector, and had their own dictionary and inverted file separate
from those of the single terms.
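
The phrase construction just described can be sketched as follows. This is a minimal illustration, not SMART's code: the stopword list and the `toy_stem` function are hypothetical stand-ins, "adjacent" is assumed to mean consecutive tokens in the text that are both non-stopwords, and the threshold of 25 is assumed to count total occurrences across the collection.

```python
from collections import Counter

# Hypothetical stand-in: SMART's actual stopword list is not given in the text.
STOPWORDS = {"the", "of", "a", "and", "in", "for", "to", "is"}


def toy_stem(word):
    """Crude suffix stripper for illustration only (not SMART's stemmer)."""
    for suffix in ("ing", "al", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def adjacency_phrases(text):
    """Return two-term phrases built from adjacent non-stopwords.

    Components are stemmed and put into alphabetical order, so that word
    order in the text does not matter.
    """
    tokens = text.lower().split()
    phrases = []
    for left, right in zip(tokens, tokens[1:]):
        if left in STOPWORDS or right in STOPWORDS:
            continue
        phrases.append(tuple(sorted((toy_stem(left), toy_stem(right)))))
    return phrases


def build_phrase_dictionary(documents, min_count=25):
    """Keep only phrases occurring at least min_count times in the collection."""
    counts = Counter()
    for doc in documents:
        counts.update(adjacency_phrases(doc))
    return {phrase for phrase, n in counts.items() if n >= min_count}
```

With these stand-ins, "information retrieval" and "retrieving information" both yield the key `("information", "retriev")`, which is the mapping to a single phrase concept described above.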
Determination of phrases took 5.8 hours, finding 4,700,000 phrases occurring in D1 at least once. Of
those phrases 158,000 occurred at least 25 times. These phrases were then put into a dictionary and
used as controlled vocabulary for phrases when doing the indexing of D1 (step 1) and D1 ∪ D2 (step
4). The single term indexing remained exactly as it was in the single term run (terms occurring in
phrases were not removed from the vector).
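
A rough sketch of indexing with the phrase dictionary as a controlled vocabulary is given below. It assumes the tokens have already had stopwords removed and been stemmed; the function name and the `"single"` / `"phrase"` ctype labels are hypothetical and are not SMART identifiers.

```python
from collections import Counter


def index_document(tokens, phrase_vocabulary):
    """Build one vector with two ctypes: single terms and phrases.

    Only phrases present in phrase_vocabulary are indexed. Single terms are
    counted regardless of whether they also take part in a phrase.
    """
    single_terms = Counter(tokens)
    phrases = Counter()
    for left, right in zip(tokens, tokens[1:]):
        key = tuple(sorted((left, right)))
        if key in phrase_vocabulary:
            phrases[key] += 1
    return {"single": single_terms, "phrase": phrases}
```

Keeping the two ctypes in separate counters mirrors the separate dictionaries and inverted files: a phrase match never alters the single term counts.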
The phrase term ad-hoc run used 8 factors (described in 2.2).
* constant
* is_single · tf · logidf · imaxtf
* is_single · tf · imaxtf
* is_single · logidf
* lognumterms · imaxtf
* is_phrase · tf · logidf · imaxtf
* is_phrase · tf · imaxtf
* is_phrase · logidf
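
To make the role of the eight factors concrete, the sketch below computes a factor vector for one term occurrence and combines it with a coefficient vector. The component definitions belong to Section 2.2, which is not reproduced here, so the ones used are assumptions: tf is the within-document frequency, logidf is the log of the inverse document frequency, imaxtf is the reciprocal of the largest tf in the document, and lognumterms is the log of the number of distinct terms in the document. Treating the weight as a linear combination of the factors is likewise an assumption, and all function and parameter names are hypothetical.

```python
import math


def factor_vector(tf, doc_freq, num_docs, max_tf, num_terms, is_phrase):
    """Return the 8 factor values for one term in one document.

    The component definitions below are assumptions, not taken from the text.
    """
    is_single = 0.0 if is_phrase else 1.0
    is_phr = 1.0 - is_single
    logidf = math.log(num_docs / doc_freq)  # assumed form of logidf
    imaxtf = 1.0 / max_tf                   # assumed: inverse of max tf in the document
    lognumterms = math.log(num_terms)       # assumed: log of distinct terms in the document
    return [
        1.0,                                # constant
        is_single * tf * logidf * imaxtf,
        is_single * tf * imaxtf,
        is_single * logidf,
        lognumterms * imaxtf,
        is_phr * tf * logidf * imaxtf,
        is_phr * tf * imaxtf,
        is_phr * logidf,
    ]


def term_weight(factors, coefficients):
    """Combine factor values with their coefficients (assumed linear)."""
    return sum(c * f for c, f in zip(coefficients, factors))
```

Under these assumptions, the is_single and is_phrase indicators give single terms and phrases separate coefficients for otherwise identical factors, while the constant and the lognumterms · imaxtf factor are shared.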
Determining the coefficients for the factors in the phrase run took about 2.4 hours (steps 2 + 3 above).
Construction of the inverted file (steps 4+5) containing the factor weighted terms took about 12.6
hours from scratch. Note that this is only about 1.7 hours longer than construction of a normal (for
SMART) tf · idf weighted inverted file with phrases.
Query indexing and weighting used the normal SMART procedures and took about 2.7 seconds. The
fields Topic, Nationality, Narrative, Concepts, Factors and Description were used to index the query,
with no distinction made between fields.
Retrieval plus ranking took 374 CPU seconds. (This was less than the single term run, but for no
apparent reason. Perhaps the machine was less loaded.)
A.5 Automatic routing run
There was one automatic routing run done. It was totally unconnected to the factor weighting
approach described above. Basically, it was very easy to implement and run so we decided we might
as well submit it. The actual weighting function had to be programmed, but even so less than 3
person-days in total was spent on routing.