NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

TIPSTER Panel -- HNC's MatchPlus System
S. Gallant, R. Hecht-Nielsen, W. Caid, K. Qing, J. Carleton, D. Sudbeck
National Institute of Standards and Technology
Donna K. Harman

2. Because document context vectors are normalized, we may simply find the document d that maximizes the dot product with the query context vector,

       \max_{d} \{ V_d \cdot V_Q \}

3. It is easy to combine keyword match with context vectors. We first use the match as a filter for documents and return the matching documents in order of closeness to the query vector. If all matching documents have been retrieved, MatchPlus can revert to context vectors for finding the closest remaining document.

4. MatchPlus requires only about 300 multiplications and additions to search a document. Moreover, it is easy to decompose the search for a corpus of documents with either parallel hardware or, less expensively, several networked conventional machines (or chips). Each machine can search a subset of the document context vectors and return the closest distances and document numbers in its subset. The closest from among the distances returned by all the processors then determine the documents chosen for retrieval.

   We are also investigating a cluster tree pruning procedure that finds nearest neighbor document context vectors without having to compute dot products for all document context vectors. This data organization affects retrieval speed, but does not change the order in which documents are retrieved.

3 Preliminary Results

Our system is very `young'. It has been able to handle large corpora (1,000,000+ documents) only since July 1992. Nevertheless we have some promising results. In figure 3, we see that MatchPlus gives comparable performance to Salton's SMART system [5] on small, traditional IR test corpora when corresponding term weighting schemes are used.
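
As a concrete picture of the search being evaluated here, the retrieval steps in items 2 to 4 above can be sketched as follows. This is only a minimal illustration under stated assumptions, not HNC's implementation: the function names (search_partition, search_corpus, filtered_search), the toy 3-dimensional vectors, and the document ids are all hypothetical, and real context vectors would be much longer (the figure of about 300 multiplications per document suggests roughly 300 components). Vectors are assumed to be already normalized, so the dot product alone ranks documents.

```python
from heapq import nlargest


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def search_partition(doc_vectors, query_vector, k):
    """Score every document in one subset; return the k best (score, doc_id) pairs."""
    scored = ((dot(vec, query_vector), doc_id) for doc_id, vec in doc_vectors.items())
    return nlargest(k, scored)


def search_corpus(partitions, query_vector, k):
    """Item 4: each subset (one per machine) reports its k best; merge them."""
    candidates = []
    for part in partitions:  # in practice each subset would run on its own machine
        candidates.extend(search_partition(part, query_vector, k))
    return [doc_id for _, doc_id in nlargest(k, candidates)]


def filtered_search(doc_vectors, keyword_matches, query_vector, k):
    """Item 3: rank keyword-matching documents first, then fall back to the rest."""
    matched = {d: v for d, v in doc_vectors.items() if d in keyword_matches}
    rest = {d: v for d, v in doc_vectors.items() if d not in keyword_matches}
    result = [d for _, d in search_partition(matched, query_vector, k)]
    if len(result) < k:  # all matching documents retrieved: revert to context vectors
        result += [d for _, d in search_partition(rest, query_vector, k - len(result))]
    return result


# Hypothetical toy data: unit-length document vectors and a unit-length query.
docs = {
    "d1": (1.0, 0.0, 0.0),
    "d2": (0.6, 0.8, 0.0),
    "d3": (0.0, 1.0, 0.0),
    "d4": (0.0, 0.0, 1.0),
}
query = (0.8, 0.6, 0.0)

# Splitting the corpus across two "machines" gives the same ranking as one pass.
halves = [{"d1": docs["d1"], "d2": docs["d2"]}, {"d3": docs["d3"], "d4": docs["d4"]}]
print(search_corpus(halves, query, 2))
print(filtered_search(docs, {"d3"}, query, 2))
```

Because every subset returns its own k best candidates, the merged result is the same as a single exhaustive pass over the corpus, which reflects the statement that the data organization affects speed but not retrieval order.
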
Salton reports significantly improved performance (10-50%) with other term weighting methods, and we are in the process of running the corresponding tests with MatchPlus.

These experiments used fully automated bootstrapping with no hand entry of context vectors. Experiments on other corpora with a hand-entered set of core stems show 3% to 15% improvement, with larger improvement on smaller corpora. Bootstrapping for the tests in figure 3 was on the target corpus only, with maximum size being 3200 documents.

               CISI      CACM      MED
   MatchPlus   [.1749 in CISI or CACM; other value illegible]   .5013
   SMART       .1410     .2535     .5062

   Notes:
   - MatchPlus: $Match(3) filter
   - SMART figures from Salton [5].
   - Comparisons use classical idf term weighting for both systems.

Figure 3: MatchPlus results are comparable to SMART system results on traditional IR corpora using a corresponding term weighting method.

We have found a significant advantage to bootstrapping with larger corpora, as shown in figure 4. We also plan experiments where bootstrapping begins with stem context vectors generated from a larger corpus.

   Bootstrap corpus size (# docs)   50,000   200,000   500,000
   Performance                      36.5     39.0      39.9
   Improvement                               7%        9%

   Notes:
   - Performance was average relevant for 200 retrievals using Tipster corpus.

Figure 4: Improvement with size of bootstrap corpus.

Experiments with the Tipster/TREC corpus are in progress.

4 Concluding Comments

The key feature of MatchPlus is its uniform representation of all objects by context vectors. This makes possible a large number of interesting experiments for the next year, such as

* use of neural network learning algorithms to perform fast automated query modification based upon user feedback
* word sense disambiguation as described in [2]