SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
TIPSTER Panel -- HNC's MatchPlus System
chapter
S. Gallant
R. Hecht-Nielson
W. Caid
K. Qing
J. Carleton
D. Sudbeck
National Institute of Standards and Technology
Donna K. Harman
2. Because document context vectors are normalized,
we may simply find the document d that maximizes
the dot product with the query context vector $V_Q$:
$$\max_d \{ V_d \cdot V_Q \}$$
3. It is easy to combine keyword match with context
vectors. We first use the match as a filter for
documents and return the matching documents in order
of closeness to the query context vector. If all
matching documents have been retrieved, MatchPlus can
revert to context vectors alone to find the closest
remaining documents.
4. MatchPlus requires only about 300 multiplications
and additions to search a document. Moreover, it is
easy to decompose the search over a corpus of
documents across either parallel hardware or, less
expensively, several networked conventional machines
(or chips). Each machine can search a subset of the
document context vectors and return the closest
distances and document numbers in its subset. The
closest of the distances returned by all the
processors then determine the documents chosen for
retrieval.
We are also investigating a cluster tree prun-
ing procedure that finds nearest neighbor docu-
ment context vectors without having to compute
dot products for all document context vectors.
This data organization affects retrieval speed,
but does not change the order in which docu-
ments are retrieved.
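The three points above can be sketched in a few lines of Python. This is a minimal illustration, not the MatchPlus implementation: the function names, the random test vectors, and the choice of 300 dimensions (suggested by the "about 300 multiplications and additions" figure) are all assumptions, and the cluster tree pruning procedure is not shown.

```python
import heapq
import math
import random

def normalize(v):
    """Scale a vector to unit length so dot products compare directly."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """One multiplication and one addition per vector component."""
    return sum(x * y for x, y in zip(a, b))

def rank_by_context(doc_vecs, query_vec, candidates):
    """Order candidate document ids by decreasing dot product with the query."""
    scored = [(dot(doc_vecs[d], query_vec), d) for d in candidates]
    return sorted(scored, key=lambda s: -s[0])

def retrieve(doc_vecs, query_vec, keyword_hits, k):
    """Keyword matches come first, ranked by context vector; once they are
    exhausted, fall back to the closest remaining documents."""
    ranked = rank_by_context(doc_vecs, query_vec, sorted(keyword_hits))
    if len(ranked) < k:
        rest = [d for d in range(len(doc_vecs)) if d not in keyword_hits]
        ranked += rank_by_context(doc_vecs, query_vec, rest)
    return ranked[:k]

def sharded_top_k(doc_vecs, query_vec, n_shards, k):
    """Each shard reports its k best (score, id) pairs; the merge keeps the
    best k overall."""
    ids = list(range(len(doc_vecs)))
    shards = [ids[i::n_shards] for i in range(n_shards)]
    partial = []
    for shard in shards:  # each iteration stands in for one machine
        partial += rank_by_context(doc_vecs, query_vec, shard)[:k]
    return heapq.nlargest(k, partial)

# Hypothetical data: 200 random unit vectors of assumed dimension 300.
random.seed(0)
DIM = 300
doc_vecs = [normalize([random.gauss(0, 1) for _ in range(DIM)])
            for _ in range(200)]
query_vec = normalize([random.gauss(0, 1) for _ in range(DIM)])

top = retrieve(doc_vecs, query_vec, keyword_hits={3, 17, 42}, k=5)
best = sharded_top_k(doc_vecs, query_vec, n_shards=4, k=5)
```

Because every shard returns its own k best candidates, the merged result matches a search over the whole corpus, which is consistent with the remark that data organization affects speed but not retrieval order.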
3 Preliminary Results
Our system is very `young'. It has been able to handle
large corpora (1,000,000+ documents) only since July
1992. Nevertheless we have some promising results.
In figure 3, we see that MatchPlus gives comparable
performance to Salton's SMART system [5] on small,
traditional IR test corpora when corresponding term
weighting schemes are used. Salton reports significantly
improved performance (10-50%) with other
term weighting methods, and we are in the process of
running the corresponding tests with MatchPlus.
These experiments used fully automated boot-
strapping with no hand entry of context vectors. Ex-
periments on other corpora with a hand-entered set of
core stems show 3% to 15% improvement, with larger
improvement on smaller corpora.
Bootstrapping for the tests in figure 3 was on the
target corpus only, with maximum size being 3200
documents. We have found a significant advantage
to bootstrapping with larger corpora, as shown in
figure 4. We also plan experiments where bootstrapping
begins with stem context vectors generated from
a larger corpus.
            CISI    CACM    MED
MatchPlus   [OCRerr] .1749 T .5013
SMART       .1410   .2535   .5062
Notes:
- MatchPlus: $Match(3) filter
- SMART figures from Salton [5].
- Comparisons use classical idf term weighting
for both systems.
Figure 3: MatchPlus results are comparable to
SMART system results on traditional IR corpora using
a corresponding term weighting method.
Bootstrap corpus size (# docs)   50,000   200,000   500,000
Performance                        36.5      39.0      39.9
Improvement                           -        7%        9%
Notes:
- Performance is the average number of relevant
documents for 200 retrievals using the Tipster corpus.
Figure 4: Improvement with size of bootstrap corpus.
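The improvement figures in figure 4 appear to be relative gains over the 50,000-document run. A quick check under that assumption (the variable names are illustrative only):

```python
# Performance values from figure 4, keyed by bootstrap corpus size (# docs).
performance = {50_000: 36.5, 200_000: 39.0, 500_000: 39.9}

# Assumption: improvement is measured relative to the smallest corpus.
baseline = performance[50_000]
gains = {
    size: round(100 * (perf / baseline - 1))
    for size, perf in performance.items()
    if size != 50_000
}
print(gains)
```

The rounded gains agree with the 7% and 9% reported in the figure.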
Experiments with the Tipster/TREC corpus are in
progress.
4 Concluding Comments
The key feature of MatchPlus is its uniform represen-
tation of all objects by context vectors. This makes
possible a large number of interesting experiments for
the next year, such as
* use of neural network learning algorithms to per-
form fast automated query modification based
upon user feedback
* word sense disambiguation as described in [2]