NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Recent Developments in Natural Language Text Retrieval
T. Strzalkowski
J. Carballo
National Institute of Standards and Technology
D. K. Harman
potentially lost, and strong associations could be produced where there weren't any. A way to improve things is to consider different syntactic relations independently, perhaps as independent sources of evidence that could lend support (or not) to certain term similarity predictions. We have started investigating this option during TREC-2; however, it has not been sufficiently tested yet.
One difficulty in obtaining head-modifier pairs of the highest accuracy is the notorious ambiguity of nominal compounds. For example, the phrase natural language processing should generate language+natural and processing+language, while dynamic information processing is expected to yield processing+dynamic and processing+information. Still another case is executive vice president, where the association president+executive may be stretching things a bit too far. Since our parser has no knowledge about the text domain, and uses no semantic preferences, it does not attempt to guess any internal associations within such phrases. Instead, this task is passed to the pair extractor module, which processes ambiguous parse structures in two phases. In phase one, all and only unambiguous head-modifier pairs are extracted, and the frequencies of their occurrences are recorded. In phase two, frequency information about pairs generated in the first pass is used to form associations from ambiguous structures. For example, if language+natural has occurred unambiguously a number of times in contexts such as parser for natural language, while processing+natural has occurred significantly fewer times or perhaps not at all, then we will prefer the former association as valid. In TREC-2, phrase disambiguation was not used; instead, we decided to avoid ambiguous phrases altogether. While our disambiguation program generally worked satisfactorily, we could resolve only a small fraction of cases (about 7%), and thus its impact on the overall system's performance was limited. However, query-level disambiguation may be more important.
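To make the two-phase procedure concrete, the following is a minimal Python sketch; the input representation (tagged tuples for unambiguous and ambiguous parses) and the function name are illustrative assumptions, not the actual interface of our pair extractor module.

    from collections import Counter

    def extract_pairs(parsed_phrases):
        """Two-phase head-modifier pair extraction (sketch).

        Each parsed phrase is either ('pair', head, modifier) when the
        parse is unambiguous, or ('ambig', candidates) where candidates
        is a list of possible (head, modifier) associations.
        """
        unambiguous = Counter()
        ambiguous = []

        # Phase one: record all and only unambiguous head-modifier
        # pairs together with their occurrence frequencies.
        for phrase in parsed_phrases:
            if phrase[0] == 'pair':
                unambiguous[(phrase[1], phrase[2])] += 1
            else:
                ambiguous.append(phrase[1])

        # Phase two: resolve each ambiguous structure by preferring the
        # candidate pair most often attested unambiguously in phase one.
        resolved = []
        for candidates in ambiguous:
            best = max(candidates, key=lambda pair: unambiguous[pair])
            if unambiguous[best] > 0:
                resolved.append(best)
            # if no candidate is attested, the phrase stays unresolved
        return unambiguous, resolved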
TERM CORRELATIONS FROM TEXT
Head-modifier pairs form compound terms used in database indexing. They also serve as occurrence contexts for smaller terms, including single-word terms. If two terms tend to be modified by a number of common modifiers and otherwise appear in few distinct contexts, we assign them a similarity coefficient, a real number between 0 and 1. The similarity is determined by comparing distribution characteristics for both terms within the corpus: how much information content do they carry, does their information contribution vary greatly over contexts, and are the common contexts in which these terms occur specific enough? In general, we will credit high-content terms appearing in identical contexts, especially
if these contexts are not too commonplace.[4]
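Before giving the formula, it helps to fix a data structure. The sketch below builds per-term context profiles from the extracted pairs; the dictionary layout and the treatment of head and modifier positions as distinct context types are our assumptions.

    from collections import defaultdict

    def context_profiles(pairs):
        """Map each term x to its contexts att with frequencies f_[x,att].

        A modifier is a context of its head and vice versa, so both
        positions of each [head, modifier] pair are indexed.
        """
        profile = defaultdict(lambda: defaultdict(int))
        for head, modifier in pairs:
            profile[head][('mod', modifier)] += 1
            profile[modifier][('head', head)] += 1
        return profile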
For TREC-2 runs we used a similarity formula which, unlike the similarity formula used in TREC-1, produces clusters of related words and phrases, but will not generate a uniform term similarity ranking across clusters. This new formula, however, appeared better suited to handle the diverse subject matter of the WSJ database. We used a (revised) variant of the weighted Tanimoto measure described in (Grefenstette, 1992):
SIM(x_1,x_2) = \frac{\sum_{att} \min\big(W([x_1,att]),\, W([x_2,att])\big)}{\sum_{att} \max\big(W([x_1,att]),\, W([x_2,att])\big)}

with

W([x,y]) = GEW(x) \cdot \log f_{x,y}

GEW(x) = 1 + \sum_{y} \left[ \frac{f_{x,y}}{n_x} \cdot \frac{\log(f_{x,y}/n_x)}{\log N} \right]
In the above, f_{x,y} stands for the absolute frequency of pair [x,y], n_x is the frequency of term x, and N is the number of single-word terms. Sample clusters obtained from an approx. 250 MByte (42 million words) subset of WSJ (years 1990-1992) are given in Table 1.
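A direct Python rendering of these definitions, using the context profiles sketched earlier, might look as follows; zeroing the weight of pairs seen at most once is our simplification to keep log f_{x,y} non-negative.

    import math

    def gew(x, profile, N):
        # Global entropy weight of term x over its contexts;
        # n_x is the total pair frequency of x.
        n_x = sum(profile[x].values())
        return 1.0 + sum((f / n_x) * math.log(f / n_x) / math.log(N)
                         for f in profile[x].values())

    def weight(x, att, profile, N):
        # W([x,att]) = GEW(x) * log f_[x,att]; pairs seen 0 or 1 times
        # contribute nothing (a simplifying assumption).
        f = profile[x].get(att, 0)
        return gew(x, profile, N) * math.log(f) if f > 1 else 0.0

    def sim(x1, x2, profile, N):
        # Weighted Tanimoto: sum of minimum weights over sum of maximum
        # weights, taken over the union of both terms' contexts.
        atts = set(profile[x1]) | set(profile[x2])
        num = sum(min(weight(x1, a, profile, N),
                      weight(x2, a, profile, N)) for a in atts)
        den = sum(max(weight(x1, a, profile, N),
                      weight(x2, a, profile, N)) for a in atts)
        return num / den if den > 0 else 0.0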
In order to generate better similarities, we require that words x1 and x2 appear in at least M distinct common contexts, where a common context is a couple of pairs [x1,y] and [x2,y], or [y,x1] and [y,x2], such that each occurred at least three times. Thus, banana and Baltic will not be considered for a similarity relation on the basis of their occurrences in the common context of republic, no matter how frequent, unless there is another such common context that is comparably frequent (there wasn't any in TREC's WSJ database). For smaller or narrow-domain databases M=2 is usually sufficient. For large databases covering a rather diverse subject matter, like WSJ or SJMN (San Jose Mercury News), we used M=5.[5]
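This requirement can be expressed as a simple filter over the same assumed profiles; the three-occurrence threshold and the parameter M come from the text.

    def common_contexts(x1, x2, profile, min_freq=3):
        # Contexts shared by x1 and x2, each attested at least
        # min_freq times with both terms.
        return [att for att in set(profile[x1]) & set(profile[x2])
                if profile[x1][att] >= min_freq
                and profile[x2][att] >= min_freq]

    def eligible_for_similarity(x1, x2, profile, M=5):
        # Only term pairs sharing at least M such contexts are scored.
        return len(common_contexts(x1, x2, profile)) >= M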
Even this filter, however, turned out not to be sufficient. We would still generate fairly strong similarity links between terms such as aerospace and pharmaceutical, where 6 or more common contexts were found. In the example at hand the following common contexts were located, all occurring at the head (left) position of a pair (at right are their global entropy weights and frequencies with aerospace and
[4] It would not be appropriate to predict similarity between language and logarithm on the basis of their co-occurrence with natural.
[5] For example, banana and Dominican were found to have two common contexts: republic and plant, although this second occurred in apparently different senses in Dominican plant and banana plant.