SP500207
NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Natural Language Processing in Large-Scale Text Retrieval Tasks
chapter
T. Strzalkowski
National Institute of Standards and Technology
Donna K. Harman
not be considered for similarity relation on the basis
of their occurrences in the common context of republic,
no matter how frequent, unless there is another
such common context comparably frequent (there
wasn't any in the TREC WSJ database). 16
It may be worth pointing out that the similarities
are calculated using term co-occurrences in syntactic
rather than in document-size contexts, the latter
being the usual practice in non-linguistic clustering
(e.g., Sparck Jones and Barber, 1971; Crouch, 1988;
Lewis and Croft, 1990). Although the two methods of
term clustering may be considered mutually complementary
in certain situations, we believe that more
and stronger associations can be obtained through
syntactic-context clustering, given a sufficient amount
of data and a reasonably accurate syntactic parser. 17
QUERY EXPANSION
Similarity relations are used to expand user
queries with new terms, in an attempt to make the
final search query more comprehensive (adding
synonyms) and/or more pointed (adding specializations). 18
It follows that not all similarity relations will
be equally useful in query expansion; for instance,
complementary and antonymous relations like the
one between australian and canadian, or accept and
reject, may actually harm the system's performance,
since we may end up retrieving many irrelevant
documents. Similarly, the effectiveness of a query
containing vitamin is likely to diminish if we add a
similar but far more general term such as acid. On
the other hand, database search is likely to miss
relevant documents if we overlook the fact that fortran
is a programming language, or that infant is a
baby and baby is a child. We noted that an average
set of similarities generated from a text corpus contains
about as many "good" relations (synonymy,
specialization) as "bad" relations (antonymy,
16 However, SIM(banana,dominican) was generated since
two independent contexts were indeed found: republic and plant,
even though the word plant apparently occurs in different senses in
banana plant and dominican plant.
17 Non-syntactic contexts cross sentence boundaries with no
fuss, which is helpful with short, succinct documents (such as
CACM abstracts), but less so with longer texts; see also (Grishman
et al., 1986).
18 Query expansion (in the sense considered here, though not
quite in the same way) has been used in information retrieval
research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988),
usually with mixed results. An alternative is to use term clusters to
create new terms, "metaterms", and use them to index the database
instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that
the query expansion approach gives the system more flexibility, for
instance, by making room for hypertext-style topic exploration via
user feedback.
complementation, generalization), as seen from the
query expansion viewpoint. Therefore any attempt to
separate these two classes and to increase the propor-
tion of "good" relations should result in improved
retrieval. This has indeed been confirmed in our earlier
experiments where a relatively crude filter has
visibly increased retrieval precision.
In order to create an appropriate filter, we
expanded the IC function into a global specificity
measure called the cumulative informational contribution
function (ICW). ICW is calculated for each
term across all contexts in which it occurs. The general
philosophy here is that a more specific
word/phrase would have a more limited use, i.e., a
more specific term would appear in fewer distinct
contexts. In this respect, ICW is similar to the standard
inverted document frequency (idf) measure
except that term frequency is measured over syntactic
units rather than document size units. 19 Terms with
higher ICW values are generally considered more
specific, but the specificity comparison is only meaningful
for terms which are already known to be similar.
The new function is calculated according to the
following formula:
$$
ICW(w) =
\begin{cases}
ICL(w) \cdot ICR(w) & \text{if both exist} \\
ICR(w) & \text{if only } ICR(w) \text{ exists} \\
0 & \text{otherwise}
\end{cases}
$$
where (with $n_w, d_w > 0$):
$$
ICL(w) = IC([w,\_]) = \frac{n_w}{d_w (n_w + d_w - 1)}
$$
$$
ICR(w) = IC([\_,w]) = \frac{n_w}{d_w (n_w + d_w - 1)}
$$
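As a rough illustration, ICW can be computed from a table of syntactic pair counts. This is a minimal sketch, not the system's actual code. It assumes that each side contribution has the form n_w / (d_w (n_w + d_w - 1)), that n_w is the total number of pair occurrences with w in the given position, and that d_w is the number of distinct such pairs. The function names and the sample counts are hypothetical.

```python
from collections import Counter

def ic_side(pairs, word, position):
    """Side contribution n_w / (d_w * (n_w + d_w - 1)) for `word`.

    pairs: Counter mapping (left, right) -> frequency (assumed layout).
    position: 0 for ICL (word on the left), 1 for ICR (word on the right).
    Returns None when the word never occurs at that position.
    """
    freqs = [f for pair, f in pairs.items() if pair[position] == word and f > 0]
    if not freqs:
        return None
    n_w = sum(freqs)  # assumed: total occurrences of word at this position
    d_w = len(freqs)  # assumed: number of distinct pairs at this position
    return n_w / (d_w * (n_w + d_w - 1))

def icw(pairs, word):
    """Cumulative informational contribution, following the three cases."""
    icl = ic_side(pairs, word, 0)
    icr = ic_side(pairs, word, 1)
    if icl is not None and icr is not None:
        return icl * icr
    if icr is not None:
        return icr
    return 0.0

# Hypothetical pair counts, for illustration only.
pairs = Counter({
    ("vitamin", "c"): 3,
    ("vitamin", "deficiency"): 1,
    ("take", "vitamin"): 2,
})
vitamin_icw = icw(pairs, "vitamin")
```

Under these assumptions a term spread over many distinct pairs gets a smaller value, which matches the reading that higher ICW means a more specific term.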
For any two terms w1 and w2, and constants δ1 > 1,
δ2 > 1, the following situations were considered. If
ICW(w2) ≥ δ1 * ICW(w1) then w2 is considered
more specific than w1. If ICW(w2) < δ2 * ICW(w1)
and ICW(w2) > ICW(w1) then w2 is considered
synonymous with w1. 20 In addition, if
SIM_norm(w1,w2) = σ > θ, where θ is an empirically
established threshold, then w2 can be added to the
query containing term w1 with weight σ. 21 The
19 We believe that measuring term specificity over
document-size contexts (e.g., Sparck Jones, 1972) may not be
appropriate in this case. In particular, syntax-based contexts allow for
processing texts without any internal document structure.
20 In TREC runs we used δ1 = 10 and δ2 = 3.
21 For the CACM-3204 collection the filter was most effective at
θ = 0.57. For TREC we changed the similarity formula slightly in
order to obtain correct normalizations in all cases. This however
lowered similarity coefficients in general and a new threshold had
to be selected. We used θ = 0.1 in TREC runs.
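The comparison rules for the query expansion filter can be sketched as follows. This is a hedged illustration only: the function names, the dictionary layout and the sample values are hypothetical. The parameters delta1, delta2 and theta stand for the two specificity constants and the similarity threshold, with defaults set to the values reported for the TREC runs. The sketch also assumes that a term is added only when it is both classed as more specific or synonymous and its similarity exceeds the threshold, which is one reading of the text.

```python
def classify(icw1, icw2, delta1=10.0, delta2=3.0):
    """Relation of w2 to w1 from their ICW values (assumed positive)."""
    if icw2 >= delta1 * icw1:
        return "more_specific"
    if icw1 < icw2 < delta2 * icw1:
        return "synonymous"
    # Everything else (more general terms, or the range between the two
    # constants) is left unclassified in this sketch.
    return "other"

def expand(term, candidates, icw, sim, theta=0.1, delta1=10.0, delta2=3.0):
    """Return {new_term: weight} for candidates that pass the filter.

    icw: dict term -> ICW value (hypothetical layout).
    sim: dict (w1, w2) -> normalized similarity (hypothetical layout).
    """
    added = {}
    for w2 in candidates:
        sigma = sim.get((term, w2), 0.0)
        relation = classify(icw[term], icw[w2], delta1, delta2)
        if relation != "other" and sigma > theta:
            added[w2] = sigma  # the similarity value becomes the term weight
    return added

# Hypothetical values, for illustration only.
icw_values = {"language": 0.01, "fortran": 0.2, "tongue": 0.02, "word": 0.005}
sim_values = {
    ("language", "fortran"): 0.3,
    ("language", "tongue"): 0.05,
    ("language", "word"): 0.4,
}
expansion = expand("language", ["fortran", "tongue", "word"],
                   icw_values, sim_values)
```

Here fortran passes both tests, tongue is classed as synonymous but falls below the similarity threshold, and word is rejected as more general despite its high similarity.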