SP500215
NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
On Expanding Query Vectors with Lexically Related Words
chapter
E. Voorhees
National Institute of Standards and Technology
D. K. Harman
synsets into a set of approximately ten hierarchies (see footnote 1).
Figure 1 shows a piece of WordNet. The figure con-
tains all the ancestors and descendents as defined by
the is-a relation for the six senses of the noun swing.
Also shown is that one of the senses, a child's toy, is
part-of a playground.
Given a synset, there is a wide choice of words to
add to a query vector: one can add only the synonyms
within the synset, or all descendents in the is-a
hierarchy, or all words in synsets one link away from
the original synset regardless of link type, etc. One of
the goals of this work is to discover which such strate-
gies are effective. Wang et al. found expanding vectors
from relational thesauri to be effective [6], but based
those conclusions on experiments performed on one
small collection. Experiments we performed as part of
our TREC-1 work showed serious degradation
when anything other than synonyms was used in the
expansion, but the TREC-1 results were dominated
by the problem of finding good synsets to expand. This
work examines the effectiveness of the different relation
types assuming good synsets are used as the basis.
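
The three expansion strategies named above can be sketched on a toy
graph. This is a minimal illustration, not the code used in the
experiments: the SYNSETS table, its identifiers, the "part" link to
barrel, and the three function names are all assumptions made for the
example. Only the chain gun, {firearm, piece, small-arm}, rifle is
taken from the text. Each function is assumed to include the starting
synset's own synonyms.

```python
from collections import deque

# Toy synset graph. Identifiers and the "part" link are hypothetical;
# this is not a copy of WordNet.
SYNSETS = {
    "gun": {"words": ["gun"],
            "links": {"hyponym": ["firearm"], "part": ["barrel"]}},
    "firearm": {"words": ["firearm", "piece", "small-arm"],
                "links": {"hyponym": ["rifle"], "hypernym": ["gun"]}},
    "rifle": {"words": ["rifle"], "links": {"hypernym": ["firearm"]}},
    "barrel": {"words": ["barrel", "gun barrel"], "links": {}},
}


def synonyms_only(synset_id):
    """Strategy 1: add only the synonyms within the synset."""
    return set(SYNSETS[synset_id]["words"])


def all_descendants(synset_id):
    """Strategy 2: add every synset below the start in the is-a hierarchy."""
    words, seen, queue = set(), {synset_id}, deque([synset_id])
    while queue:
        current = queue.popleft()
        words.update(SYNSETS[current]["words"])
        for child in SYNSETS[current]["links"].get("hyponym", []):
            if child not in seen:  # a synset may have more than one parent
                seen.add(child)
                queue.append(child)
    return words


def one_link_away(synset_id):
    """Strategy 3: add all synsets one link away, regardless of link type."""
    words = set(SYNSETS[synset_id]["words"])
    for targets in SYNSETS[synset_id]["links"].values():
        for target in targets:
            words.update(SYNSETS[target]["words"])
    return words
```

On this toy graph, all_descendants("gun") reaches rifle through the
intervening firearm synset, while one_link_away("gun") picks up the
hypothetical part link to barrel but stops short of rifle.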
Siemens' official TREC-2 runs consist of one rout-
ing run (topics 51-100 against the documents on disk
three) and two ad hoc runs (topics 101-150 against the
documents on the first two disks). All of the runs are
manual since the input text of the topics was modified
by hand. There are two types of modifications: parts
of the topic statement that explicitly list things that
are not relevant were removed, and synsets containing
nouns germane to the topic statement were added as a
new section of the topic text. Document text was in-
dexed completely automatically (once the errors were
fixed; see footnote 2) using the standard SMART indexing routines [1]
(i.e., tokenization, stop word removal, and stemming).
In general, only the "text" fields of the documents were
indexed. For example, only the title, abstract, detailed
claims, claims, and design claims sections were indexed
for the patent subcollection. The manually assigned
keywords included in some of the Ziff documents were
not used, nor were the photograph captions of the San
Jose Mercury collection.
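
The three indexing steps named above (tokenization, stop word removal,
and stemming) can be sketched as follows. This is a rough stand-in and
not the SMART routines themselves: the stop list, the suffix-stripping
rule, and the function names are assumptions made for the example, and
the output is a plain term-frequency count rather than a weighted
vector.

```python
import re
from collections import Counter

# Hypothetical stop list; the real SMART stop list is much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to", "in", "is", "are", "for"}

# Crude suffix list standing in for a real stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_text(text):
    """Return stem frequencies after stop word removal."""
    stems = (crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS)
    return Counter(stems)
```

Under these assumptions, indexing "The claims and the design claims of
the patent" yields the stem claim twice and the stems design and patent
once each.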
The goal in selecting synsets to be included in a topic
statement was to pick synsets that emphasized impor-
tant concepts of the topic. One aspect of the prob-
lem is sense resolution: selecting the synset that con-
tains the correct sense of an ambiguous original topic
word. However, since one purpose of the experiments
[Footnote 1: The actual structure is not quite a hierarchy since a few
synsets have more than one parent.]
[Footnote 2: There were seven errors in total in files patni)14 and patn[OCRerr]51
that were not on the official list but caused the files not to
conform to the patent collection's readme file. These errors (missing
`/TEXT' tags, `TEXT' tags preceding `OREF' tags, and the like) were
also fixed manually.]
is to investigate how effective lexical relations are in
expanding queries assuming good starting concepts, I
did not restrict myself to adding only synsets that con-
tain some original topic word. For example, topic 93
asks for information about the support of the NRA
and never mentions the word gun. Nevertheless, I be-
lieved gun to be an important concept of the topic and
added the synset containing gun meaning "a weapon
that discharges a missile from a metal tube or barrel"
to the topic. (Rifle, a word that does appear in the
topic statement, is a grandchild of this synset in WordNet,
with the intervening synset being {firearm, piece,
small-arm}.) Synset selection was also influenced by
the fact that these synsets would be used to expand
the query. Early experiments demonstrated that ex-
pansion worked poorly when synsets with very many
children in the is-a hierarchy (e.g., country) were used,
so those synsets were avoided. Furthermore, when se-
lecting one sense among the different senses in WordNet
was difficult, I frequently used the words related to the
synsets as a way of making a decision. Figure 2 shows
the original text of topic 93 and the synsets that were
added to it.
Some topics contained important concepts that had
no corresponding synset. Occasionally, the missing
synset was a gap in WordNet; for example, toxic waste,
genetic engineering, and sanctions meaning economic
disciplinary measures are not in version 1.3 of WordNet.
More often, the important concept was a proper noun
or highly technical term that one wouldn't expect to be
in WordNet. NRA or National Rifle Association, for
example, is an important concept for topic 93 but does
not occur in WordNet. Nothing was added to the topic
texts for concepts that lacked corresponding synsets in
these experiments, although making some provision for
them would improve retrieval performance.
Once the text of the topics is annotated with synsets,
the remainder of the processing is automatic. Selected
fields of the topic statements (the title, nationality, nar-
rative, factors, description, and concept fields) are in-
dexed using the standard SMART routines. The terms
derived from these sections are "original query terms."
The expansion procedure is invoked when the synonym
set section is reached. The procedure is controlled by
a set of parameters that specifies for each relation type
included in WordNet the maximum length of a chain
of that type of link that may be followed. A chain
begins at each synset listed in the synset section of
the topic text and may contain only links of a sin-
gle type. All synonyms contained within a synset of
the chain are added to the query. Collocations such
as change_of_location in Figure 1 are broken into their
component words, stop words such as of are removed,
and the remaining words are stemmed. The word stems