NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)

On Expanding Query Vectors with Lexically Related Words
E. Voorhees
National Institute of Standards and Technology; D. K. Harman

synsets into a set of approximately ten hierarchies.¹ Figure 1 shows a piece of WordNet. The figure contains all the ancestors and descendants as defined by the is-a relation for the six senses of the noun swing. Also shown is that one of the senses, a child's toy, is part-of a playground.

Given a synset, there is a wide choice of words to add to a query vector: one can add only the synonyms within the synset, or all descendants in the is-a hierarchy, or all words in synsets one link away from the original synset regardless of link type, etc. One of the goals of this work is to discover which such strategies are effective. Wang et al. found expanding vectors from relational thesauri to be effective [6], but based those conclusions on experiments performed on one small collection. Experiments we performed as part of our TREC-1 work showed serious degradation when anything other than synonyms was used in the expansion, but the TREC-1 results were dominated by the problem of finding good synsets to expand. This work examines the effectiveness of the different relation types assuming good synsets are used as the basis.

Siemens' official TREC-2 runs consist of one routing run (topics 51-100 against the documents on disk three) and two ad hoc runs (topics 101-150 against the documents on the first two disks). All of the runs are manual since the input text of the topics was modified by hand. There are two types of modifications: parts of the topic statement that explicitly list things that are not relevant were removed, and synsets containing nouns germane to the topic statement were added as a new section of the topic text.
Document text was indexed completely automatically (once the errors were fixed²) using the standard SMART indexing routines [1] (i.e., tokenization, stop word removal, and stemming). In general, only the "text" fields of the documents were indexed. For example, only the title, abstract, detailed claims, claims, and design claims sections were indexed for the patent subcollection. The manually assigned keywords included in some of the Ziff documents were not used, nor were the photograph captions of the San Jose Mercury collection.

¹ The actual structure is not quite a hierarchy since a few synsets have more than one parent.
² There were seven errors total in files patni)14 and patn[OCRerr]51 that were not on the official list, but caused the files to not conform to the patent collection's readme file. These errors (missing `/TEXT' tags, `TEXT' tags preceding `OREF' tags, and the like) were also fixed manually.

The goal in selecting synsets to be included in a topic statement was to pick synsets that emphasized important concepts of the topic. One aspect of the problem is sense resolution: selecting the synset that contains the correct sense of an ambiguous original topic word. However, since one purpose of the experiments is to investigate how effective lexical relations are in expanding queries assuming good starting concepts, I did not restrict myself to adding only synsets that contain some original topic word. For example, topic 93 asks for information about the support of the NRA and never mentions the word gun. Nevertheless, I believed gun to be an important concept of the topic and added the synset containing gun meaning "a weapon that discharges a missile from a metal tube or barrel" to the topic. (Rifle, a word that does appear in the topic statement, is a grandchild of this synset in WordNet, with the intervening synset being {firearm, piece, small-arm}).
Synset selection was also influenced by the fact that these synsets would be used to expand the query. Early experiments demonstrated that expansion worked poorly when synsets with very many children in the is-a hierarchy (e.g., country) were used, so those synsets were avoided. Furthermore, when selecting one sense among the different senses in WordNet was difficult, I frequently used the words related to the synsets as a way of making a decision. Figure 2 shows the original text of topic 93 and the synsets that were added to it.

Some topics contained important concepts that had no corresponding synset. Occasionally, the missing synset was a gap in WordNet; for example, toxic waste, genetic engineering, and sanctions meaning economic disciplinary measures are not in version 1.3 of WordNet. More often, the important concept was a proper noun or highly technical term that one wouldn't expect to be in WordNet. NRA or National Rifle Association, for example, is an important concept for topic 93 but does not occur in WordNet. Nothing was added to the topic texts for concepts that lacked corresponding synsets in these experiments, although making some provision for them would improve retrieval performance.

Once the text of the topics is annotated with synsets, the remainder of the processing is automatic. Selected fields of the topic statements (the title, nationality, narrative, factors, description, and concept fields) are indexed using the standard SMART routines. The terms derived from these sections are "original query terms." The expansion procedure is invoked when the synonym set section is reached. The procedure is controlled by a set of parameters that specifies, for each relation type included in WordNet, the maximum length of a chain of that type of link that may be followed. A chain begins at each synset listed in the synset section of the topic text and may contain only links of a single type.
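The chain-following rule just described can be illustrated with a minimal sketch. It is not the code used in these experiments: the toy graph, the synset identifiers, the relation names `hyponym` and `part_of`, and the function `follow_chains` are all hypothetical, and the sketch assumes that the starting synset counts as part of every chain.

```python
from collections import deque

# Hypothetical toy fragment of a WordNet-like graph. The synset ids and the
# relation names "hyponym" and "part_of" are illustrative labels only.
LINKS = {
    "gun": {"hyponym": ["firearm"]},
    "firearm": {"hyponym": ["rifle"]},
    "rifle": {},
    "swing": {"part_of": ["playground"]},
    "playground": {},
}


def follow_chains(start_synsets, max_chain_len):
    """Return the synsets reachable from the start synsets by chains that
    use links of a single relation type and are no longer than the
    per-relation limit given in max_chain_len (relation -> max links)."""
    reached = set(start_synsets)  # assumption: start synsets are included
    for relation, limit in max_chain_len.items():
        # Breadth-first walk restricted to one relation type at a time,
        # so no chain ever mixes link types.
        frontier = deque((s, 0) for s in start_synsets)
        seen = set(start_synsets)
        while frontier:
            synset, depth = frontier.popleft()
            if depth == limit:
                continue  # chain already at maximum length for this relation
            for nxt in LINKS.get(synset, {}).get(relation, []):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
        reached |= seen
    return reached


if __name__ == "__main__":
    # One is-a step down from "gun", no part-of links followed.
    print(sorted(follow_chains(["gun"], {"hyponym": 1, "part_of": 0})))
    # Two is-a steps reach the grandchild "rifle".
    print(sorted(follow_chains(["gun"], {"hyponym": 2, "part_of": 0})))
```

Setting a relation's limit to 0 disables that relation entirely, which is one simple way to compare relation types in isolation.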
All synonyms contained within a synset of the chain are added to the query. Collocations such as change_of_location in Figure 1 are broken into their component words, stop words such as of are removed, and the remaining words are stemmed. The word stems