SP500215. NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2).
Recent Developments in Natural Language Text Retrieval (chapter).
T. Strzalkowski, J. Carballo. National Institute of Standards and Technology, D. K. Harman (ed.).

potentially lost, and strong associations could be produced where there weren't any. A way to improve things is to consider different syntactic relations independently, perhaps as independent sources of evidence that could lend support (or not) to certain term similarity predictions. We have started investigating this option during TREC-2; however, it has not been sufficiently tested yet.

One difficulty in obtaining head-modifier pairs of the highest accuracy is the notorious ambiguity of nominal compounds. For example, the phrase natural language processing should generate language+natural and processing+language, while dynamic information processing is expected to yield processing+dynamic and processing+information. Still another case is executive vice president, where the association president+executive may be stretching things a bit too far. Since our parser has no knowledge about the text domain and uses no semantic preferences, it does not attempt to guess any internal associations within such phrases. Instead, this task is passed to the pair extractor module, which processes ambiguous parse structures in two phases. In phase one, all and only unambiguous head-modifier pairs are extracted, and the frequencies of their occurrences are recorded. In phase two, frequency information about pairs generated in the first pass is used to form associations from ambiguous structures. For example, if language+natural has occurred unambiguously a number of times in contexts such as parser for natural language, while processing+natural has occurred significantly fewer times or perhaps not at all, then we will prefer the former association as valid. In TREC-2, phrase disambiguation was not used; instead, we decided to avoid ambiguous phrases altogether. While our disambiguation program worked generally satisfactorily, we could resolve only a small fraction of cases (about 7%), and thus its impact on the overall system's performance was limited. However, query-level disambiguation may be more important.
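The two-phase scheme can be illustrated with a short sketch. The Python fragment below is our reconstruction, not the authors' pair extractor: it assumes noun compounds arrive as plain word lists, treats two-word compounds as unambiguous, and restricts the ambiguous case to three-word compounds (the names extract_pairs and parsed_phrases are ours).

    from collections import Counter

    def extract_pairs(parsed_phrases):
        # Phase one: extract all and only the unambiguous head+modifier
        # pairs and record the frequencies of their occurrences.
        unambiguous = Counter()
        for words in parsed_phrases:
            if len(words) == 2:
                modifier, head = words
                unambiguous[(head, modifier)] += 1   # e.g. language+natural

        # Phase two: use phase-one frequencies to form associations
        # from ambiguous three-word compounds w1 w2 w3.
        pairs = Counter(unambiguous)
        for words in parsed_phrases:
            if len(words) == 3:
                w1, w2, w3 = words
                pairs[(w3, w2)] += 1   # shared by both bracketings
                # [[w1 w2] w3] yields w2+w1; [w1 [w2 w3]] yields w3+w1.
                # Prefer whichever association was seen unambiguously
                # more often in phase one.
                if unambiguous[(w2, w1)] > unambiguous[(w3, w1)]:
                    pairs[(w2, w1)] += 1   # e.g. language+natural
                elif unambiguous[(w3, w1)] > unambiguous[(w2, w1)]:
                    pairs[(w3, w1)] += 1   # e.g. processing+dynamic
                # With no evidence either way, the ambiguous pair is
                # skipped, as was done for the TREC-2 runs.
        return pairs

Given several unambiguous occurrences of natural language alongside natural language processing, the phase-one count for language+natural outweighs processing+natural, so the former association is preferred, as in the parser for natural language example above.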
TERM CORRELATIONS FROM TEXT

Head-modifier pairs form compound terms used in database indexing. They also serve as occurrence contexts for smaller terms, including single-word terms. If two terms tend to be modified with a number of common modifiers and otherwise appear in few distinct contexts, we assign them a similarity coefficient, a real number between 0 and 1. The similarity is determined by comparing distribution characteristics for both terms within the corpus: how much information content do they carry, does their information contribution vary greatly over contexts, and are the common contexts in which these terms occur specific enough? In general we will credit high-content terms appearing in identical contexts, especially if these contexts are not too commonplace.[4] For TREC-2 runs we used a similarity formula which, unlike the similarity formula used in TREC-1, produces clusters of related words and phrases but will not generate a uniform term similarity ranking across clusters. This new formula, however, appeared better suited to handle the diverse subject matter of the WSJ database.

We used a (revised) variant of weighted Tanimoto's measure described in (Grefenstette, 1992):

\[ SIM(x_1,x_2) = \frac{\sum_{att} \min\big(W([x_1,att]),\, W([x_2,att])\big)}{\sum_{att} \max\big(W([x_1,att]),\, W([x_2,att])\big)} \]

with

\[ W([x,y]) = GEW(x) \cdot \log(f_{x,y}) \]

\[ GEW(x) = 1 + \sum_{y} \left( \frac{f_{x,y}}{n_y} \cdot \frac{\log(f_{x,y}/n_y)}{\log(N)} \right) \]

In the above, f_{x,y} stands for the absolute frequency of pair [x,y], n_y is the frequency of term y, and N is the number of single-word terms. (A short computational sketch of these formulas is given at the end of this section.) Sample clusters obtained from an approx. 250 MByte (42 million words) subset of WSJ (years 1990-1992) are given in Table 1.

In order to generate better similarities, we require that words x_1 and x_2 appear in at least M distinct common contexts, where a common context is a couple of pairs [x_1,y] and [x_2,y], or [y,x_1] and [y,x_2], such that each occurred at least three times. Thus, banana and Baltic will not be considered for a similarity relation on the basis of their occurrences in the common context of republic, no matter how frequent, unless there is another such common context comparably frequent (there wasn't any in TREC's WSJ database). For smaller or narrow-domain databases M=2 is usually sufficient. For large databases covering a rather diverse subject matter, like WSJ or SJMN (San Jose Mercury News), we used M=5.[5] This, however, turned out not to be sufficient. We would still generate fairly strong similarity links between terms such as aerospace and pharmaceutical, where 6 and more common contexts were found. In the example at hand the following common contexts were located, all occurring at the head (left) position of a pair (at right are their global entropy weights and frequencies with aerospace and

[4] It would not be appropriate to predict similarity between language and logarithm on the basis of their co-occurrence with natural.
[5] For example, banana and Dominican were found to have two common contexts: republic and plant, although this second occurred in apparently different senses in Dominican plant and banana plant.
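To make the similarity computation concrete, here is a minimal Python sketch of the formulas above, under assumptions of ours: pair frequencies sit in a dict f mapping (head, modifier) to counts, n_y maps each term to its frequency, N is the number of single-word terms, and the common-context test is simplified to head-position contexts only (the paper also admits the mirror case [y,x_1] and [y,x_2]). All function names here (gew, build_weights, common_contexts, sim) are hypothetical.

    import math

    def gew(x, f, n_y, N):
        # Global entropy weight of term x:
        # GEW(x) = 1 + sum_y ((f_xy / n_y) * log(f_xy / n_y)) / log(N)
        s = sum((fxy / n_y[y]) * math.log(fxy / n_y[y])
                for (head, y), fxy in f.items() if head == x)
        return 1.0 + s / math.log(N)

    def build_weights(f, n_y, N):
        # W([x,y]) = GEW(x) * log(f_xy) for every observed pair [x,y].
        heads = {x for (x, _) in f}
        g = {x: gew(x, f, n_y, N) for x in heads}
        return {(x, y): g[x] * math.log(fxy) for (x, y), fxy in f.items()}

    def common_contexts(x1, x2, f, min_freq=3):
        # Contexts y such that [x1,y] and [x2,y] each occurred at least
        # min_freq (three) times.
        ys1 = {y for (x, y), c in f.items() if x == x1 and c >= min_freq}
        ys2 = {y for (x, y), c in f.items() if x == x2 and c >= min_freq}
        return ys1 & ys2

    def sim(x1, x2, W, f, M=5):
        # Weighted Tanimoto similarity; zero unless x1 and x2 share at
        # least M distinct, sufficiently frequent common contexts.
        if len(common_contexts(x1, x2, f)) < M:
            return 0.0
        atts = {a for (t, a) in W if t in (x1, x2)}
        num = sum(min(W.get((x1, a), 0.0), W.get((x2, a), 0.0)) for a in atts)
        den = sum(max(W.get((x1, a), 0.0), W.get((x2, a), 0.0)) for a in atts)
        return num / den if den else 0.0

Note that with M=5, a pair like aerospace and pharmaceutical, sharing six or more frequent common contexts, still passes the filter, which is exactly the shortcoming discussed above.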