NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

Natural Language Processing in Large-Scale Text Retrieval Tasks (chapter), T. Strzalkowski. National Institute of Standards and Technology, Donna K. Harman.

not be considered for similarity relation on the basis of their occurrences in the common context of republic, no matter how frequent, unless there is another such common context comparably frequent (there wasn't any in the TREC WSJ database).[16]

It may be worth pointing out that the similarities are calculated using term co-occurrences in syntactic rather than in document-size contexts, the latter being the usual practice in non-linguistic clustering (e.g., Sparck Jones and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990). Although the two methods of term clustering may be considered mutually complementary in certain situations, we believe that more and stronger associations can be obtained through syntactic-context clustering, given a sufficient amount of data and a reasonably accurate syntactic parser.[17]

QUERY EXPANSION

Similarity relations are used to expand user queries with new terms, in an attempt to make the final search query more comprehensive (adding synonyms) and/or more pointed (adding specializations).[18] It follows that not all similarity relations will be equally useful in query expansion. For instance, complementary and antonymous relations like the one between australian and canadian, or accept and reject, may actually harm the system's performance, since we may end up retrieving many irrelevant documents. Similarly, the effectiveness of a query containing vitamin is likely to diminish if we add a similar but far more general term such as acid. On the other hand, database search is likely to miss relevant documents if we overlook the fact that fortran is a programming language, or that infant is a baby and baby is a child.
We noted that an average set of similarities generated from a text corpus contains about as many "good" relations (synonymy, specialization) as "bad" relations (antonymy, complementation, generalization), as seen from the query expansion viewpoint. Therefore any attempt to separate these two classes and to increase the proportion of "good" relations should result in improved retrieval. This has indeed been confirmed in our earlier experiments, where a relatively crude filter visibly increased retrieval precision.

[16] However, SIM(banana,dominican) was generated since two independent contexts were indeed found: republic and plant, even though the word plant apparently occurs in different senses in banana plant and dominican plant.

[17] Non-syntactic contexts cross sentence boundaries with no fuss, which is helpful with short, succinct documents (such as CACM abstracts), but less so with longer texts; see also (Grishman et al., 1986).

[18] Query expansion (in the sense considered here, though not quite in the same way) has been used in information retrieval research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed results. An alternative is to use term clusters to create new terms, "metaterms", and use them to index the database instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that the query expansion approach gives the system more flexibility, for instance, by making room for hypertext-style topic exploration via user feedback.

In order to create an appropriate filter, we expanded the IC function into a global specificity measure called the cumulative informational contribution function (ICW). ICW is calculated for each term across all contexts in which it occurs. The general philosophy here is that a more specific word/phrase would have a more limited use, i.e., a more specific term would appear in fewer distinct contexts.
In this respect, ICW is similar to the standard inverse document frequency (idf) measure, except that term frequency is measured over syntactic units rather than document-size units.[19] Terms with higher ICW values are generally considered more specific, but the specificity comparison is only meaningful for terms which are already known to be similar. The new function is calculated according to the following formula:

$$
\mathrm{ICW}(w) =
\begin{cases}
\mathrm{ICL}(w) \cdot \mathrm{ICR}(w) & \text{if both exist} \\
\mathrm{ICR}(w) & \text{if only } \mathrm{ICR}(w) \text{ exists} \\
0 & \text{otherwise}
\end{cases}
$$

where (with $n_w, d_w > 0$):

$$
\mathrm{ICL}(w) = \mathrm{IC}([w,\_]) = \frac{n_w}{d_w (n_w + d_w - 1)}
$$

$$
\mathrm{ICR}(w) = \mathrm{IC}([\_,w]) = \frac{n_w}{d_w (n_w + d_w - 1)}
$$

For any two terms $w_1$ and $w_2$, and constants $\delta_1 > 1$, $\delta_2 > 1$, the following situations were considered. If $\mathrm{ICW}(w_2) \geq \delta_1 \cdot \mathrm{ICW}(w_1)$ then $w_2$ is considered more specific than $w_1$. If $\mathrm{ICW}(w_2) < \delta_2 \cdot \mathrm{ICW}(w_1)$ and $\mathrm{ICW}(w_2) > \mathrm{ICW}(w_1)$ then $w_2$ is considered synonymous with $w_1$.[20] In addition, if $\mathrm{SIM}_{norm}(w_1, w_2) = \sigma > \theta$, where $\theta$ is an empirically established threshold, then $w_2$ can be added to the query containing term $w_1$ with weight $\sigma$.[21] The

[19] We believe that measuring term specificity over document-size contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In particular, syntax-based contexts allow for processing texts without any internal document structure.

[20] In TREC runs we used $\delta_1 = 10$ and $\delta_2 = 3$.

[21] For the CACM-3204 collection the filter was most effective at $\theta = 0.57$. For TREC we changed the similarity formula slightly in order to obtain correct normalizations in all cases. This, however, lowered similarity coefficients in general, and a new threshold had to be selected. We used $\theta = 0.1$ in TREC runs.
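The filtering rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names (`icw`, `relation`, `expand`) and the toy ICW and similarity values are hypothetical and chosen only for illustration. The sketch also makes three assumptions the text does not state: "in addition" is read as a conjunction (a term must pass both the ICW test and the similarity threshold), original query terms keep a weight of 1.0, and a term whose ICW is zero is never used as the basis for a comparison. The default constants are the values reported for the TREC runs.

```python
from typing import Dict, Iterable, List, Optional, Tuple

DELTA_1 = 10.0  # specificity constant reported for the TREC runs
DELTA_2 = 3.0   # synonymy constant reported for the TREC runs
THETA = 0.1     # similarity threshold reported for the TREC runs


def icw(ic_left: Optional[float], ic_right: Optional[float]) -> float:
    """Combine the left and right IC values; None means 'does not exist'."""
    if ic_left is not None and ic_right is not None:
        return ic_left * ic_right
    if ic_right is not None:
        return ic_right
    return 0.0


def relation(icw_w1: float, icw_w2: float,
             delta_1: float = DELTA_1, delta_2: float = DELTA_2) -> str:
    """Classify w2 relative to w1 from their ICW values."""
    if icw_w1 <= 0.0:
        # Assumption (not in the text): no usable ICW for w1, so no claim.
        return "other"
    if icw_w2 >= delta_1 * icw_w1:
        return "more_specific"
    if icw_w1 < icw_w2 < delta_2 * icw_w1:
        return "synonymous"
    return "other"


def expand(query: Iterable[str],
           similar: Dict[str, List[Tuple[str, float]]],
           icw_values: Dict[str, float],
           theta: float = THETA) -> Dict[str, float]:
    """Return term -> weight for the query plus the terms that pass the filter."""
    terms = list(query)
    weights = {term: 1.0 for term in terms}  # assumed weight for original terms
    for w1 in terms:
        for w2, sim in similar.get(w1, []):
            if sim <= theta:
                continue  # similarity does not exceed the threshold
            kind = relation(icw_values.get(w1, 0.0), icw_values.get(w2, 0.0))
            if kind == "other":
                continue  # neither a specialization nor a synonym
            weights[w2] = max(weights.get(w2, 0.0), sim)
    return weights


# Toy data, invented for illustration only.
toy_icw = {"vitamin": 1.0, "acid": 0.2, "nutrient": 2.0, "riboflavin": 12.0}
toy_sim = {"vitamin": [("acid", 0.3), ("nutrient", 0.05), ("riboflavin", 0.2)]}
expanded = expand(["vitamin"], toy_sim, toy_icw)
print(expanded)
```

With these toy values, acid is rejected because its ICW is lower than that of vitamin (it is more general), nutrient is rejected because its similarity falls below the threshold, and riboflavin is added with a weight equal to its similarity. Note that a term whose ICW lies between $\delta_2$ and $\delta_1$ times that of $w_1$ falls into neither class under the rules as stated, so the sketch leaves it out.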