Scientific Report No. IRS-13 Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
To investigate this problem, a list of word-word associations from
a collection in aerodynamics (the Cranfield collection mentioned earlier)
was prepared and analyzed for significance, the cutoff used being 0.60.
Each word pair was examined and judged either as significant or non-
significant. Significant pairs are those pairs which seem to be composed
of semantically related words. The words are judged to be semantically
related if they would normally be used together in discussions of the same
topic, considering the most common technical definitions of the words to be
their meaning. For example, "per" and "cent" is considered a significant
pair; so is "atmosphere" and "satellite". On the other hand, "leading" and
"edge" is judged non-significant (by this standard) as is [OCRerr] and [OCRerr]
or "machine" and "evaluating". There is a great deal of subjectivity in
such decisions; however, all judgments were made by the author and are
therefore reasonably consistent.
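
As an illustration of how such a list of associations might be prepared,
the following is a minimal sketch, not the procedure actually used for the
Cranfield collection. It assumes that the correlation of two words is the
number of documents containing both, divided by the square root of the
product of the numbers of documents containing each; this passage does not
state the formula. The function name associated_pairs, the input layout
(one set of words per document), and the toy documents are hypothetical.

```python
from itertools import combinations
from math import sqrt


def associated_pairs(doc_terms, cutoff=0.60):
    """Return {(word_a, word_b): correlation} for pairs at or above cutoff.

    doc_terms: list of sets, one set of words per document (assumed layout).
    Assumed measure: co / sqrt(n_a * n_b), where n_a and n_b count the
    documents containing each word and co counts those containing both.
    """
    doc_freq = {}  # word -> number of documents containing it
    co_freq = {}   # (word_a, word_b) -> number of documents containing both
    for terms in doc_terms:
        for word in terms:
            doc_freq[word] = doc_freq.get(word, 0) + 1
        # sorted() gives each unordered pair one canonical key
        for a, b in combinations(sorted(terms), 2):
            co_freq[(a, b)] = co_freq.get((a, b), 0) + 1

    pairs = {}
    for (a, b), co in co_freq.items():
        r = co / sqrt(doc_freq[a] * doc_freq[b])
        if r >= cutoff:
            pairs[(a, b)] = r
    return pairs


# Toy data, invented for illustration only.
docs = [
    {"per", "cent", "flow"},
    {"per", "cent"},
    {"flow", "wing"},
    {"wing", "per"},
]
pairs = associated_pairs(docs)  # cutoff 0.60, as in the text
```

With these four toy documents only the pair ("cent", "per") reaches the
0.60 cutoff. Each surviving pair would then still have to be judged
significant or non-significant by hand, as described above.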
All the word pairs were then classified by frequency of components
and by correlation. The resulting table was then examined to see if any
combination of parameters yields a particularly high ratio of significant
to non-significant correlations. Overall, only 16.2% of all correlations
are judged significant. Fig. 2 and Table 2 show the variation of signifi-
cance with correlation level. There is a small increase in the fraction
of pairs judged significant at higher correlations, but the number of
correlations above cutoff decreases so rapidly as the cutoff is raised
that this is not a practical way of improving the quality of information
produced by the association scheme. Even at a cutoff of 0.9 (which means
that if two words occur five times each, all five would have to be co-
occurrences, or if each occurs ten times, nine would have to be co-occurrences)