Scientific Report No. IRS-13
Information Storage and Retrieval
Word-Word Associations in Document Retrieval Systems
M. E. Lesk
Harvard University
Gerard Salton

Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

To investigate this problem, a list of word-word associations from a collection in aerodynamics (the Cranfield collection mentioned earlier) was prepared and analyzed for significance, using a correlation cutoff of 0.60. Each word pair was examined and judged either significant or non-significant. Significant pairs are those that seem to be composed of semantically related words. Words are judged to be semantically related if they would normally be used together in discussions of the same topic, taking the most common technical definition of each word as its meaning. For example, "per" and "cent" are considered a significant pair; so are "atmosphere" and "satellite". On the other hand, "leading" and "edge" are judged non-significant (by this standard), as are [OCRerr] and [OCRerr], and "machine" and "evaluating". There is a great deal of subjectivity in such decisions; however, all judgments were made by the author and are therefore reasonably consistent.

All the word pairs were then classified by frequency of components and by correlation. The resulting table was then examined to see whether any combination of parameters yields a particularly high ratio of significant to non-significant correlations. Overall, only 16.2% of all correlations are judged significant. Fig. 2 and Table 2 show the variation of significance with correlation level. There is a small increase in the fraction of pairs judged significant at higher correlations, but the number of correlations above cutoff decreases so rapidly as the cutoff is raised that this is not a practical way of improving the quality of information produced by the association scheme.
Even at a cutoff of 0.9 (which means that if two words occur five times each, all five would have to be co-occurrences, or if each occurs ten times, nine would have to be co-occurrences)