ISR11
Scientific Report No. ISR-11 Information Storage and Retrieval
Information Analysis and Dictionary Construction
chapter
G. Salton
M. E. Lesk
Harvard University
Gerard Salton
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
subbranches;
3) if all words in a given word group are being placed in the same
branch with the high-frequency word, this word belongs one level
up as a parent of all the remaining words.
Consider again the vocabulary of Fig. 19. The highest frequency word
is "computer" (frequency 508), and two classes are first formed of words
like "computer", and of the "other" words (see Fig. 21). The high frequency
class is the one containing the term "computer", so that it is subdivided
again using the word "computer" as a criterion. This produces two
classes consisting respectively of "computer, program, digital, memory"
and "system, circuit, data"; the term "machine" which is generic to the
whole class is left on the second level. The original "other" category can
also be subdivided, using the included high-frequency word "operate" as a
guide, and producing the complete hierarchy shown in Fig. 21.
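A single step of the frequency procedure described above can be sketched roughly as follows. This is only an illustrative sketch and does not attempt to reproduce Fig. 21. The function name `split_on_top_word`, the `related` predicate (standing in for the judgment that a word is "like" the high-frequency word), and every frequency other than the 508 quoted for "computer" are assumptions introduced here, not values taken from Fig. 19.

```python
from typing import Callable, Dict, List, Optional, Tuple


def split_on_top_word(
    words: List[str],
    freq: Dict[str, int],
    related: Callable[[str, str], bool],
) -> Tuple[Optional[str], List[str], List[str]]:
    """Split a word group around its highest-frequency word.

    Returns (parent, like, other). `parent` is set only when rule 3
    applies, i.e. every remaining word falls in the same branch as the
    high-frequency word, which is then moved one level up.
    """
    top = max(words, key=lambda w: freq[w])
    rest = [w for w in words if w != top]
    like = [w for w in rest if related(top, w)]
    other = [w for w in rest if not related(top, w)]
    if rest and not other:
        # Rule 3: the high-frequency word becomes the parent of the rest.
        return top, like, []
    # Otherwise the high-frequency word stays inside its own class.
    return None, [top] + like, other


# Words named in the text; the relatedness judgments are hypothetical.
words = ["computer", "program", "digital", "memory",
         "system", "circuit", "data", "machine", "operate"]
freq = {w: 1 for w in words}   # placeholder frequencies (assumed)
freq["computer"] = 508         # the one frequency quoted in the text
computer_like = {"program", "digital", "memory",
                 "system", "circuit", "data", "machine"}


def related(top: str, w: str) -> bool:
    # Hypothetical stand-in for the analyst's "like computer" decision.
    return top == "computer" and w in computer_like


# First split: a "computer"-like class and an "other" class.
first_split = split_on_top_word(words, freq, related)

# A toy group in which every word joins the high-frequency word (rule 3).
promoted = split_on_top_word(
    ["computer", "program", "digital", "memory"], freq, related)
```

In this toy setup the first call leaves "operate" in the "other" class, while the second call promotes "computer" to parent because no word is left outside its branch. How the second-level subdivision of the text is decided depends on rules and data not shown in this excerpt.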
A comparison of the hierarchies of Figs. 20 and 21 reveals that the
word groups produced by the thesaurus question method of Fig. 20 may be
more reasonable; however, the frequency procedure is more systematic and
may conceivably be easier to apply.
The last hierarchy formation process is also based on a term-document
or a term-property matrix. In this case, however, the process of forming
the hierarchy is completely automatic, even though the original property
matrix may have been constructed by hand. Consider two arbitrary terms
identified by weighted property vectors. The following conditions may
then obtain:
1) terms A and B are identified by different properties, and as
such are not related;
2) terms A and B are identified by the same properties, and the