NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report

Chapter: Automatic Classification and Categorization
Mary Elizabeth Stevens, National Bureau of Standards

5.3 Latent Class Analysis

Like the earlier work of Tanimoto, the latent class analysis approach of Baker (1962 [27]) to problems of automatic information classification and retrieval is at least to date theoretical rather than experimental in nature, and so will be considered only briefly here. Baker claims that the latent class model developed in the field of the sociological sciences for the determination of latent classes among individuals responding "yes" or "no" to items in a questionnaire would have attractive features for application to information categorization and search, because the model is based upon response patterns that are analogous to the presence or absence of clue words or phrases in documents and because the analysis yields an ordering ratio that could serve a function similar to the relevance weightings suggested by Maron and Kuhns. This ordering ratio is the probability that a given pattern of clue words will occur in a document properly belonging to a particular latent class. The probabilities of the same pattern being generated by a document properly belonging to other classes are also provided, giving an uncertainty which Baker thinks justifiable because a "document could generate a given pattern of key words, yet not belong to the same area of interest as the majority of documents possessing the same pattern of keywords". 1/ It should be noted, however, that the question of how to select appropriate clue words is begged 2/ and that no computer programs are as yet available for carrying out latent class analyses. 3/

5.4 Examples of Other Proposed Classificatory Techniques

There are certain other document classificatory techniques that have been proposed and to some extent investigated experimentally.
Trials of document clusterings based on co-citingness, co-citedness, or bibliographic coupling as compared with subject content groupings have, as noted above, been conducted both by Kessler at the M.I.T. Libraries and by Salton's group at Harvard. 4/ Consideration of Doyle's work on word co-occurrence statistics has been deliberately deferred to a later section which covers his general "association map" approach. Similarly, several other investigations will be discussed in terms of potentially related research such as linguistic data processing.

Two particular examples of other suggested classificatory techniques for document grouping or classification are somewhat unusual, however. These are the methods proposed by Te Nuyl and by Lefkovitz (1963 [353]). Cleverdon and Mills comment on Te Nuyl's method as follows:

1/ Baker, 1962 [27], p. 518.
2/ Ibid., p. 517. Note also that the footnote states: "A referee of this paper has properly cautioned that the effectiveness of an information retrieval system may be due more to the appropriateness of the key words than the subsequent processing." See also Hillman, 1963 [272], p. 323: "Baker's theory, however, is based on interrelationships of key words, and thus constitutes an approach which is regarded with some suspicion by Farradane, who thinks that the real problem concerns the interrelationships of the concepts which key words denote."
3/ Baker, 1962 [27], p. 516.
4/ See Kessler, 1963 [320]; Lesk, 1963 [356, 357], and p. 30 of this report.
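
With reference to the ordering ratio of Section 5.3, the following is a minimal illustrative sketch and not Baker's own procedure. It assumes that clue words occur independently of one another within a latent class (the usual local-independence assumption of latent class models) and that the class proportions and per-class word probabilities are already known. The function names, class labels, and all numerical values are hypothetical and chosen only for illustration.

```python
from math import prod

def pattern_likelihood(pattern, word_probs):
    """Probability of a presence/absence pattern of clue words given one class.

    Assumes (hypothetically) that clue words are independent within a class.
    pattern    -- sequence of 0/1 flags, one per clue word
    word_probs -- probability that each clue word is present in that class
    """
    return prod(p if present else 1.0 - p
                for present, p in zip(pattern, word_probs))

def class_posteriors(pattern, priors, word_probs_by_class):
    """Probability of each latent class given the observed pattern (Bayes' rule)."""
    joint = {c: priors[c] * pattern_likelihood(pattern, word_probs_by_class[c])
             for c in priors}
    total = sum(joint.values())
    return {c: value / total for c, value in joint.items()}

# Invented example: two latent classes, three clue words.
priors = {"A": 0.5, "B": 0.5}
word_probs_by_class = {
    "A": [0.9, 0.8, 0.1],
    "B": [0.2, 0.3, 0.7],
}
pattern = (1, 1, 0)  # first two clue words present, third absent

likelihood_a = pattern_likelihood(pattern, word_probs_by_class["A"])
posteriors = class_posteriors(pattern, priors, word_probs_by_class)
```

In this invented case the pattern is far more probable under class A than under class B, yet class B retains a small nonzero probability, which corresponds to the residual uncertainty that Baker regards as justifiable.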