MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Classification and Categorization
chapter
Mary Elizabeth Stevens
National Bureau of Standards
5.3 Latent Class Analysis
Like the earlier work of Tanimoto, the latent class analysis approach of Baker (1962
[27]) to problems of automatic information classification and retrieval is, at least to date,
theoretical rather than experimental in nature, and so will be considered only briefly here.
Baker claims that the latent class model developed in the field of the sociological sciences
for the determination of latent classes among individuals responding "yes" or "no" to
items in a questionnaire would have attractive features for application to information
categorization and search, because the model is based upon response patterns that are
analogous to the presence or absence of clue words or phrases in documents and because
the analysis yields an ordering ratio that could serve a function similar to the relevance
weightings suggested by Maron and Kuhns.
This ordering ratio is the probability that a given pattern of clue words will occur
in a document properly belonging to a particular latent class. The probabilities of the
same pattern being generated by a document properly belonging to other classes are also
provided, giving an uncertainty which Baker thinks justifiable because a "document could
generate a given pattern of key words, yet not belong to the same area of interest as the
majority of documents possessing the same pattern of keywords". 1/ It should be noted,
however, that the question of how to select appropriate clue words is begged 2/ and that
no computer programs are as yet available for carrying out latent class analyses. 3/
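
As an illustration only, the quantities described above can be sketched in a few lines of Python. This is not Baker's procedure: it assumes the usual latent class model in which clue words occur independently within a class, and the two classes, three clue words, and all probabilities below are invented for the example. The names `class_priors`, `word_probs`, `pattern_likelihood`, and `class_posteriors` are hypothetical. Because the text can be read either way, the sketch computes both the probability of a pattern given a class and the probability of each class given the pattern.

```python
# Hypothetical two-class model over three clue words (all numbers invented).
class_priors = [0.6, 0.4]          # assumed P(class)
word_probs = [
    [0.9, 0.7, 0.1],               # assumed P(word present | class 0)
    [0.2, 0.3, 0.8],               # assumed P(word present | class 1)
]


def pattern_likelihood(pattern, probs):
    """P(pattern | class), assuming words are independent within a class."""
    result = 1.0
    for present, p in zip(pattern, probs):
        result *= p if present else (1.0 - p)
    return result


def class_posteriors(pattern):
    """P(class | pattern) for every class, obtained by Bayes' rule."""
    joint = [
        prior * pattern_likelihood(pattern, probs)
        for prior, probs in zip(class_priors, word_probs)
    ]
    total = sum(joint)
    return [j / total for j in joint]


# A document containing the first two clue words but not the third.
pattern = (1, 1, 0)
print([pattern_likelihood(pattern, probs) for probs in word_probs])
print(class_posteriors(pattern))
```

Under these invented figures the pattern is far more probable in the first class, yet the second class keeps a small nonzero probability, which corresponds to the residual uncertainty noted in the quotation above.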
5.4 Examples of Other Proposed Classificatory Techniques
There are certain other document classificatory techniques that have been proposed
and to some extent investigated experimentally. Trials of document clusterings based
on co-citingness, co-citedness, or bibliographic coupling as compared with subject con-
tent groupings have, as noted above, been conducted both by Kessler at the M.I.T.
Libraries and by Salton's group at Harvard. 4/ Consideration of Doyle's work on word
co-occurrence statistics has been deliberately deferred to a later section which covers
his general "association map" approach. Similarly, several other investigations will be
discussed in terms of potentially related research such as linguistic data processing.
Two particular examples of other suggested classificatory techniques for document
grouping or classification are somewhat unusual, however. These are the methods pro-
posed by Te Nuyl and by Lefkovitz (1963 [353]). Cleverdon and Mills comment on Te
Nuyl's method as follows:
1/ Baker, 1962 [27], p. 518.
2/ Ibid., p. 517. Note also that the footnote states: "A referee of this paper has properly
cautioned that the effectiveness of an information retrieval system may be due
more to the appropriateness of the key words than the subsequent processing." See
also Hillman, 1963 [272], p.323: "Baker's theory, however, is based on inter-
relationships of key words, and thus constitutes an approach which is regarded with
some suspicion by Farradane, who thinks that the real problem concerns the inter-
relationships of the concepts which key words denote."
3/ Baker, 1962 [27], p. 516.
4/ See Kessler, 1963 [320]; Lesk, 1963 [356, 357], and p. 30 of this report.