NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report

Chapter: Indexes Generated by Machine-Automatic Derivative Indexing
Mary Elizabeth Stevens, National Bureau of Standards

* The Cue method has as its source of machine recognizable clues, the general characteristics of the corpus that are provided by the bodies of the documents and is based on a Cue Dictionary of function words apt to appear in the body of a document.
* The Title method has as its source of machine recognizable clues, the specific characteristics of the skeleton of the document, i.e., title, headings, and format, and is based on a Title Glossary comprising those content words found in the title, subtitles, and headings, but excluding certain words of the Cue Dictionary.
* The Location method has as its source of machine recognizable clues, the general characteristics of the corpus that are provided by the skeletons of the documents and uses a Heading Dictionary of certain function words that appear in the skeletons of documents." 1/

The Harvard work involving detection of the first incidences of nouns as sentence selection and indexing clues is part of a larger-scale program for mechanized information selection and retrieval under the general direction of Salton (1961 [siz], 1962 [513], 1963 [514] and [515]). The specific mixed system involving frequency data, syntactic identification clues, and positional criteria is primarily the result of investigations by Lesk and Storm (1961 [577], 1962 [358]). Related work takes advantage of computer techniques for predictive syntactic analysis and automatic dictionary lookup, also under development at the Harvard Computation Laboratory (Kuno and Oettinger, 1963 [339], [340], [341]).
The Lesk-Storm experiments have investigated the hypothesis that the points in a text where the author first introduces a specific noun or nominal phrase, or where he uses, with higher frequencies, a combination of first-referred-to nouns, are most likely to be especially indicative sections of text with respect to subject-content representativeness. The further assumption is that areas in which specific "new" ideas, not mentioned previously in the text, are first introduced are particularly rich in topical-content concentration. 2/ The mixed-system emphasis followed by Lesk and Storm, however, is revealed in the following comments: "It is not, of course, apparent that a count of initial occurrences of nouns . . . is by itself sufficient to reveal areas of significant information content for purposes of abstracting or indexing. Accordingly, the method suggested here must be used together with other available means, and is not expected to provide by itself an acceptable abstracting algorithm." 3/

In their actual investigations, Lesk and Storm first made manual counts of initial noun occurrences in various sample texts, noting paragraph, sentence, and first-incidence-of-word identifications. The computer was then used to carry out three distinctive tasks: (1) calculation of the number of new nouns for each sentence in the text;

1/ Thompson Ramo Wooldridge, 1963 [603], p. 1.
2/ Lesk and Storm, 1962 [358], p. 1-6.
3/ Storm, 1961 [577], pp. I-i and 1-2.
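
The first computer task named above, counting the new nouns in each sentence, can be illustrated with a minimal sketch. It is not a reconstruction of the Lesk-Storm program. The function name new_noun_counts, the naive sentence splitter, and the caller-supplied set of nouns are all assumptions made for illustration; the source identifies nouns by manual marking and syntactic analysis, not by a fixed word list.

```python
import re


def new_noun_counts(text, nouns):
    """For each sentence, report the nouns that occur there for the first time.

    `nouns` is a hypothetical, caller-supplied set of lowercase noun forms;
    it stands in for whatever noun identification is actually available.
    Returns a list of (sentence_number, new_noun_count, new_nouns) tuples.
    """
    seen = set()      # nouns already encountered earlier in the text
    results = []

    # Naive sentence split on terminal punctuation (an illustrative assumption).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    for number, sentence in enumerate(sentences, start=1):
        first_seen = []
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word in nouns and word not in seen:
                seen.add(word)
                first_seen.append(word)
        results.append((number, len(first_seen), first_seen))
    return results


if __name__ == "__main__":
    sample = ("The computer reads the text. "
              "The computer counts nouns in the text. "
              "Nouns are counted once.")
    for number, count, words in new_noun_counts(sample, {"computer", "text", "nouns"}):
        print(number, count, words)
```

Under the hypothesis described above, sentences with a high count would be candidates for selection, though, as the quoted comments stress, such a count is meant to be combined with other clues rather than used alone. The sketch does no stemming, so singular and plural forms are treated as different nouns.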