MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
* The Cue method has as its source of machine recognizable clues, the general
characteristics of the corpus that are provided by the bodies of the documents and
is based on a Cue Dictionary of function words apt to appear in the body of a
document.
* The Title method has as its source of machine recognizable clues, the specific
characteristics of the skeleton of the document, i.e., title, headings, and format,
and is based on a Title Glossary comprising those content words found in the
title, subtitles, and headings, but excluding certain words of the Cue Dictionary.
* The Location method has as its source of machine recognizable clues, the
general characteristics of the corpus that are provided by the skeletons of the
documents and uses a Heading Dictionary of certain function words that appear
in the skeletons of documents." 1/
The Harvard work involving detection of the first incidences of nouns as sentence
selection and indexing clues is part of a larger-scale program for mechanized
information selection and retrieval under the general direction of Salton (1961 [512], 1962 [513],
1963 [514] and [515]). The specific mixed system involving frequency data, syntactic
identification clues, and positional criteria is primarily the result of investigations by
Lesk and Storm (1961 [577], 1962 [358]). Related work takes advantage of computer
techniques for predictive syntactic analysis and automatic dictionary lookup also under
development at the Harvard Computation Laboratory (Kuno and Oettinger, 1963 [339],
[340], [341]).
The Lesk-Storm experiments have involved investigations where the hypothesis
assumed is that the points in a text where the author has first introduced a specific noun
or nominal phrase, or where he has used, with higher frequencies, a combination of
first-referred-to-nouns, are most likely to be especially indicative sections of text with
respect to subject-content representativeness. The further assumption is that areas in
which specific "new" ideas, not mentioned previously in the text, are first introduced are
particularly rich in topical-content concentration. 2/
The mixed-system emphasis followed by Lesk and Storm, however, is revealed in
the following comments:
"It is not, of course, apparent that a count of initial occurrences of nouns . . . is by
itself sufficient to reveal areas of significant information content for purposes of
abstracting or indexing. Accordingly, the method suggested here must be used
together with other available means, and is not expected to provide by itself an
acceptable abstracting algorithm." 3/
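The first-occurrence count underlying this hypothesis can be illustrated with a minimal Python sketch. It is not the Lesk-Storm program: the function name new_noun_counts, the toy noun lexicon, and the sample sentences are all assumptions made for illustration, and a fixed word list stands in for the syntactic identification of nouns that the Harvard work relied on.

```python
import re


def new_noun_counts(sentences, noun_lexicon):
    """Return, for each sentence, the number of nouns not seen in any earlier sentence.

    noun_lexicon is a hypothetical stand-in for real noun identification
    (dictionary lookup or syntactic analysis).
    """
    seen = set()
    counts = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        # A noun repeated within one sentence is counted once.
        new_here = {w for w in words if w in noun_lexicon and w not in seen}
        counts.append(len(new_here))
        seen |= new_here
    return counts


# Hypothetical toy data, not drawn from the experiments described here.
NOUNS = {"index", "computer", "text", "noun", "sentence"}
SENTENCES = [
    "The computer reads the text.",
    "The computer counts each noun in the text.",
    "Each sentence receives a count.",
]

print(new_noun_counts(SENTENCES, NOUNS))  # prints [2, 1, 1]
```

Sentences with high counts would be candidates for selection under the hypothesis, although, as the quotation above stresses, such a count is meant to be combined with other criteria rather than used alone.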
In their actual investigations, Lesk and Storm first made manual counts of initial
noun occurrences in various sample texts, noting paragraph, sentence, and
first-incidence-of-word identifications. The computer was then used to carry out three
distinctive tasks: (1) calculation of the number of new nouns for each sentence in the text;
1/ Thompson Ramo Wooldridge, 1963 [603], p. 1.
2/ Lesk and Storm, 1962 [358], p. 1-6.
3/ Storm, 1961 [577], pp. 1-1 and 1-2.