MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Conclusion
chapter
Mary Elizabeth Stevens
National Bureau of Standards
several thousand, possible indexing or classificatory labels? 1/
The use of very brief articles, or of abstracts, as the members of experimental
corpora for investigations of automatic assignment indexing techniques presuming the
processing of full text, either for indexing purposes or for subsequent
"indexing-at-time-of-search", is seriously misleading. First, such material is not truly
representative of discursive text, whether in vocabulary, syntax, or stylistic variations
involving synonymity, tropes, elisions, dangling referents, and innumerable other
meaning-implications not explicitly stated.
Secondly, as any author of a technical paper, for which he must provide an abstract,
knows all too well, he must concentrate in the abstract on a telegraphic emphasis toward
his principal topic and the points he wishes to make. He must omit most qualifying,
specifying, and [illegible] words and phrases, which he will in
fact develop in the text itself. For this reason, even supposing that the author himself is
unusually well-aware of the multiple points of access that many different potential users
might desire, the required brevity of the abstract form almost necessarily demands terse,
shorthand-type statements that can only increase the problems of "technese", of homo-
graphy, and of single-subject representation.
Granted, in either manual or machine-serviceable systems today, the current-awareness
scanning need is largely met by indexing based solely or primarily on title alone,
or on title plus abstract. But is this good enough for search and retrieval? If and only if it
is, then automatic indexing potentialities available today should be considered for both
purposes.
Our final question as to whether automatic indexing can be accomplished by statistical
means alone or must involve syntactic, semantic and pragmatic considerations is not
entirely answerable. In terms of achieving comparable quality with many manually pre-
pared indexes available today, statistical means alone do appear promising. But is the
achievement of just this level (even if accompanied by significant gains in timeliness,
coverage, and economy) really good enough? There are a number of serious investigators
1/
For example, Black predicts (1963 [64], p. 19) that for most systems an adequate
vocabulary or thesaurus will comprise some twenty thousand terms. See also
Arthur D. Little, Inc., 1963 [23], p. 65: "The enormous number of computations
required increases very rapidly with the number of indexing terms. Existing computers,
operating serially, do not appear to be capable of handling the problem
economically for collections with 9000 or more terms even if the simplest associative
techniques are employed"; Williams, 1963 [642], p. 162: "One of the practical
problems... is in the inversion of large matrices. In certain methods the order of the
matrix will equal the number of different word types in the population, which is
usually in the thousands."