MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Automatic Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
"The differentiation that is made between the two types of indexing is that word
indexing is inextricably tied to the words in a text: If a word appears it gets
indexed as such; if it does not appear it does not get indexed. Concept index-
ing, on the other hand, has an element of abstraction in it: Words may either
be indexed as such or may be converted, either by themselves or in combination
with other words, into concepts which may not bear a direct resemblance to
the words or combinations of words that evoked them in the indexer's mind."
Machine techniques such as those of Luhn's KWIC, like the early Uniterm systems,
look no farther than the words used by the one author himself. Techniques such as those
of Maron, Swanson, Borko, Meadow and Williams, among others, look specifically to
relationships between words as used by one author to patterns of word usages in a given
subject area or given document collection. They may also look to these patterns as in
turn related to prior human analytic judgments of the "aboutness" referents of items in
the collection. In this sense, they at least attempt replication by machine of assignment
indexing.
There is no real question but that machines can in fact derive words from text pro-
vided that it is in machine-readable form. This machine procedure may involve direct
extraction of all words as index entries, as in a complete concordance. It may involve
the extraction of only those words which survive a "purging" operation in which articles,
conjunctions, adjectives, and other "common" words are first deleted. Various machine-
controlled modifications to such "derivative" indexing are also available. The case for
machine achievement of assignment indexing for any but limited special cases is not so
clear.
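
The two derivative procedures just described, extraction of every word and
extraction of only those words that survive a "purging" operation, can be
sketched roughly as follows. This is a minimal illustration only, not any
particular system's method: the stop list, the function name derive_index, and
the sample sentence are all assumptions introduced here.

```python
import re
from collections import defaultdict

# Hypothetical stop list for illustration; real purge lists are far longer.
STOP_WORDS = frozenset({"a", "an", "the", "and", "or", "but", "of", "in", "to"})

def derive_index(text, stop_words=frozenset()):
    """Map each word of the text to the list of word positions where it occurs.

    With an empty stop list every word becomes an index entry; with a stop
    list, the "common" words are purged before indexing.
    """
    index = defaultdict(list)
    for position, word in enumerate(re.findall(r"[a-z]+", text.lower())):
        if word not in stop_words:
            index[word].append(position)
    return dict(index)

# Assumed sample text, not taken from the report.
sample = "The machine derives the words of a text and lists the words in order"

full_index = derive_index(sample)                # every word is an entry
purged_index = derive_index(sample, STOP_WORDS)  # common words deleted first
```

In this sketch full_index retains an entry for "the", while purged_index keeps
only the six content words. Note that nothing here goes beyond the author's
own words, which is precisely the limitation of derivative indexing.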
2. INDEXES COMPILED BY MACHINE
A first and obvious use of machines in indexing processes is in the manipulation of
index entries, previously selected on the basis of human analysis, to produce various
orderings, duplications and listings of these entries. The power of machine techniques
to speed and economize the sorting, ordering and listing operations in the preparation
or compilation of indexes was recognized quite early, both in the field of library science
and in the consideration of potential areas of application by specialists in machine
potentialities.
In particular, two specialized types of index, at least in the broad sense, are such
that their compilation would be almost prohibitive in terms of time and cost were it not
for the use of machines. These are, respectively, the case of the complete index, the
index to all words of a text in their various contexts, which is a concordance, 1/ and the
case of the "citation index", which has been used in the field of law for many years but
has only quite recently been suggested for literature search purposes related to
scientific and technical information.
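
The "complete index" mentioned above, an index to all words of a text in their
various contexts, might be compiled along the following lines. This is a
hedged sketch under stated assumptions: the function name concordance, the
context width of two words, and the sample sentence are illustrative choices
made here, not details given in the report.

```python
import re
from collections import defaultdict

def concordance(text, width=2):
    """Return an alphabetized mapping from every word to its occurrences.

    Each occurrence is a (left context, word, right context) triple, where
    each context holds up to `width` neighbouring words (an assumed setting).
    """
    words = re.findall(r"[a-z']+", text.lower())
    entries = defaultdict(list)
    for i, word in enumerate(words):
        left = " ".join(words[max(0, i - width):i])
        right = " ".join(words[i + 1:i + 1 + width])
        entries[word].append((left, word, right))
    return dict(sorted(entries.items()))

# Assumed sample text for illustration.
sample = "In the beginning was the word and the word was with God"
table = concordance(sample)

# List every context in which the word "word" occurs.
for left, keyword, right in table["word"]:
    print(f"{left:>12} | {keyword} | {right}")
```

Even this small sample yields one entry per occurrence of every word, which
suggests why manual compilation for a long text is so costly.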
1/
See, for example, Doyle, 1963 [162], p. 11: "Without data-processing
machinery, concordances are prohibitively expensive to generate for most uses
except in those cases where it is well known that a given volume of text is going
to be used again and again, by large numbers of people over a long period of
time. As we know, clergymen have made use of manually prepared concordances
of the Bible since the 12th century".