MONO91
NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report
Indexes Generated by Machine-Automatic Derivative Indexing
chapter
Mary Elizabeth Stevens
National Bureau of Standards
Interesting variations are to be noted in the current practices of using stop lists.
Some lists are quite short, and others extend to several thousand words. Parkins reports
that a mere 14 words on the stop lists used for B.A.S.I.C. are responsible for 80 percent
of the title lines that need not be printed, but that their original list of 200 stop words grew
quite rapidly to more than 1,000 now in use. 1/ Chemical Abstracts Service representatives
reported in 1962 an initial list of about 1,000 words which dropped to 300 at one time and
then was increased again to the original level. 2/ A stop list of 82 words eliminated
30 percent of a 42,000-word corpus of internal reports at the System Development
Corporation (Olney, 1961 [456]).
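The kind of measurement reported above (the share of running words that a stop list eliminates) can be illustrated with a minimal sketch. The tokenizer, the nine-word stop list, and the sample sentence are assumptions made for illustration only; they are not the lists or corpora used by Parkins, Chemical Abstracts Service, or Olney.

```python
import re


def tokenize(text):
    """Split text into lowercase alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def fraction_removed(tokens, stop_words):
    """Return the share of running words that the stop list eliminates."""
    if not tokens:
        return 0.0
    removed = sum(1 for token in tokens if token in stop_words)
    return removed / len(tokens)


# Hypothetical stop list and sample text, chosen only for illustration.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}
sample = "A study of the effects of stop lists on the printing of title lines"

tokens = tokenize(sample)
share = fraction_removed(tokens, STOP_WORDS)
print(f"{share:.0%} of {len(tokens)} running words eliminated")
```

Run against a real corpus, the same function would show how quickly a short list of function words accounts for a large share of the text, which is the effect the figures above describe.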
Critical questions in the establishment of stop lists relate to the problem of balancing
the economics of the number of title lines to be printed and to be subsequently scanned
against the loss of retrieval effectiveness if certain words are omitted from the search
entry positions. How this balance should be achieved may vary from one subject field to
another and between different organizations. In several regularly published KWIC indexes,
the actual list used to exclude the presumably nonsignificant words is printed so that the
user can check before proceeding to actual search. Williams has suggested that each
excluded word be listed once, in its proper alphabetic place in the index, if it occurs in
the titles of the particular set of items being indexed. 3/
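Williams' suggestion can be sketched as follows. This is an illustrative reading of the proposal, not a description of any production KWIC program: the function name, the marker text, and the sample titles are all hypothetical. Every significant word produces one index entry per occurrence, while each excluded word that actually occurs in the titles produces exactly one marker entry in its alphabetic place.

```python
import re


def kwic_index(titles, stop_words):
    """Build a sorted keyword index of (keyword, text) pairs.

    Significant words get one entry per occurrence. Each stop word that
    occurs in the titles gets a single marker entry so the user can see
    that it was deliberately excluded.
    """
    entries = []
    excluded_seen = set()
    for title in titles:
        for word in re.findall(r"[A-Za-z]+", title):
            key = word.lower()
            if key in stop_words:
                excluded_seen.add(key)   # remember it, but do not index it
            else:
                entries.append((key, title))
    for key in excluded_seen:
        entries.append((key, "(excluded word, not indexed)"))
    return sorted(entries)


# Hypothetical titles and stop list, for illustration only.
titles = ["Automatic Indexing of Chemical Titles", "The Indexing of Reports"]
index = kwic_index(titles, {"of", "the"})
for keyword, text in index:
    print(f"{keyword:<10} {text}")
```

Here "of" occurs twice in the titles but appears only once in the index, as a marker, so the user learns of the exclusion without consulting a separately printed list.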
In general, however, not enough is yet known about the requirements of particular
subject fields and particular types of organization to arrive at the most effective compromises
in establishing exclusion lists for keyword indexing. Noting that stop lists in
actual use vary from only a few function words such as prepositions and conjunctions to
lists several hundred words long, Brandenberg points out that:
"At the present state of the KWIC indexing art the selection of stop words appears
to be largely arbitrary and a comparison of half a dozen stop lists shows that they
have about two dozen words in common." 4/
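A comparison of the kind Brandenberg describes amounts to intersecting the stop lists. The sketch below assumes each list is available as a sequence of words; the three short lists are invented for illustration and are not the lists he examined.

```python
def common_words(stop_lists):
    """Return the set of words that appear in every stop list (case-insensitive)."""
    sets = [{word.lower() for word in words} for words in stop_lists]
    if not sets:
        return set()
    return set.intersection(*sets)


# Hypothetical stop lists, for illustration only.
list_a = ["a", "and", "of", "the", "report", "study"]
list_b = ["A", "of", "the", "in", "on"]
list_c = ["the", "of", "a", "method"]

shared = common_words([list_a, list_b, list_c])
print(sorted(shared))
```

Words that fall outside the intersection are the ones on which the list makers disagree, and they are natural candidates for the further study of stop-list contents that the text calls for.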
Kennedy and Doyle both specifically suggest that more research on the contents and
effects of stop lists is necessary (Kennedy, 1961 [311], 1962 [310]; Doyle, 1963 [162]),
but Kennedy points out the ease with which the machine programs themselves can be used
for modification of the lists. 5/
1/ Parkins, 1963 [466], p. 27.
2/ F. A. Tate, discussions at seminar on the word and vocabulary byproducts of
permuted title indexing, Biological Abstracts headquarters, October 8, 1962.
3/ T. M. Williams, discussions at seminar on word and vocabulary byproducts of
permuted title indexing, Biological Abstracts headquarters, October 8, 1962.
4/ Brandenberg, 1963 [80], p. 57.
5/ See also Clark (1960 [123], p. 459), who suggests: "It is very probable ... that the
cut-off points [for most common, for very infrequent, words] will have to be adjusted
to the material we actually use. The effect on the process of such factors as style,
size of text, the complexity of the subject matter, and the like, is as yet not clearly
seen. The collection of large amounts of text and their analysis will undoubtedly be
the best way of determining the effects of these variables."