MONO91 NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report Indexes Generated by Machine-Automatic Derivative Indexing chapter Mary Elizabeth Stevens National Bureau of Standards

Interesting variations are to be noted in the current practices of using stop lists. Some lists are quite short, and others extend to several thousand words. Parkins reports that a mere 14 words on the stop lists used for B.A.S.I.C. are responsible for 80 percent of the title lines that need not be printed, but that their original list of 200 stop words grew quite rapidly to more than 1,000 now in use. 1/ Chemical Abstracts Service representatives reported in 1962 an initial list of about 1,000 words which dropped to 300 at one time and then was increased again to the original level. 2/ Using a stop list of 82 words eliminated 30 percent of a 42,000-word corpus of internal reports at the System Development Corporation (Olney, 1961 [456]).

Critical questions in the establishment of stop lists relate to the problem of balancing the economics of the number of title lines to be printed and to be subsequently scanned against the loss of retrieval effectiveness if certain words are omitted from the search entry positions. How this balance should be achieved may vary from one subject field to another and between different organizations. In several regularly published KWIC indexes, the actual list used to exclude the presumably nonsignificant words is printed so that the user can check before proceeding to actual search. Williams has suggested that each excluded word be listed once, in its proper alphabetic place in the index, if it occurs in the titles of the particular set of items being indexed. 3/ In general, however, not enough is yet known about the requirements of particular subject fields and particular types of organization to arrive at the most effective compromises in establishing exclusion lists for keyword indexing.
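The mechanics of a stop list discussed above can be illustrated with a minimal sketch in Python. It is not a reconstruction of any program cited in this report: the stop list, the sample titles, and the function names (`kwic_entries`, `suppressed_fraction`, `excluded_words_present`) are all hypothetical, and each index entry is simplified to a (keyword, title) pair rather than a rotated title line. The sketch shows how excluding words reduces the number of entries to be printed, how the share of suppressed word occurrences might be measured, and how the excluded words that actually occur in the indexed titles could be listed, along the lines of Williams' suggestion.

```python
import re

# Hypothetical, illustrative stop list; not one of the lists cited in the text.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "by", "with"}


def tokenize(title):
    """Split a title into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def kwic_entries(titles, stop_words=STOP_WORDS):
    """Return one (keyword, title) entry per non-stop word, sorted by keyword."""
    entries = []
    for title in titles:
        for word in tokenize(title):
            if word not in stop_words:
                entries.append((word, title))
    return sorted(entries)


def suppressed_fraction(titles, stop_words=STOP_WORDS):
    """Fraction of all word occurrences that the stop list keeps out of the index."""
    tokens = [word for title in titles for word in tokenize(title)]
    if not tokens:
        return 0.0
    stopped = sum(1 for word in tokens if word in stop_words)
    return stopped / len(tokens)


def excluded_words_present(titles, stop_words=STOP_WORDS):
    """Excluded words that occur in these titles, each listed once, alphabetically."""
    seen = {word for title in titles for word in tokenize(title)}
    return sorted(seen & set(stop_words))


if __name__ == "__main__":
    sample = ["The Indexing of Chemical Titles", "A Study of Stop Lists"]
    for keyword, title in kwic_entries(sample):
        print(f"{keyword:<10} {title}")
    print("suppressed:", suppressed_fraction(sample))
    print("excluded words present:", excluded_words_present(sample))
```

Enlarging `STOP_WORDS` shortens the printed index but removes search entry points, which is the trade-off the text describes; the sketch only makes that trade-off measurable for a given set of titles and does not indicate where the balance should lie.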
Noting that stop lists in actual use vary from only a few function words such as prepositions and conjunctions to lists several hundred words long, Brandenberg points out that: "At the present state of the KWIC indexing art the selection of stop words appears to be largely arbitrary and a comparison of half a dozen stop lists shows that they have about two dozen words in common." 4/ Kennedy and Doyle both specifically suggest that more research on the contents and effects of stop lists is necessary (Kennedy, 1961 [311], 1962 [310]; Doyle, 1963 [162]), but Kennedy points out the ease with which the machine programs themselves can be used for modification of the lists. 5/

1/ Parkins, 1963 [466], p. 27.
2/ F. A. Tate, discussions at seminar on the word and vocabulary byproducts of permuted title indexing, Biological Abstracts headquarters, October 8, 1962.
3/ T. M. Williams, discussions at seminar on word and vocabulary byproducts of permuted title indexing, Biological Abstracts headquarters, October 8, 1962.
4/ Brandenberg, 1963 [80], p. 57.
5/ See also Clark (1960 [123], p. 459), who suggests: "It is very probable ... that the cut-off points [for most common, for very infrequent, words] will have to be adjusted to the material we actually use. The effect on the process of such factors as style, size of text, the complexity of the subject matter, and the like, is as yet not clearly seen. The collection of large amounts of text and their analysis will undoubtedly be the best way of determining the effects of these variables."