NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report

Other Potentially Related Research (chapter)
Mary Elizabeth Stevens
National Bureau of Standards

6.4 Probabilistic Indexing and Natural Language Text Searching

As in the case of automatic indexing proposals based upon automatic sentence extraction techniques, machine searching of full natural language text has been suggested as a basis for, at least, automatic derivative indexing. We have remarked previously that the machine use of complete text can only be considered to be "indexing" in a very special sense, that it is subject either to the non-availability of suitable corpora already in machine-usable form or to high costs of conversion to this form, and that too little is yet known of linguistic analysis and searching-selection strategies effectively applicable to natural language materials. Various examples of corroborating opinion, other than those previously cited, are as follows:

"Machine searching is superb if it is known exactly how to describe the object of search, and if one could know how to choose from among many possible searching strategies. I doubt if any one is yet in this comfortable position with respect to machine searching of text." 1/

"The most effective programs in automatic linguistic analysis have served only to illustrate how really complex is the structure of the language, and how far removed the present state of the art is from any system which might be useful in practice." 2/

"The recognition of words involves only the matching of digital codes, but the recognition of an idea is a severe intellectual problem, the solution to which will probably never be exact. Nevertheless, this is the problem which must be attacked if accuracy is ever to be attained, or even approached, in using the text of information items as a basis for their recovery."
3/

Nevertheless, some of the work both in natural language text searching and in "probabilistic indexing" (where weights representing judgments as to the degree of relevance of an indexing term to an item are used either in indexing or in search) provides instructive insights into some of the problems of automatic indexing. In the period 1958-1960, work at Ramo-Wooldridge resulted in the release or publication of provocative papers by Maron, Kuhns, and Ray on "probabilistic indexing" (1959 [398], 1960 [397]) and by Swanson on natural language text searching by computer (1960 [587, 582], 1963 [583]). Subsequent work along these lines has included further developments at Thompson Ramo-Wooldridge, the law statutes work at the Health Law Center at the University of Pittsburgh, and the experimental investigations of Eldridge and Dennis in a project jointly sponsored by the American Bar Foundation, IBM, and the Council on Library Resources.

1/ Doyle, 1959 [168], p. 2.
2/ Salton, 1962 [520], p. III-1 through III-2.
3/ Doyle, 1959 [165], p. 12.
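
The parenthetical description of "probabilistic indexing" given above (weights expressing the judged degree of relevance of an indexing term to an item) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration only: the item names, the terms, the weights, and in particular the scoring rule (a plain sum of the weights of the matching query terms). It is not a reconstruction of the method of Maron, Kuhns, and Ray.

```python
from typing import Dict, List, Tuple

# Hypothetical weighted index: item -> {indexing term: judged relevance weight}.
# All names and weights are invented for illustration.
INDEX: Dict[str, Dict[str, float]] = {
    "doc_a": {"indexing": 0.9, "probability": 0.7},
    "doc_b": {"indexing": 0.4, "law": 0.8},
    "doc_c": {"law": 0.9, "statutes": 0.6},
}


def rank(index: Dict[str, Dict[str, float]],
         query_terms: List[str]) -> List[Tuple[str, float]]:
    """Rank items by the summed weights of the query terms assigned to them.

    Summing weights is an assumed scoring rule chosen for simplicity.
    Items that match no query term are omitted from the result.
    """
    scored = []
    for item, weights in index.items():
        score = sum(weights.get(term, 0.0) for term in query_terms)
        if score > 0:
            scored.append((item, score))
    # Highest score first; ties are broken by item name for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    for item, score in rank(INDEX, ["indexing", "law"]):
        print(item, round(score, 2))
```

The point of the sketch is only that a weighted index permits a graded ordering of retrieved items, whereas an unweighted index can report only that a term was or was not assigned to an item.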