NIST Monograph 91: Automatic Indexing: A State-of-the-Art Report

Indexes Generated by Machine-Automatic Derivative Indexing (chapter)
Mary Elizabeth Stevens, National Bureau of Standards

"The criteria for attributing significance to words . . . may be positional (in virtue of their occurrence in titles or section headings), or semantic (in virtue of their relation to words like `summary'), or perhaps even pragmatic (in the case of names of specialists mentioned in text footnotes, or bibliography)."

"A cataloguer or abstract-writer would naturally give more weight to a technical word that appears in a title, in a first paragraph, or in a summary. A machine can be programmed to do the same. It can be instructed to recognize the title by position and capitalization . . . It can place first-paragraph indications . . . It can test every heading or subtitle for the words `summary' or `conclusions' and place a summary indication after each word in the summary paragraphs." 1/

"The statistical criteria . . . by no means exhaust the potential clues to the representativeness of sentences. Among other plausible clues are certain words and phrases . . . authors use words such as `conclusion', `demonstrate', `disclose', `prove', `show', and `summary' (and related forms of these) with high frequency in sentences that contain concise statements about the topic or topics of the article. The occurrence in a sentence of such a phrase as `it was found that . . .', `the experiment proves . . .', or `the central problem is . . .' would indicate probably even more sharply than any single word could that the sentence was likely to be highly representative of the topics . . ."
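The positional and cue-word clues described in the quoted passages can be illustrated with a minimal sketch. It is not the procedure of any of the cited authors: the function names, the numeric weights, and the crude prefix matching used to catch "related forms" of the cue words are all assumptions made for illustration. Only the cue words, the cue phrases, and the heading test for `summary' or `conclusions' are taken from the quotations.

```python
import re

# Cue words named in the quoted passage. Prefix matching is an assumption
# made here to catch "related forms" (e.g. "shows", "proved"); it is crude
# and will also match unrelated words such as "shower".
CUE_PREFIXES = ("conclu", "demonstrat", "disclos", "prove", "proving",
                "show", "summar")

# Cue phrases named in the quoted passage.
CUE_PHRASES = ("it was found that", "the experiment proves",
               "the central problem is")

# Hypothetical weights: the source gives no numeric values.
WORD_WEIGHT = 1.0
PHRASE_WEIGHT = 2.0
POSITION_WEIGHT = 1.0


def is_summary_heading(heading):
    """Test a heading or subtitle for the words 'summary' or 'conclusions'."""
    words = re.findall(r"[a-z]+", heading.lower())
    return "summary" in words or "conclusions" in words


def score_sentence(sentence, in_title=False, in_first_paragraph=False,
                   in_summary=False):
    """Combine cue-word, cue-phrase, and positional clues into one score."""
    text = sentence.lower()
    words = re.findall(r"[a-z]+", text)
    score = 0.0
    # One increment for every word that begins with a cue prefix.
    score += WORD_WEIGHT * sum(1 for w in words if w.startswith(CUE_PREFIXES))
    # A larger increment for every cue phrase present in the sentence.
    score += PHRASE_WEIGHT * sum(1 for p in CUE_PHRASES if p in text)
    # One increment for each positional indication that applies.
    score += POSITION_WEIGHT * sum((in_title, in_first_paragraph, in_summary))
    return score
```

Under these assumed weights, a sentence that contains a cue phrase outranks one that contains only a single cue word, in keeping with the observation that a phrase indicates representativeness "even more sharply than any single word could."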
2/

3.3.6 Recent Examples of Mixed Systems Experimentation

It is quite obvious from the above samples of suggestions for the use of various special clues for automatic extraction that improved systems will largely depend upon a mixture of means for determining subject-representativeness of words, phrases, and sentences. Many of the clues suggested by Edmundson and Wyllys are continuing to be explored, as mixed systems, at RAND 3/ and the System Development Corporation (1962 [590]), for example. Two specific recent examples of mixed systems experimentation are the automatic abstracting experiment programs at Thompson Ramo-Wooldridge and the work involving detection of first incidences of nouns at the Harvard Computation Laboratory.

The TRW programs to investigate possibilities of computer generation of document auto-abstracts, involving both English and Russian language texts, are based upon a combination of four different methods to measure significance and determine representativeness. These four methods are briefly described as follows:

The Key method has as its source of machine-recognizable clues the specific characteristics of the body of the document and is based on a Key Glossary of content words taken from the body of the document.

1/ Edmundson and Wyllys, 1961 [181], pp. 227 and 229.
2/ Wyllys, 1963 [653], p. 25.
3/ See National Science Foundation's CR&D report No. 11 [430], pp. 314-315.