CRANV1P1 ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text Testing Techniques chapter Cyril Cleverdon Jack Mills Michael Keen Cranfield An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

for the file, together with weights, resulted in serious difficulties. Another problem that loomed large was that of recording the aggregate of the different documents retrieved out of all the possible coordinations at a given coordination level, since many documents would be retrieved several times.

One possible solution to these problems was to prepare a new peek-a-boo index for each of the recall devices; that is to say, there would be one index for natural language terms, a second index with the synonyms controlled, a third index with word endings confounded, etc. However, the manual re-punching of new indexes would have been a big task, and at that time no equipment could be found to aggregate a set of postings from a number of different cards all on to one card. Other considerations militating against a peek-a-boo index were the task of withdrawing and refiling large numbers of cards during a search, and the difficulty of performing more than one search at one time. As a result, other conventional index forms were considered, but they offered no satisfactory solution.

At this point in the project, several people working on associative retrieval expressed interest in the possibility of using the indexing being performed on our collection for their own testing of statistical associative techniques, clumping, etc. With the agreement of the National Science Foundation, arrangements were made to make the indexing available in machine readable form, on magnetic tape.
The format used for this is given in Appendix 6.1, and details of the supplementary tests being made are given in Chapter 7.

With the indexing available on magnetic tape, its use for computer searching in the tests was then considered. A number of discussions were held with various groups, and we received cost estimates for programming and searching which varied by a factor of ten. An effort was made to discover whether any suitable computer programme already existed which could be used to do the required searches. Discussions were held during a visit of one of the project staff to the U.S.A., but no programme was discovered that would do even the minimum of what was required. This led to a reconsideration of preparing programmes in this country, but not only were the cost estimates high in relation to the present project, the time factor was also becoming critical. Particularly discouraging was to learn that the searches which we had requested would result in seven million lines of print-out; for these reasons, and because of our own lack of experience in the field, the idea of using computers was abandoned.

The flirtation with computers had not been entirely wasted, for by this time we had a clear idea of exactly what was needed, and this helped in producing a method which met the main requirements. At the time when the solution was first proposed, no similar method was known to exist, for it is quite unconventional and it is difficult to visualise any application in real life circumstances. However, it was later discovered that a somewhat similar suggestion had been made by Dr. John O'Connor, known as the 'Scan-column index' (ref. 31), although no actual example of its use in practice is known. The method had the advantages of flexibility to meet changing circumstances, so that it would give results for the many different types of search, and also of permitting quite complex analyses to be done clerically.
The first stage in the preparation of the index was a complete posting of each single term used in indexing onto a set of cards. These cards also contained information regarding the weights assigned to each term. The indexing decisions regarding Document 2076 are shown on the master indexing sheet in Fig. 6.1. From this sheet, the single terms and their weights were posted on to cards, with a separate card for each term. Thus 'Insulated 10', together with the document code number (2076), would be posted on one card, 'Two-dimensional 10' on another card together with the code number, and so on for every index term. These cards were then sorted into alphabetical order and sub-sorted into document number within each term.
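
The posting and sorting just described can be illustrated with a minimal modern sketch in Python. It is not part of the original clerical procedure. The names `Card` and `post_cards` are invented for the illustration, and the second document (1412) and its weight are hypothetical, added only so that the sub-sort by document number has something to show. Only the two postings for Document 2076 come from the text.

```python
from typing import NamedTuple


class Card(NamedTuple):
    """One card: a single index term, its weight, and the document number."""
    term: str
    weight: int
    document: int


def post_cards(sheets):
    """Post every (term, weight) on every indexing sheet to its own card,
    then sort alphabetically by term and sub-sort by document number."""
    cards = [
        Card(term, weight, document)
        for document, entries in sheets.items()
        for term, weight in entries
    ]
    return sorted(cards, key=lambda c: (c.term.lower(), c.document))


# Master indexing sheets, keyed by document number.
sheets = {
    2076: [("Insulated", 10), ("Two-dimensional", 10)],  # postings named in the text
    1412: [("Insulated", 10)],  # hypothetical document, invented for the example
}

cards = post_cards(sheets)
for card in cards:
    print(card.term, card.weight, card.document)
```

Keeping one record per term and document, rather than one record per term, mirrors the separate cards of the description and leaves the sorted sequence free to be regrouped later.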