CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: VOLUME 1. Design, Part 1. Text
Testing Techniques
chapter
Cyril Cleverdon
Jack Mills
Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation.
Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.
for the file, together with weights, resulted in serious difficulties. Another problem
that loomed large was that of recording the aggregate of the different documents
retrieved out of all the possible coordinations at a given coordination level,
since many documents would be retrieved several times. One possible solution to
these problems was to prepare a new peek-a-boo index for each of the recall
devices; that is to say, there would be one index for natural language terms, a
second index with the synonyms controlled, a third index with word endings
confounded, etc. However, the manual re-punching of new indexes would have been a
big task, and at that time no equipment could be found to aggregate a set of postings
from a number of different cards all on to one card. Other considerations militating
against a peek-a-boo index were the task of withdrawing and refiling large numbers
of cards during a search, and the difficulty of performing more than one search at
one time.
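The aggregation problem described above can be illustrated with a small sketch. It assumes, purely for illustration, that a peek-a-boo index is modelled as a mapping from each term to the set of document numbers posted on its card. The function name `retrieved_at_level`, the sample terms and the sample postings are hypothetical and are not taken from the project.

```python
from itertools import combinations

def retrieved_at_level(index, query_terms, level):
    """Return the distinct documents matched by any combination of `level` query terms."""
    retrieved = set()
    for combo in combinations(query_terms, level):
        # Superimposing the cards: documents posted under every term in the combination.
        postings = [index.get(term, set()) for term in combo]
        matched = set.intersection(*postings) if postings else set()
        # A document matched by several combinations is recorded only once.
        retrieved |= matched
    return retrieved

# Hypothetical postings, not data from the project.
index = {
    "insulated": {2076, 2101},
    "two-dimensional": {2076, 2101, 2150},
    "plate": {2076, 2150},
}
result = retrieved_at_level(index, ["insulated", "two-dimensional", "plate"], 2)
```

In this sketch document 2076 satisfies all three two-term combinations but appears only once in the result, which is the aggregate that was difficult to record by hand.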
As a result other conventional index forms were considered but offered no
satisfactory solution. At this point in the project, several people working on
associative retrieval expressed interest in the possibility of using the indexing
being performed on our collection for their own testing of statistical associative
techniques, clumping, etc. With the agreement of the National Science Foundation,
arrangements were made to make the indexing available in machine readable form,
on magnetic tape. The format used for this is given in Appendix 6.1, and details of
supplementary tests being made are given in Chapter 7. With the indexing available
on magnetic tape, its use for computer searching in the tests was then
considered.
A number of discussions were held with various groups, and we received
cost estimates for programming and searching which varied by a factor of ten.
An effort was made to discover whether any suitable computer programme already
existed, which could be used to do the required searches. Discussions were held
during a visit of one of the project staff to the U.S.A., but no suitable programme
was discovered to do the minimum of what was required. This led to a reconsideration
of preparing programmes in this country, but not only were the cost estimates high
in relation to the present project, but also the time factor was becoming critical.
Particularly discouraging was to learn that the searches which we had requested
would result in seven million lines of print out; for these reasons and our own
lack of experience in the field, the idea of using computers was abandoned.
The flirtation with computers had not been entirely wasted, for by this time
we had a clear idea of exactly what was needed, and this helped in producing a
method which met the main requirements. At the time when the solution was first
proposed, no similar method was known to exist, for it is quite unconventional
and it is difficult to visualise any application in real life circumstances. However,
it was later discovered that a somewhat similar suggestion had been made by
Dr. John O'Connor, known as the 'Scan-column index' (ref. 31), although no actual
example of its use in practice is known. It had the advantages of flexibility to
meet changing circumstances, so that it would give results for the many different
types of search, and also of permitting quite complex analyses to be done clerically.
The first stage in the preparation of the index was a complete posting of each
single term used in indexing onto a set of cards. These cards also contained
information regarding the weights assigned to each term. The indexing decisions
regarding Document 2076 are shown on the master indexing sheet in Fig. 6.1.
From this sheet, the single terms and their weights were posted on to cards, with
a separate card for each term. Thus 'Insulated 10', together with the document
code number (2076), would be posted on one card, 'Two-dimensional 10' on another
card together with the code number, and so on for every index term. These cards were
then sorted into alphabetical order and sub-sorted into document number within each
term.
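The posting and sorting stage just described can be sketched as follows. This is only an illustration of the clerical procedure, not a method used in the project: the names `Posting` and `post_terms` are hypothetical, and so are document 1590 and its weight of 7, which are added solely to show the sub-sort. The two entries for Document 2076 follow the example in the text.

```python
from typing import NamedTuple

class Posting(NamedTuple):
    """One card: a single term, its weight, and the document code number."""
    term: str
    weight: int
    document: int

def post_terms(indexing_sheets):
    """Make one card per term per document, then sort by term and document number."""
    cards = [
        Posting(term, weight, document)
        for document, terms in indexing_sheets.items()
        for term, weight in terms
    ]
    # Alphabetical order of term, sub-sorted into document number within each term.
    return sorted(cards, key=lambda card: (card.term.lower(), card.document))

sheets = {
    2076: [("Two-dimensional", 10), ("Insulated", 10)],  # from the example in the text
    1590: [("Insulated", 7)],                            # hypothetical document and weight
}
cards = post_terms(sheets)
```

After sorting, both 'Insulated' cards come first, with document 1590 ahead of 2076, followed by the 'Two-dimensional' card for 2076.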