CRANV1P1
ASLIB Cranfield Research Project: Factors Determining the Performance of Indexing Systems: Volume 1. Design, Part 1. Text
Indexing Procedures (chapter)
Cyril Cleverdon, Jack Mills, Michael Keen
Cranfield
An investigation supported by a grant to Aslib by the National Science Foundation. Use, reproduction, or publication, in whole or in part, is permitted for any purpose of the United States Government.

Hierarchical linkage may be implemented in a post-coordinate system by generic posting, by coding based on a form of semantic factoring as in the WRU system, or by search strategies based on a classification schedule or thesaurus. In a pre-coordinate system it may be achieved to a large degree by the file arrangement, as in the systematic juxtaposition of related classes in a classified file; it may be achieved fragmentarily by inversion of subject headings in an alphabetical subject catalogue (e.g., Drag, Base; Drag, Form; Drag, Induced; etc.) or by search strategies based on a syndetic network of see also references.

Measuring the performance of index devices

In order to establish recall and precision performance figures for the different devices, both singly and in various combinations, it was first of all desirable to establish, as far as possible, figures for indexing in which none of the devices was operating. Then it would be possible to determine the impact on these figures of the introduction of each device in turn. This assumes, of course, a test collection and a set of questions to be put to it, where it is known just which documents are relevant to each question, as described in the previous chapter.

Performance figures for an 'unindexed' collection seemed to imply a situation in which the complete text of each item in the collection was searched for each question.
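
The comparison described above, base figures first and then the figures after a device is introduced, can be sketched as follows. This is only an illustrative sketch using the usual definitions (recall as the proportion of relevant documents that are retrieved, precision as the proportion of retrieved documents that are relevant). The function name `recall_precision` and all document numbers are hypothetical and are not taken from the test collection.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single question.

    retrieved: document numbers returned by the search
    relevant:  document numbers known to be relevant to the question
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical data, invented purely for illustration.
relevant = {3, 7, 12, 20}              # known relevant documents for one question
baseline = {3, 7, 15}                  # retrieved with no device operating
with_device = {3, 7, 12, 15, 31, 40}   # retrieved after adding one device

base_figures = recall_precision(baseline, relevant)
device_figures = recall_precision(with_device, relevant)
```

With these invented figures the base search gives a recall of 0.5 and a precision of 2/3, while the search with the device gives a recall of 0.75 and a precision of 0.5; the difference between the two pairs is the "impact" of the device. In the 'unindexed' base case, `retrieved` would have to be obtained by scanning the complete text of every document for every question.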
A complete full-text search of this kind would have been too tedious an operation (although something like it, on a small scale and using computer facilities, has been described by Swanson (Ref. 18)). The alternative which we decided to take was to use, as the base situation, one in which the simplest known indexing device was used, and to measure the impact on this of all the other devices. This simplest device was taken to be that of condensation of the full text into an index language consisting solely of the 'uniterms' thrown up by the title and text of the document itself, quite uncontrolled by any prior index language. So the first step was to establish, by the indexing of the test documents, a crude, elemental index language from which all the other languages (each one characterized by the addition of a particular device or aggregate of devices) would be derivable. Before this could be done it was necessary to provide for the control of two major parameters in indexing, exhaustivity and specificity.

Exhaustivity and specificity

Exhaustivity in indexing refers to the degree to which one recognizes (i.e. includes in the index descriptions) the different concepts or notions dealt with in a document. Specificity refers to the generic level at which these concepts or notions are recognized. For example, suppose a report has as its main theme the subject 'Drag on swept wings at high subsonic speeds'. If one neglects, for the time being, the various subsidiary themes which are also dealt with, this report may be said to deal with three concepts: an aerodynamic characteristic, an aerodynamic structure and a flow condition. If these concepts were described in this generic fashion in the index description, the description would be exhaustive but not specific.
If the description consisted only of Drag - High subsonic speeds it would be neither exhaustive nor specific; for whilst the terms retained are specific, the absence of any reference to Swept wings implies that the subject deals with aerodynamic structures in general (some structure is implicit, of course), and this is less than specific, since to be specific a description must be exactly coextensive with the notion represented. There can be no reduction in exhaustivity which is not a reduction in specificity; but the reverse does not hold.
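
The swept-wing example can be restated as a small check. This is a minimal sketch under stated assumptions: each concept is modelled with a single specific term and a single broader term (a one-level hierarchy), and the names `SUBJECT`, `is_exhaustive`, `is_specific` and `classify` are hypothetical. A description counts as exhaustive when every concept is represented at some level, and as specific only when it is exactly coextensive with the subject.

```python
# Specific term -> broader (generic) term for each concept in the subject
# 'Drag on swept wings at high subsonic speeds'. The one-level hierarchy
# is an assumption made for illustration.
SUBJECT = {
    "Drag": "Aerodynamic characteristic",
    "Swept wings": "Aerodynamic structure",
    "High subsonic speeds": "Flow condition",
}


def is_exhaustive(description):
    """True if every concept is recognized, at either generic level."""
    return all(
        specific in description or generic in description
        for specific, generic in SUBJECT.items()
    )


def is_specific(description):
    """True only if the description is exactly coextensive with the subject."""
    return set(description) == set(SUBJECT)


def classify(description):
    """Return (exhaustive, specific) for an index description."""
    return is_exhaustive(description), is_specific(description)


generic_only = {"Aerodynamic characteristic", "Aerodynamic structure", "Flow condition"}
two_terms = {"Drag", "High subsonic speeds"}
full = {"Drag", "Swept wings", "High subsonic speeds"}

results = {
    "generic_only": classify(generic_only),  # exhaustive, not specific
    "two_terms": classify(two_terms),        # neither exhaustive nor specific
    "full": classify(full),                  # both exhaustive and specific
}
```

In this model, dropping a concept always makes `is_specific` false, whereas replacing a term by its broader term lowers specificity while leaving exhaustivity untouched, which mirrors the asymmetry stated in the last sentence above.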