in involving real relevance assessment, the initial poor WRU performance improves, indeed until it becomes superior to that for facet: performance figures are reversed as the test becomes more realistic.

There is no doubt that such reviewers point to failures in the related Cranfield 1 and 1+ tests. However, it is clear from the comments that it was recognized that only through the experience of major tests could significant progress be made in experimental design and system understanding. The reviewers all emphasize the importance of Cranfield 1 in particular: for example Mote says 'this project represents the most serious attempt yet made to derive a basis for the comparison of indexing systems.' (p.81) At the same time, the way forward was indicated by Richmond's call for more care and Sharp's condemnation of source documents: as Sharp said, 'the source-document principle should be dropped and future tests carried out taking into account all relevant documents retrieved.' (p.174)

As indicated at the beginning of the chapter, these criticisms are reproduced here to illustrate contemporary, methodologically motivated reactions to Cranfield 1. However, some of the attacks on the test were fundamentally mistaken, the most obvious example being Taube's on the 'pseudo-mathematics' of its treatment of relevance10, which confounded relevance assessment and the precision measure. Another example is Mote's misconceived comment on the lack of control of indexing depth in questions.

More importantly, points which were not obviously wrong varied in status as criticisms of the test. Some criticisms disregarded the stated test objectives. Thus Mote's view that the system operations were not realistic is hardly a criticism when it is directed at the constraints required by experimental control. Other critics of the test suggest that particular factors could have influenced the test without demonstrating that they did: an example is Richmond's remark about primary and subsidiary indexing. Such criticisms, though suggestive, must be regarded as speculative.

Yet other criticisms have substance, but not in a narrow sense. These mostly concern the use of source document questions. It has never been shown that source document questions do not look or behave like 'regular' questions. Thus while Swanson suggests that it is possible that the lack of any real difference between the indexing languages can be attributed to the use of source documents, it does not follow that it must be so attributed. On the contrary, the fact that in many subsequent tests of different kinds indexing languages have tended to perform the same suggests that the Cranfield results were correctly attributed to the language variable. But though the use of source documents may not be grounds for straightforwardly criticizing the test, whether the source documents could have affected the results is a serious question about the test.
This was clearly accepted at Cranfield, and a different procedure was adopted for Cranfield 2. The real limitation of Cranfield 1 was its failure to measure precision, though this was recognized in time for the supplementary test, and the Cranfield 1+ and later Cranfield 2 tests were designed to measure precision along with recall.

Reviewing the criticisms of Cranfield 1 now, it is evident that though some