Information Retrieval Experiment
The Cranfield tests
Karen Sparck Jones
Butterworth & Company
Criticisms of Cranfield 1
in involving real relevance assessment, the initially poor WRU performance improves, until it becomes superior to that for facet: the performance figures are reversed as the test becomes more realistic.
There is no doubt that such reviewers point to failures in the related Cranfield 1 and 1+ tests. However, it is clear from the comments that it was recognized that only through the experience of major tests could significant progress be made in experimental design and system understanding. The reviewers all emphasize the importance of Cranfield 1 in particular: for example, Mote says
`this project represents the most serious attempt yet made to derive a basis
for the comparison of indexing systems.' (p.81)
At the same time, the way forward was indicated by Richmond's call for
more care and Sharp's condemnation of source documents: as Sharp said,
`the source-document principle should be dropped and future tests carried
out taking into account all relevant documents retrieved.' (p.174)
As indicated at the beginning of the chapter, these criticisms are reproduced here to illustrate contemporary, methodologically motivated reactions to Cranfield 1. However, some of the attacks on the test were fundamentally mistaken, the most obvious example being Taube's attack on the `pseudo-mathematics' of its treatment of relevance¹⁰, which confounded relevance assessment with the precision measure. Another example is Mote's misconceived comment on the lack of control of indexing depth in questions.
More importantly, points which were not obviously wrong varied in status
as criticisms of the test. Some criticisms disregarded the stated test objectives.
Thus Mote's view that the system operations were not realistic is hardly a
criticism when it is directed at the constraints required by experimental
control. Other critics of the test suggest that particular factors could have
influenced the test without demonstrating that they did: an example is
Richmond's remark about primary and subsidiary indexing. Such criticisms, though suggestive, must be regarded as speculative. Yet other criticisms have substance, though not in a narrow sense. These mostly concern the use of source document questions. It has never been shown that source document questions look or behave any differently from `regular' questions. Thus while Swanson
suggests that it is possible that the lack of any real difference between the
indexing languages can be attributed to the use of source documents, it does
not follow that it must be so attributed. On the contrary, the fact that in many subsequent tests of different kinds indexing languages have tended to perform the same suggests that the Cranfield results were correctly attributed to the language variable. But though the use of source documents may not be grounds for straightforwardly criticizing the test, whether the source documents could have affected the results is a serious question about the test. This was clearly accepted at Cranfield, and a different procedure was adopted for Cranfield 2.
The real limitation of Cranfield 1 was its failure to measure precision,
though this was recognized in time for the supplementary test, and
Cranfield 1+ and later Cranfield 2 tests were designed to measure precision
along with recall.
Reviewing the criticisms of Cranfield 1 now it is evident that though some