That said, the measures of performance or effectiveness used in the majority of retrieval tests are the well-known recall and precision. Ignoring for the moment the problems concerned with averaging over requests, these measures are usually defined as follows:

Recall = Proportion of the relevant documents that are retrieved;
Precision = Proportion of the retrieved documents that are relevant.

Clearly, these measures relate closely to the ideas discussed above concerning performance limits and failure analysis; a relevant document not retrieved or a non-relevant document retrieved is to be regarded as a failure, and the implicit suggestion is of an ideal performance of 100 per cent recall and 100 per cent precision. It should be noted, however, that there may be other (lower) limits to performance, that is, reasons why some of these 'failures' are inevitable. These questions are taken up again, in more detail, by van Rijsbergen in the next chapter. The bulk of the rest of this chapter is concerned, in rather general terms, with the problem of making inferences from the results of retrieval tests.

Aside from questions of performance, the main category of measurements discussed above was that of costs. Generally speaking, measurement of costs does not present the same kind of intellectual problems as measurement of performance, in that (for example) the final measure is not in dispute, and the problem of averaging is replaced by the relatively simple procedure of accumulating. (This is not to claim that costing has no problems, quite the contrary, but the kinds of problems that arise are more pragmatic than conceptual.) Many of the intermediate variables, such as inter-indexer consistency, however, present much the same kinds of problems as retrieval performance, though they have not received the same amount of attention.

2.3 Some examples

Detailed discussions of particular experiments are well represented in the chapters that follow, and I do not wish to pre-empt such analyses. However, it is appropriate at this point to look briefly at some experiments that have taken place, in order to illustrate the above account of the 'normal' or archetypal retrieval test, and some variants on the archetype. References are given in the bibliography at the end of this chapter.

Cranfield 2

The second Cranfield experiment (which is described much more fully by Sparck Jones in Chapter 13) was a laboratory experiment, undertaken with the object of shedding light on the construction of index languages, and the effect of different rules of construction on retrieval performance. Thus almost all of the 'system', with the exception of the translation of raw indexing into a formal language, was chosen to be as simple and unobtrusive as possible. The translation step, on the other hand, was done in a large number of alternative ways, thus generating a large number of alternative systems.
The main aim of the project was to decide which of these alternative systems performed well and which badly.
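As a concrete illustration of the recall and precision measures defined earlier in this section, the following is a minimal sketch, in Python, of how the two proportions might be computed for a single request. The function name and the document identifiers are hypothetical and chosen purely for illustration; they do not come from any of the experiments described here.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single request.

    recall    = proportion of the relevant documents that are retrieved
    precision = proportion of the retrieved documents that are relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical single request: 4 of the 6 relevant documents are retrieved
# (recall = 4/6), and 4 of the 10 retrieved documents are relevant
# (precision = 4/10).
retrieved_docs = ["d01", "d02", "d03", "d04", "d05", "d06", "d07", "d08", "d09", "d10"]
relevant_docs = ["d02", "d04", "d06", "d08", "d11", "d12"]

print(recall_precision(retrieved_docs, relevant_docs))
```

Averaging such per-request figures over a set of requests is, as noted above, a separate and less straightforward matter, taken up later in this chapter and by van Rijsbergen in the next.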