That said, the measures of performance or effectiveness used in the
majority of retrieval tests are the well-known recall and precision. Ignoring
for the moment the problems concerned with averaging over requests, these
measures are usually defined as follows:
Recall = Proportion of the relevant documents that are retrieved;
Precision = Proportion of the retrieved documents that are relevant.
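These definitions can be made concrete with a small worked example. In the following sketch (the document identifiers are hypothetical, chosen only for illustration), five documents are relevant to a request and six are retrieved, three of them relevant; recall is therefore 3/5 and precision 3/6.

```python
# Illustrative sketch: recall and precision for a single request,
# computed from hypothetical sets of document identifiers.

relevant = {1, 2, 3, 4, 5}       # documents judged relevant to the request
retrieved = {3, 4, 5, 6, 7, 8}   # documents retrieved by the system

hits = relevant & retrieved      # relevant documents that were retrieved

recall = len(hits) / len(relevant)       # 3/5 = 0.60
precision = len(hits) / len(retrieved)   # 3/6 = 0.50

print(f"Recall = {recall:.2f}, Precision = {precision:.2f}")
```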
Clearly, these measures relate closely to the ideas discussed above concerning
performance limits and failure analysis; a relevant document not retrieved or
a non-relevant document retrieved is to be regarded as a failure, and the
implicit suggestion is of an ideal performance of 100 per cent recall and 100
per cent precision. It should be noted, however, that there may be other
(lower) limits to performance: reasons why some of these 'failures' are
inevitable.
These questions are taken up again, in more detail, by van Rijsbergen in
the next chapter. The bulk of the rest of this chapter is concerned, in rather
general terms, with the problem of making inferences from the results of
retrieval tests.
Aside from questions of performance, the main category of measurements
discussed above was that of costs. Generally speaking, measurement of costs
does not present the same kind of intellectual problems as measurement of
performance, in that (for example) the final measure is not in dispute, and the
problem of averaging is replaced by the relatively simple procedure of
accumulating. (This is not to claim that costing has no problems, quite the
contrary, but the kinds of problems that arise are more pragmatic than
conceptual.) Many of the intermediate variables such as inter-indexer
consistency, however, present much the same kinds of problems as retrieval
performance, though they have not received the same amount of attention.
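To indicate the kind of measurement involved, one common formulation of inter-indexer consistency is the overlap between the term sets two indexers assign to the same document. The sketch below uses a Jaccard-style overlap with invented index terms; both the choice of measure and the terms are illustrative assumptions, not taken from any particular experiment.

```python
# Illustrative sketch: inter-indexer consistency as the overlap
# between the term sets two indexers assign to one document.

indexer_a = {"aerodynamics", "boundary layer", "wing", "turbulence"}
indexer_b = {"aerodynamics", "wing", "lift", "turbulence"}

common = indexer_a & indexer_b   # terms assigned by both indexers
union = indexer_a | indexer_b    # terms assigned by either indexer

consistency = len(common) / len(union)   # 3/5 = 0.60
print(f"Inter-indexer consistency = {consistency:.2f}")
```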
2.3 Some examples
Detailed discussions of particular experiments are well represented in the
chapters that follow, and I do not wish to pre-empt such analyses. However,
it is appropriate at this point to look briefly at some experiments that have
taken place, in order to illustrate the above account of the 'normal' or
archetypal retrieval test, and some variants on the archetype. References are
given in the bibliography at the end of this chapter.
Cranfield 2
The second Cranfield experiment (which is described much more fully by
Sparck Jones in Chapter 13) was a laboratory experiment, undertaken with
the object of shedding light on the construction of index languages, and the
effect of different rules of construction on retrieval performance. Thus almost
all of the 'system', with the exception of the translation of raw indexing into
a formal language, was chosen to be as simple and unobtrusive as possible.
The translation step, on the other hand, was done in a large number of
alternative ways, thus generating a large number of alternative systems. The
main aim of the project was to decide which of these alternative systems
performed well and which badly.