IRE
Information Retrieval Experiment
Retrieval effectiveness
chapter
Cornelis J. van Rijsbergen
Butterworth & Company
Karen Sparck Jones
effectiveness is to select a point in terms of E (this might be the best, worst,
or some other point) and then compare it with E for the set output. Once a
single-number measure has been adopted, statistical summaries for sets of
queries become straightforward, and interpolation and extrapolation are not needed.
Of course in conflating the precision-recall (P-R) graph to one value there is
loss of information, but this is not as severe as it may appear at first sight.
There is a certain amount of evidence now that no matter what model is
adopted for retrieval, the precision-recall graphs are constrained to some
extent. In fact it is not difficult to prove that under the probability ranking
principle expected recall and expected precision are inversely related. This
means that given any E value for a point on the P-R graph, E values for other
points are constrained by this trade-off. This is not to say that there is a
functional relationship between P and R, but that there is 'almost' one. In this
sense the loss of information is not so severe, although it must be admitted
that this loss has not been quantified.
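The single-number summary discussed above can be illustrated with a small computation. The sketch below assumes the usual form of van Rijsbergen's E measure, E = 1 − 1/(α/P + (1 − α)/R), where the weighting parameter α (not defined in this excerpt) balances precision against recall; α = 0.5 weights them equally.

```python
def e_measure(precision, recall, alpha=0.5):
    """E = 1 - 1 / (alpha/P + (1-alpha)/R).

    alpha weights precision against recall; with alpha = 0.5,
    1 - E is the harmonic mean of P and R. E ranges from 0
    (perfect retrieval) to 1 (worst case).
    """
    if precision == 0.0 or recall == 0.0:
        return 1.0  # no effective retrieval
    return 1.0 - 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# Perfect retrieval: P = R = 1 gives E = 0.
print(e_measure(1.0, 1.0))   # 0.0
# Equal P and R of 0.5 give E = 0.5 when alpha = 0.5.
print(e_measure(0.5, 0.5))   # 0.5
```

Because E collapses a whole precision-recall graph to one value per chosen operating point, averaging E over a set of queries is a simple arithmetic mean, with no interpolation required.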
Further difficulties arise in evaluation when attempting to measure the
comparative effectiveness of relevance feedback strategies. In these strategies
certain documents are looked at on a first iteration to establish the parameters
for the second iteration. Typically the documents looked at are the top
documents in a ranking; the remaining documents are unsighted. To establish
the effectiveness of the feedback, we must somehow measure how feedback
improves retrieval. The most sensible way of doing this is to generate a
residual ranking for the second iteration, that is, a ranking with the n
feedback documents removed. This can then be compared with the ranking
for the first iteration with the same n documents removed; in the latter case
these are of course the top n documents. From these rankings, one for the first
and one for the second iteration, precision-recall graphs can be generated.
This method has been used extensively by Harper4 and Ide9 for evaluating
feedback experiments. It neatly measures the effect of feedback on documents
the user has not previously seen.
3.5 Limits to retrieval
In evaluating the results of retrieval experiments, it is often important to
establish the bounds on retrieval. Trivial bounds obviously exist in that
retrieval effectiveness cannot exceed precision and recall jointly being 100
per cent, nor can it fall below both being 0 per cent.
One interesting speculative question to ask is whether in fact we wish to
design retrieval systems that achieve 100 per cent precision and recall. It is
not too difficult to argue that this could be achieved for some specific query.
Achieving 100 per cent precision and recall on the average for some unknown
set of queries is a different matter. In designing any retrieval system we use
certain models for the structures and processes involved. These models are
necessarily an imperfect reflection of the reality they are trying to model. In
particular any model for relevance we might invoke will have built in an
inherent uncertainty. Therefore one would hypothesize that perfect retrieval
is impossible, or to put it differently, that a retrieval system cannot be all
things to all men. Let us now look at a possible objection to the above claim
of the impossibility of perfect retrieval. One might claim it is the primitive