effectiveness is to select a point in terms of E, which might be the best, worst or some other point, and then compare it with E for the set output. Once a single-number measure has been adopted, statistical summaries for sets of queries become straightforward, and interpolation and extrapolation are not needed. Of course in conflating the precision-recall (P-R) graph to one value there is a loss of information, but this is not as severe as it may appear at first sight. There is a certain amount of evidence now that, no matter what model is adopted for retrieval, the precision-recall graphs are constrained to some extent. In fact it is not difficult to prove that under the probability ranking principle expected recall and expected precision are inversely related. This means that given any E value for a point on the P-R graph, E values for other points are constrained by this trade-off. This is not to say that there is a functional relationship between P and R, but that there is 'almost' one. In this sense the loss of information is not so severe, although it must be admitted that this loss has not been quantified. (A common written form of E is given at the end of this section.)

Further difficulties arise in evaluation when attempting to measure the comparative effectiveness of relevance feedback strategies. In these strategies certain documents are looked at on a first iteration to establish the parameters for the second iteration. Typically the documents looked at are the top documents in a ranking; the remaining documents are unsighted. To establish the effectiveness of the feedback, we must somehow measure how feedback improves retrieval. The most sensible way of doing this is to generate a residual ranking for the second iteration, which is a ranking with the n feedback documents removed. This can then be compared with the ranking for the first iteration with the same n documents removed; in this latter case they are of course the top n documents. From these rankings, one for the first and one for the second iteration, precision-recall graphs can be generated. This method has been used extensively by Harper4 and Ide9 for evaluating feedback experiments. It neatly measures the effect of feedback on documents the user has not previously seen.
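The residual ranking comparison can be stated more concretely. The following minimal sketch, in Python, is illustrative only and is not drawn from the experiments cited; it assumes that each ranking is a list of document identifiers ordered best first, that the relevance judgements for a query are available as a set of identifiers, and that the n feedback documents are simply the top n of the first-iteration ranking. All function and variable names are invented for the sketch.

    # Residual-ranking evaluation of relevance feedback (illustrative sketch).
    # Rankings are lists of document ids, best first; relevance judgements for
    # the query are a set of ids.

    def residual(ranking, seen):
        # Drop the feedback documents the user has already looked at.
        return [d for d in ranking if d not in seen]

    def precision_recall(ranking, relevant, cutoff):
        # Precision and recall over the top `cutoff` documents of a ranking.
        retrieved = ranking[:cutoff]
        hits = sum(1 for d in retrieved if d in relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def compare_iterations(first_pass, second_pass, relevant, n, cutoff):
        # The n feedback documents are the top n of the first-iteration ranking.
        # They are removed from both rankings, and from the relevant set, so the
        # comparison reflects only documents the user has not previously seen.
        seen = set(first_pass[:n])
        unseen_relevant = set(relevant) - seen
        before = precision_recall(residual(first_pass, seen), unseen_relevant, cutoff)
        after = precision_recall(residual(second_pass, seen), unseen_relevant, cutoff)
        return before, after

Computing the two precision-recall pairs at a range of cutoffs yields the two residual precision-recall graphs that are then compared.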
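For reference, one commonly quoted form of the single-number measure E weights precision P against recall R by a parameter alpha, or equivalently beta; the definition given earlier in the chapter may differ from this in notation:

    E = 1 - \frac{1}{\dfrac{\alpha}{P} + \dfrac{1 - \alpha}{R}}
      = 1 - \frac{(1 + \beta^{2})\, P R}{\beta^{2} P + R},
    \qquad \alpha = \frac{1}{\beta^{2} + 1}

With beta = 1 precision and recall are weighted equally; choosing the parameter amounts to choosing which point of the P-R graph the single number is intended to summarize.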
3.5 Limits to retrieval

In evaluating the results of retrieval experiments, it is often important to establish the bounds on retrieval. Trivial bounds obviously exist in that retrieval effectiveness cannot exceed precision and recall jointly being 100 per cent, nor can it fall below both being 0 per cent. One interesting speculative question to ask is whether in fact we wish to design retrieval systems that achieve 100 per cent precision and recall. It is not too difficult to argue that this could be achieved for some specific query. Achieving 100 per cent precision and recall on the average for some unknown set of queries is a different matter. In designing any retrieval system we use certain models for the structures and processes involved. These models are necessarily an imperfect reflection of the reality they are trying to model. In particular, any model for relevance we might invoke will have an inherent uncertainty built into it. Therefore one would hypothesize that perfect retrieval is impossible or, to put it differently, that a retrieval system cannot be all things to all men.

Let us now look at a possible objection to the above claim of the impossibility of perfect retrieval. One might claim it is the primitive