evidence for general hypotheses by conducting series of experiments. A comparative test which indicates, on the basis of one test collection, that one setting, A, of certain factors gives better performance than another setting, B, can be repeated on a number of other collections. In fact, both Sparck Jones and Bates6 and Salton8 have reported that a number of results hold for several of their test collections. A comparison6 of the results of similar experiments by different groups of researchers, however, shows that there is often broad agreement, but that the situation is confused by variations in the evaluation techniques: the various performance averaging methods give materially different figures34.

9.4 Experimental objectives

What questions are tests of the type I am discussing designed to answer? What are the strengths and limitations of the methodology? Criticisms of the methodology, usually pointing out its lack of realism, are so common as to be part of the folklore of information retrieval. Experimenters often acknowledge the problem by qualifying their results appropriately. So, how successful has the methodology turned out to be?

In his review of theoretical work in information retrieval, Robertson15 discusses the role of experimental work, and distinguishes between experiments which test the assumptions on which a theory is based and those which test theories by evaluating the retrieval effectiveness of systems based upon them. There have been very few experiments fulfilling the former role; I shall have more to say about this presently.

Laboratory experiments are usually intended to determine the effect of some input parameter or system design feature on retrieval effectiveness, that is, on the system's ability to retrieve relevant documents. If the researcher views his tests as a series of engineering trials, this is the obvious approach: he is simply determining whether he has achieved his objective. It is not so obvious that it is the right approach if the researcher's objectives are scientific, in other words, if he wishes to test a theory. Recently, as Robertson15 points out, theories have arisen which explicitly relate retrieval effectiveness to system parameters. (For instance, one first shows that ranking documents according to probability of relevance gives optimal retrieval performance, in some sense35, and then proceeds to devise ways of estimating that probability12.) Even in such cases, an experimenter might be accused of impatience if he moves directly to a test of retrieval performance. There are assumptions to test: if the document collection, or the system's users, do not conform to the assumptions, what can the experimental result tell us about the theory? In general, nothing! If the result of the test is good, he may have an engineering success, but it is not a scientific one, because he still does not know why the system works as well as it does.
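To make the last point concrete, the sketch below ranks documents by an estimated probability of relevance in the manner of a binary independence model. It is an illustration only, not a procedure taken from the work cited: the toy collection, the probability estimates and all names are invented, and in a real test the estimates would come from relevance data.

```python
import math

# Toy collection: each document is a set of index terms (invented).
docs = {
    "d1": {"retrieval", "evaluation", "test"},
    "d2": {"retrieval", "probability", "ranking"},
    "d3": {"indexing", "vocabulary"},
}

# Hypothetical estimates, for each query term, of
#   p = P(term present | document relevant)
#   q = P(term present | document non-relevant).
term_stats = {
    "retrieval":   (0.8, 0.3),
    "probability": (0.6, 0.1),
}

def term_weight(p, q):
    # Relevance weight log[p(1 - q) / (q(1 - p))] of the
    # binary independence model.
    return math.log(p * (1 - q) / (q * (1 - p)))

def score(doc_terms):
    # Summing the weights of the query terms present in a document
    # ranks documents in order of estimated probability of relevance,
    # under the model's independence assumptions.
    return sum(term_weight(p, q)
               for term, (p, q) in term_stats.items()
               if term in doc_terms)

ranking = sorted(docs, key=lambda d: score(docs[d]), reverse=True)
print(ranking)  # ['d2', 'd1', 'd3'] -- d2 matches both query terms
```

Note that a good performance figure from such a system would not, by itself, tell us whether the model's assumptions (for instance, that terms occur independently) hold for the collection; that is precisely the distinction Robertson draws between the two kinds of experiment.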
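The remark above, that different performance averaging methods give materially different figures, is also easily illustrated. The sketch below contrasts macro-averaging (the mean of per-query precision) with micro-averaging (precision computed from counts pooled over all queries); the figures are invented.

```python
# Invented per-query figures: (relevant retrieved, total retrieved).
queries = [(8, 10), (1, 10), (3, 4)]

# Macro-average: precision per query, then the mean over queries.
macro = sum(r / n for r, n in queries) / len(queries)

# Micro-average: pool the counts over all queries, then divide.
micro = sum(r for r, _ in queries) / sum(n for _, n in queries)

print(f"macro = {macro:.3f}, micro = {micro:.3f}")
# macro = 0.550, micro = 0.500: the same runs, different figures.
```

The two figures differ because micro-averaging lets queries with many retrieved documents dominate, so comparisons across studies are only safe when the averaging method is reported.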
It is not within the scope of this chapter to survey the results of laboratory experiments in information retrieval over the past years (see the surveys by Sparck Jones and Salton). I shall therefore confine my account to what is relevant to methodological issues. The effects of various factors on retrieval performance have been studied with the aid of test collections. The factors can be regarded as falling into two broad categories, although the boundary is indistinct.