Information Retrieval Experiment
Chapter: Retrieval system tests 1958-1978
Karen Sparck Jones
Butterworth & Company

The outcome of 20 years' testing

Overall, the impression must be of how comparatively little the non-negligible amount of work done has told us about the real nature of retrieval systems. Of course, compared with areas like biological research, the number of tests has been extremely small; and a point brought out by the survey is how few really serious tests there have been. In his 1970 review of evaluation tests Cleverdon^119 includes perhaps a couple of dozen tests, and as the present chapter suggests, the number would not be more than doubled 10 years later. One might nevertheless suppose that enough experimental and investigative work had been done to provide some concrete information about retrieval systems. Yet the most striking feature of the test history of the past two decades is its lack of consolidation. It is true that some very broad generalizations have been endorsed by successive tests: for example, that performance is pretty middling, or that different indexing languages perform much the same; but there has been a real failure at the detailed level to build one test on another. As a result there are no explanations for these generalizations, and hence no means of knowing whether improved systems could be designed. It is of course unreasonable to expect a high degree of consistency in the conduct of experiments: this would presuppose a framework for system characterization and evaluation which does not exist.
Conducting large test programmes in document retrieval is also extremely laborious; it requires resources which are not available to many individual projects. It is nevertheless the case that the lack of solid results must be attributed primarily to poor methodological standards. As the test details presented in this chapter show, there is so little control in individual tests, and so much variation in method between tests, that interpretations of the results of any one test, or of their relationships with those of others, must be uncertain.

The general inadequacy of information retrieval tests, but also the practical reasons for it, are best exhibited by looking again at the conditions for a retrieval test. There are the data on one hand and the mechanism on the other. The data consist of the actual documents and the actual user queries and assessments; the mechanism consists of the indexing and searching apparatus. We initially think of the data (D) as given and the mechanism (M) as chosen, i.e. we exploit specific techniques in a specific environment to obtain a total retrieval system. The minimal system study then consists simply of noting the performance of this system: call this D:M. We then recognize that different mechanism options are available and, selecting some part of the mechanism, say the indexing language, for study as the primary experimental variable, we compare two of its values, say M11 and M12, in a test D:M11/M12. We then consider connections between different parts of the mechanism and proceed to relate the behaviour of variable M1 to that of some other variable M2, say indexing exhaustivity, for a test with the structure D:M11/M12:M21/M22. In this we perhaps regard M2 as the secondary variable. We can naturally extend the test series for any set M1, M2, ..., Mn of mechanism variables we choose to examine. But of course, from the point of view of understanding retrieval systems in general, D is as important as M.
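The factorial structure described above can be made concrete with a small sketch, not drawn from the original text: a fixed data environment D is crossed with two hypothetical mechanism variables, M1 (say, indexing language) and M2 (say, indexing exhaustivity), each taking two values, so that a test of the form D:M11/M12:M21/M22 is simply an enumeration of the cells of the cross-product.

```python
# Illustrative sketch of the test structure D:M11/M12:M21/M22.
# All variable names and values here are hypothetical labels,
# following the notation used in the text.
from itertools import product

D = ["D"]            # a single, given data environment
M1 = ["M11", "M12"]  # primary variable, e.g. two indexing languages
M2 = ["M21", "M22"]  # secondary variable, e.g. two exhaustivity levels

# Each cell of the cross-product is one system variant whose
# performance the test would measure and compare.
runs = [f"{d}:{m1}:{m2}" for d, m1, m2 in product(D, M1, M2)]
print(runs)
# → ['D:M11:M21', 'D:M11:M22', 'D:M12:M21', 'D:M12:M22']
```

Extending the series to variables M3, ..., Mn (or, as the next passage does, to constituent variables of D itself) just adds further factors to the product, which is why the number of runs, and hence the experimental effort, grows multiplicatively with each variable studied.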
The behaviour of retrieval systems is a function of both D and M. We should therefore consider the constituent variables of D, say types of user, giving us D11/D12:M11/M12:M21/M22, and, further, say different document types, giving us