IRE. Information Retrieval Experiment. Chapter: Retrieval system tests 1958-1978. Karen Sparck Jones. Butterworth & Company.
D11/D12:D21/D22:M11/M12:M21/M22; and we can clearly continue, for as many variables and values of each as we can identify.
In general, retrieval system tests have exhibited biases in the way they have approached this set of study possibilities. Much more attention has been paid to the mechanism variables M than to the data variables D. The mechanism variables have been made explicit, the data ones left implicit: in other words, though test authors have often paid lip service to the possible influence of their data variable values on their results, they have nevertheless tended to characterize the entire system performance in terms of the mechanism variables studied; variable D has been left undifferentiated, while perhaps several values of a single M variable, or a few values of several M variables, have been examined. It has not, moreover, been open to third parties to put different tests together on the grounds that while their data variable values have differed their mechanism variable values have been the same, so that amalgamating the tests would permit the effects of data variation to be examined: the mechanism variables have generally not been identically or sufficiently similarly treated.
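The set of study possibilities described above can be made concrete with a small sketch. It assumes, purely for illustration, that D11 and D12 are two values of a data variable D1, D21 and D22 two values of a data variable D2, and likewise for the mechanism variables M1 and M2; the variable names and the helper `configurations` are hypothetical and are not taken from any of the tests surveyed. The sketch enumerates the full set of value combinations and contrasts it with a test that varies only the mechanism variables while leaving the data variables at a single, undifferentiated setting.

```python
from itertools import product

# Hypothetical variables and values, named after the notation in the text.
# Assumption: D11 and D12 are two values of data variable D1, and so on.
variables = {
    "D1": ["D11", "D12"],
    "D2": ["D21", "D22"],
    "M1": ["M11", "M12"],
    "M2": ["M21", "M22"],
}


def configurations(variables, fixed=None):
    """Enumerate every combination of values, holding `fixed` variables constant."""
    fixed = fixed or {}
    names = list(variables)
    # A fixed variable contributes a single choice; a free one contributes all its values.
    choices = [[fixed[n]] if n in fixed else variables[n] for n in names]
    return [dict(zip(names, combo)) for combo in product(*choices)]


# Every combination of data and mechanism values.
full_design = configurations(variables)

# A test that explores the mechanism variables but uses only one data setting.
mechanism_only = configurations(variables, fixed={"D1": "D11", "D2": "D21"})

print(len(full_design))     # 16
print(len(mechanism_only))  # 4
```

Under these assumptions a test of the second kind covers only a quarter of the possible configurations, and none of those it covers differ in their data values, so it cannot by itself say anything about the effects of data variation.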
Some projects, like those of Salton and Sparck Jones, have begun to tackle this problem by working with more than one data set; but it has to be recognized (as the data details of Sparck Jones and Bates⁹⁵ make plain) that the characterization and control of data variables in these test series is much less systematic even than that of the mechanism variables. It is moreover generally the case that where the same data have been used by different projects, the treatment of the mechanism variables has been too heterogeneous for it to be possible to combine the test results to obtain information about an extended set of mechanism variable values.
12.8 Methodological and substantive achievements
Thus if we accept that a proper understanding of retrieval systems can be achieved only with the aid of both a well-organized descriptive framework and extensive series of experiments, each bearing on the other, and look now at the evidence of the chapter survey, what methodological and substantive progress has been made in achieving this understanding?
If we compare, say, Montague's test of 1965¹⁹ with Evans' of 1975⁶¹,⁶², we can detect some methodological improvements and a substantive development: Montague's test was vitiated by the use of incomparable query sets and incomparable document sets, i.e. the individual experiments in the group could not be compared usefully with one another because the data sets used differed. In many of them the query set used was also very small. Evans used a constant set of queries and documents for a range of comparisons, and a somewhat larger query set than any of Montague's. The substantive development is represented by the shift from document indexing, studied in Montague's test, to query formulation, the focus of Evans' test. At the same time, the difference between the tests is not as large as might be hoped for, in methodological solidity or depth of understanding.
While Montague's test is open to criticism in mixing real and synthetic queries, in Evans' the amount of output assessed for relevance per query was somewhat arbitrarily varied. Again, while Montague's test explored a variety of rather arbitrarily related