Information Retrieval Experiment
Laboratory tests of manual systems
E. Michael Keen

little special control in construction. The clearest use of this technique is the Aberystwyth 'off-shelf' test6, where six printed indexes were taken off the library shelves and searched in a laboratory. The existence of these rather different test techniques, yet all conducted within laboratory environments, suggests that the problem of testing manual systems is that of controlling the rioting variables without distorting them, or so torturing them that they become skeletons of reality. Of course the answer to control with realism will never be found, at least to the satisfaction of the perfectionist or sceptic, but the researcher is always striving for improved ways of steering between this Scylla and Charybdis. Are manual systems more difficult than automated ones in this respect? Do manual systems contain more human and behavioural variables? It is doubtful whether logical analysis would distinguish any fundamental differences, but in the area of searching in particular the sheer abundance of possibilities for variability and human error is surely greater in the evaluation of manual systems than of automated ones.

Evaluation test validity

What constitutes a valid evaluation test? Why can we regard everything done before Cranfield 1 (and some later work) as inadequate to answer questions of the merit of retrieval systems? Cyril Cleverdon himself regards some tests as incomplete7, and says that as a result they do not advance the state of knowledge about information retrieval. The three things he requires of a test are a collection of documents under test, a set of search requests, and some relevance decisions that identify documents relevant to the requests. These requirements need not be met in a 'real' manner: even a set of simulated documents, requests and relevance decisions could be used. Thus a valid test must involve the total environment of information retrieval even if only one small subsystem is under investigation, such as variations in term order in printed index entries.

In addition to these three desiderata, a further three seem necessary. First, to be acceptable as a test, the measures and performance criteria used must be adequate. In particular, a measure or valid estimate of system recall needs to be made, as well as the more easily obtainable precision performance (or a substitute for it); the standard ratio definitions are sketched below. These are necessary whether or not efficiency measurements are also being made, since measures of efficiency or cost cannot be interpreted without knowing what quality is being provided for the expenditure incurred. Another aspect of this is that a sufficiently comprehensive set of performance criteria be measured. For operational testing this means all the criteria covering effectiveness and efficiency, but in a laboratory situation a criterion such as cost would not be appropriate unless an accurate simulation was involved. System coverage and currency would also hardly apply.
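As an aside on the recall and precision ratios just referred to, the following minimal sketch shows how the two measures are computed for a single request from a set of retrieved documents and a set of relevance decisions. It is written in Python purely for illustration; the document identifiers and the function name recall_and_precision are hypothetical and are not taken from any of the tests discussed here.

# Minimal sketch of the standard recall and precision ratios for one request.
# The document identifiers below are hypothetical examples.

def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one search request.

    recall    = relevant documents retrieved / all relevant documents
    precision = relevant documents retrieved / all documents retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision

if __name__ == "__main__":
    retrieved = {"d1", "d2", "d3", "d4"}   # documents found by the search
    relevant = {"d2", "d4", "d7"}          # documents judged relevant to the request
    r, p = recall_and_precision(retrieved, relevant)
    print(f"recall = {r:.2f}, precision = {p:.2f}")  # recall = 0.67, precision = 0.50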
But 'off-shelf' testing needs a wide range of criteria, and the Aberystwyth test covered hits (recall), waste (precision), search time, search effort (page turns), presentation clarity (relevance prediction), and usability (subjective preferences of searchers). Cranfield 1 concentrated on recall, and also systematically varied indexing time, but the sample precision ratios included in the final report reveal the growing understanding of what constitutes adequate effectiveness measurement. The deep laboratory test may well