Information Retrieval Experiment
Laboratory tests of manual systems
E. Michael Keen
Edited by Karen Sparck Jones
Butterworth & Company
little special control in construction. The clearest use of this technique is the
Aberystwyth `Off-shelf' test6 where six printed indexes were taken off the
library shelves and searched in a laboratory.
The existence of these rather different test techniques, yet all conducted
within laboratory environments, suggests that the problem of testing manual
systems is that of controlling the rioting variables without distorting them or
so torturing them that they become skeletons of reality. Of course the answer
to control with realism will never be found, at least to the satisfaction of the
perfectionist or sceptic, but the researcher is always striving for improved
ways of steering between this Scylla and Charybdis. Are manual systems
more difficult than automated ones in this respect? Do manual systems
contain more human and behavioural variables? It is doubtful if logical
analysis would distinguish any fundamental differences, but in the area of
searching in particular the sheer abundance of possibilities for variability
and human error is surely greater in the evaluation testing scene of manual
systems than of automated ones.
Evaluation test validity
What constitutes a valid evaluation test? Why can we regard everything
done before Cranfield 1 (and some later work) as inadequate to answer
questions of the merit of retrieval systems? Cyril Cleverdon himself regards
some tests as incomplete7, and says that as a result they do not advance the
state of knowledge about information retrieval. The three things he requires
of a test are that there is a collection of documents under test, a set of search
requests, and some relevance decisions that identify documents relevant to
the requests. These requirements need not be met in a `real' manner: even a
set of simulated documents, requests and relevance decisions could be used.
Thus a valid test must involve the total environment of information retrieval
even if only one small subsystem is under investigation, such as variations in
term order in printed index entries.
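Stated compactly (a notational convenience of this summary rather than Cleverdon's own formulation), the minimum apparatus is a triple

\[
T = (D, Q, R), \qquad R \subseteq Q \times D
\]

where \(D\) is the collection of documents under test, \(Q\) the set of search requests, and \(R\) the relevance decisions pairing each request with the documents judged relevant to it. As noted above, any or all of the three components may be simulated rather than drawn from a live service.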
In addition to these three desiderata, a further three seem necessary. First, to be acceptable as a test, the measures used and the performance criteria measured must be adequate. In particular, a measure or valid estimate of system recall needs to be made, as well as the more easily obtainable precision performance (or a substitute for it). These are necessary whether efficiency measurements are also being made or not, since measures of efficiency or cost cannot be interpreted without knowing what quality is being provided for the expenditure incurred.
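As a reminder of the quantities involved (these are the standard ratio definitions, not wording taken from the chapter itself):

\[
\text{recall} = \frac{\text{relevant documents retrieved}}{\text{relevant documents in the collection}}, \qquad
\text{precision} = \frac{\text{relevant documents retrieved}}{\text{documents retrieved}}
\]

The precision denominator can be read off the search output itself, which is why precision is the more easily obtainable figure, whereas the recall denominator requires relevance decisions covering documents the search did not retrieve.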
Another aspect of this is that a sufficiently comprehensive set of performance criteria be measured. For operational testing this
means all the criteria covering effectiveness and efficiency, but in a laboratory
situation a criterion such as cost would not be appropriate unless an accurate
simulation was involved. System coverage and currency would also hardly
apply. But `off-shelf' testing needs a wide range of criteria, and the
Aberystwyth test covered hits (recall), waste (precision), search time, search
effort (page turns), presentation clarity (relevance prediction), and usability
(subjective preferences of searchers). Cranfield 1 concentrated on recall, and
also systematically varied indexing time, but the sample precision ratios
included in the final report reveal the growing understanding there was of the
adequacy of effectiveness measurement. The deep laboratory test may well