IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
The outcome of 20 years' testing
Overall, the impression must be of how comparatively little the non-negligible
amount of work done has told us about the real nature of retrieval
systems. Of course, compared with areas like biological research, the number
of tests has been extremely small; and a point brought out by the survey is
how few really serious tests there have been. In his 1970 review of evaluation
tests Cleverdon119 includes perhaps a couple of dozen tests; and as the
present chapter suggests, the number would not be more than doubled 10
years later. One might nevertheless suppose that enough experimental and
investigative work had been done to provide some concrete information
about retrieval systems. Yet the most striking feature of the test history of the
past two decades is its lack of consolidation. It is true that some very broad
generalizations have been endorsed by successive tests: for example that
performance is pretty middling, or that different languages perform the
same; but there has been a real failure at the detailed level to build one test
on another. As a result there are no explanations for these generalizations,
and hence no means of knowing whether improved systems could be
designed.
It is of course unreasonable to expect a high degree of consistency in the
conduct of experiments: this would presuppose a framework for system
characterization and evaluation which does not exist. Conducting large test
programmes in document retrieval is also extremely laborious; it requires
resources which are not available to many individual projects. It is
nevertheless the case that the lack of solid results must be attributed primarily
to poor methodological standards. As the test details presented in this chapter
show, there is so little control in individual tests and so much variation in
method between tests that interpretations of the results of any one test or of
their relationships with those of others must be uncertain.
The general inadequacy of information retrieval tests, but also the practical
reasons for it, are best exhibited by looking again at the conditions for a
retrieval test. There are the data on one hand, the mechanism on the other.
The data consists of the actual documents and the actual user queries and
assessments. The mechanism consists of the indexing and searching
apparatus. We initially think of the data (D) as given, the mechanism (M) as
chosen, i.e. we exploit specific techniques in a specific environment, to obtain
a total retrieval system. The minimal system study then consists simply of
noting the performance of this system: call this D : M. We then recognize that
different mechanism options are available and, selecting some part of the
mechanism, say the indexing language, for study as the primary experimental
variable, we compare two of its values, say M11 and M12, in a test
D : M11/M12. We then consider connections between different parts of the
mechanism and proceed to relate the behaviour of variable M1 to that of
some other variable M2, say indexing exhaustivity, for a test with the
structure D : M11/M12 : M21/M22. In this we perhaps regard M2 as the
secondary variable. We can naturally extend the test series for any set M1,
M2, ..., Mn of mechanism variables we choose to examine.
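The factorial structure of such a test series can be sketched programmatically. The following is a minimal illustration, not drawn from any actual test: the variable labels (indexing language, indexing exhaustivity) follow the examples in the text, and the enumeration simply spells out the runs that a test of the form D : M11/M12 : M21/M22 implies.

```python
from itertools import product

# Hypothetical mechanism variables in the D : M notation of the text.
# M1 = indexing language, M2 = indexing exhaustivity; each takes two values.
mechanism_variables = {
    "M1": ["M11", "M12"],
    "M2": ["M21", "M22"],
}

# A test D : M11/M12 : M21/M22 runs every combination of mechanism
# values against the same fixed data D.
conditions = list(product(*mechanism_variables.values()))
for condition in conditions:
    print("D : " + " : ".join(condition))
```

Extending the series to further variables M3, ..., Mn (or to data variables such as D11/D12) only adds factors to the product, which is why the number of runs, and hence the labour of a controlled comparison, grows multiplicatively.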
But of course, from the point of view of understanding retrieval systems
in general, D is as important as M. The behaviour of retrieval systems
is a function of both D and M. We should therefore consider the constituent
variables of D, say types of user, giving us D11/D12 : M11/M12 :
M21/M22, and, further, say different document types, giving us