Information Retrieval Experiment
Chapter 9: Laboratory tests: automatic systems
Robert N. Oddy
Editor: Karen Sparck Jones
Publisher: Butterworth & Company
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
evidence for general hypotheses by conducting a series of experiments. A
comparative test which indicates, on the basis of one test collection, that one
setting, A, of certain factors gives better performance than another setting,
B, can be repeated on a number of other collections. In fact, both Sparck
Jones and Bates6 and Salton8 have reported that a number of results hold for
several of their test collections. A comparison6 of the results of similar
experiments by different groups of researchers, however, shows that there is
often broad agreement, but that the situation is confused by variations in the
evaluation techniques: the various performance averaging methods give
materially different figures34.
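The point about averaging methods can be made concrete. The sketch below (not from the chapter; the query figures are invented) contrasts two common ways of averaging precision over a set of test queries: averaging per-query precision values ("macro" averaging) versus pooling the raw counts first ("micro" averaging). The two can disagree substantially on the same data, which is one way comparisons between groups become confused.

```python
# Toy illustration: two averaging methods applied to the same results.
# Each query is represented as (relevant retrieved, total retrieved);
# the figures are hypothetical.

def macro_precision(per_query):
    """Compute precision separately per query, then average the values."""
    return sum(rel / ret for rel, ret in per_query) / len(per_query)

def micro_precision(per_query):
    """Pool the counts over all queries, then compute one precision."""
    total_rel = sum(rel for rel, _ in per_query)
    total_ret = sum(ret for _, ret in per_query)
    return total_rel / total_ret

queries = [(9, 10), (1, 10), (5, 100)]

print(round(macro_precision(queries), 3))  # 0.35
print(round(micro_precision(queries), 3))  # 0.125
```

Here the macro average is pulled up by the two small, high-precision queries, while the micro average is dominated by the one query that retrieved 100 documents; a system comparison could plausibly come out differently under the two figures.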
9.4 Experimental objectives
What questions are tests of the type I am discussing designed to answer?
What are the strengths and limitations of the methodology? Criticisms of the
methodology, usually pointing out lack of realism, are so common as to be a
part of the information retrieval folklore. Experimenters often acknowledge
the problem by qualifying their results appropriately. So, how successful has
the methodology turned out to be?
In his review of theoretical work in information retrieval, Robertson15
discusses the role of experimental work, and distinguishes between
experiments which test the assumptions on which a theory is based, and
those which test theories by evaluating the retrieval effectiveness of systems
based upon them. There have been very few experiments fulfilling the former
role; I shall have more to say about this presently. Laboratory experiments
are usually intended to determine the effect of some input parameter or
system design feature on retrieval effectiveness, that is, on the system's
ability to retrieve relevant documents. If the researcher views his tests as a
series of engineering trials, this is the obvious approach: he is simply
determining whether he has achieved his objective. It is not so obvious that
it is the right approach if the researcher's objectives are scientific, in other
words, if he wishes to test a theory. Recently, as Robertson15 points out,
theories have arisen which explicitly relate retrieval effectiveness to system
parameters. (For instance, one first shows that ranking documents according
to probability of relevance gives optimal retrieval performance, in some
sense35, and then proceeds to devise ways of estimating that probability12.)
Even in such cases, an experimenter might be accused of impatience if he
moves directly to a test of retrieval performance. There are assumptions to
test: if the document collection or the system's users do not conform to the
assumptions, what can the experimental result tell us about the theory? In
general, nothing! If the result of the test is good, he may have an engineering
success, but it is not a scientific one, because he still does not know why the
system works as well as it does.
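As a minimal sketch of the theoretical strategy just described, the code below ranks documents by an estimate of each document's probability of relevance. The weighting formula is in the spirit of the relevance-weighting work cited in the chapter, but the function names, the smoothing constants, and all the statistics are illustrative assumptions, not the chapter's own formulation.

```python
import math

def rsj_weight(r, n, R, N):
    """Relevance weight for a term occurring in r of R known relevant
    documents and in n of N documents overall (0.5 smooths zero counts).
    Hypothetical helper; one common form of relevance weighting."""
    return math.log(((r + 0.5) / (R - r + 0.5)) /
                    ((n - r + 0.5) / (N - n - R + r + 0.5)))

def rank_by_relevance(docs, query, stats, R, N):
    """docs: {doc_id: set of index terms}; stats: {term: (r, n)}.
    A document's score is the sum of the weights of the query terms
    it contains; documents are returned best first."""
    w = {t: rsj_weight(*stats[t], R=R, N=N) for t in query}
    scored = [(sum(w[t] for t in query if t in terms), d)
              for d, terms in docs.items()]
    return [d for _, d in sorted(scored, reverse=True)]

# Invented collection statistics: 1000 documents, 10 judged relevant.
docs = {"d1": {"retrieval", "probability"},
        "d2": {"probability"},
        "d3": {"library"}}
stats = {"retrieval": (8, 20), "probability": (5, 40)}

print(rank_by_relevance(docs, ["retrieval", "probability"],
                        stats, R=10, N=1000))  # ['d1', 'd2', 'd3']
```

The sketch also makes the experimenter's dilemma visible: a good ranking here tells us nothing about *why* the system worked unless the independence and distribution assumptions behind the weights have themselves been tested.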
It is not within the scope of this chapter to survey the results of laboratory
experiments in information retrieval over the past years (see those of Sparck
Jones and Salton). I shall therefore confine my account to what is relevant to
methodological issues. The effects of various factors on retrieval performance
have been studied with the aid of test collections. The factors can be regarded
as falling into two broad categories, although the boundary is indistinct.