IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
Karen Sparck Jones
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
formulation breadth as a secondary variable. In general the language tests
were concerned with post-coordinate systems, though it has to be recognized
that many thesauri include compound terms and so allow a kind of hybrid
pre- and post-coordinate indexing. Keen's ISILT test included pre-coordinate
languages, and the printed subject indexes of EPSILON embody pre-
coordination. The most ambitious test of a thorough pre-coordinate system
is represented by Yates-Mercer's non-comparative evaluation of Farradane's
relational indexing102. It is noteworthy that classifications figured much less
largely in the tests of this decade than in those of the previous one.
The motivation for these tests was generally the straightforward one of
simply comparing the languages concerned, or perhaps of evaluating natural
language compared with a given controlled language. The underlying
assumption tended to be either that the different languages behave much the
same, or more particularly, that natural language is competitive, as in Salton's
Medlars test, for example. Yates-Mercer's investigation is thus noteworthy
in that it was explicitly aimed, in contrast, at justifying a very sophisticated
relational approach. Of course in all these tests, as in those of the previous
decade, the implicit assumption is that the indexing language used in a
retrieval system is important.
In form these tests tended to follow by now standard practice, though with
rather more emphasis on real user requests, with evaluation ordinarily by
precision and recall or, for the larger document sets, relative recall. Aitchison
et al.'s and Keen's tests used exhaustive recall, but the majority of the tests
were restricted to recall relative to collection subsets. The collections used
tended, as before, to be rather small: only Barker et al. used more than 100
requests. However they, Miller, Olive et al. and Cleverdon87 all used large
document sets represented by regular service files. The numbers are not
always given, but Miller, for example, searched some 210 000 documents.
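
The distinction between exhaustive and relative recall can be illustrated with a minimal sketch. It assumes, for illustration only, that relative recall is recall computed against the pool of relevant documents retrieved by at least one of the compared systems, rather than against all relevant documents in the collection; individual tests may have defined their subsets differently. The function names, run labels and document identifiers are hypothetical.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def pooled_relevant(runs, judged_relevant):
    """Relevant documents found by at least one run (assumed pooling rule)."""
    pool = set()
    for retrieved in runs.values():
        pool |= set(retrieved) & set(judged_relevant)
    return pool


def relative_recall(retrieved, runs, judged_relevant):
    """Recall measured against the pooled relevant set only."""
    return recall(retrieved, pooled_relevant(runs, judged_relevant))


# Hypothetical data: two indexing languages searched for one request.
runs = {
    "controlled": ["d1", "d2", "d3", "d4"],
    "natural": ["d2", "d5", "d6"],
}
# All relevant documents in the collection; d9 is found by neither run.
judged_relevant = {"d1", "d2", "d5", "d9"}

for name, retrieved in runs.items():
    print(
        name,
        round(precision(retrieved, judged_relevant), 3),
        round(recall(retrieved, judged_relevant), 3),
        round(relative_recall(retrieved, runs, judged_relevant), 3),
    )
```

Because the pool omits relevant documents that no run retrieved, relative recall can never be lower than exhaustive recall for the same output, which is one reason figures of the two kinds should not be compared directly.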
With respect to the test results, again considering broadly comparable tests
in terms of both objective and conduct, the specific findings as before show
very different values for precision and recall, again not surprisingly for
relative recall. Thus individual project ranges for precision were from 12 to
15 per cent for Miller (my calculation from his thesis68), from 27 to 51 per
cent for Cleverdon (calculated by extrapolation101), from 46.3 to 89.9 per
cent for Klingbiel and Rinker, and from 38.8 to 66.0 per cent for Barker et
al.; for relative recall from 30-52 per cent for Cleverdon, 51-64 per cent for
Miller, 49.2-73.3 per cent for Klingbiel and Rinker, to 56.4-95.7 per cent for
Barker et al., with absolute recall ranging from approximately 4-28 per cent
for Aitchison et al. to 57.9-92.3 per cent for Keen. For this group as a whole
precision ranged from 12 per cent (Miller) to 66.0 per cent (Barker et al.), and
absolute recall from about 3 per cent (Aitchison) to 100 per cent (Keen).
Other measures, like the numbers of non-relevant documents retrieved used
by Keen, ranged from medians of 8.6 to 24.4. These are figures for simple sets
of retrieved documents. A significant feature of the work of this period was
the use of ranked output, for which performance representations may be
obtained, with the document cutoff methods used by Aitchison et al., for
example, or the recall cutoff techniques used by the Smart Project. Graph
comparison presents many problems; it may therefore simply be noted that,
for the same calculation methods, large and presumably significant
differences between graphs appear in Aitchison et al., for instance. Thus in