IRE
Information Retrieval Experiment
Retrieval system tests 1958-1978
chapter
Karen Sparck Jones
Butterworth & Company
The decade 1958-1968
Actual evaluation tests included Dale and Dale's21, designed to examine
alternative clustering bases for establishing term associations, and similar
ones conducted by Smart workers29-32, especially Lesk's40, at A. D.
Little33,34, by Tague25, and by Dennis27. The Smart workers and Williams
and Perriens41 also investigated weighting. The Smart tests and Tague's
included comparisons between automatic and manual indexing.
These tests were too few and too heterogeneous for systematic comparison
under headings to be worthwhile. It is sufficient to note that, overall, they
tended to show rather little difference in performance for the various
statistical techniques studied. In comparisons with manual indexing,
generally using simple extracted terms but sometimes, as in some Smart tests,
thesaurus terms, the general conclusion was that performance is roughly
comparable.
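The statistical association techniques tested in this period were typically based on counting how often pairs of terms co-occurred in documents, and using the counts to link a request term with related terms. The following is a minimal modern sketch of that idea only; the data and term names are invented for illustration and are not drawn from any of the studies cited.

```python
from collections import Counter
from itertools import combinations

# Toy document set; the actual tests of the period used anything from a
# handful to a few thousand documents.
docs = [
    "retrieval system evaluation test",
    "statistical term association retrieval",
    "term weighting retrieval experiment",
    "evaluation of indexing experiment",
]

# Count co-occurrences of each distinct term pair within a document.
pair_counts = Counter()
for doc in docs:
    terms = sorted(set(doc.split()))
    for a, b in combinations(terms, 2):
        pair_counts[(a, b)] += 1

def associated(term, k=3):
    """Rank other terms by raw co-occurrence frequency with `term`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == term:
            scores[b] += n
        elif b == term:
            scores[a] += n
    return [t for t, _ in scores.most_common(k)]

print(associated("retrieval"))  # "term" co-occurs most often here
```

Raw frequency is the crudest possible association measure; the projects discussed above experimented with various normalized statistics and with different clustering bases built on such counts.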
Unfortunately these tests were vitiated by even greater methodological
defects than those associated with manual indexing studies. Dale and Dale,
for example, used only 4 requests, and the Smart tests used very small
document sets, sometimes containing fewer than 100 documents. Dennis' studies are an
honourable exception since by the standards of the day they were on an
enormous scale, particularly as far as the document sample was concerned.
However the request set was typically small, only 6 in one experiment.
Dennis' test was in many ways typical of the period in mixing valiant
attempts at control in some directions with serious failures in others, to
produce a rich but unevenly cooked whole.
The major non-evaluation tests included, for example, Damerau's of text
extraction using 7 articles23, Stone and Rubinoff's of the collection
vocabulary28, Borko's extended study of classes obtained by factor analysis24,
and the A. D. Little study of term association techniques for a very large
NASA document collection34. Both Damerau and Borko judged their
automatic indexing results by comparisons with manually selected or grouped
words. The A. D. Little study in fact included a crude performance test for
a single search, but evaluation was chiefly by inspection of the
statistical association products.
A few interesting studies, for example by the Smart Project30 and by
Melton26, were concerned with non-statistical automatic text analysis
specifically designed to identify syntactic relations between words. However
these tended to show similar results to those obtained with manual syntax,
and the work could in any case not be carried far because of the unresolved
general problems of linguistic analysis.
The main emphasis in automatic indexing work was indeed on statistical
approaches, but even here the retrieval system testing done was much less
substantial than that done on manual indexing. This is not wholly surprising,
since the methods had to be worked out before they could be tested. However
the many difficulties encountered damped the enthusiasm of the early 1960s,
particularly since the problems of devising and validating statistical methods
were compounded by the problems of conducting information retrieval tests
of any kind, which were increasingly recognized by research workers. Dennis'
project was at least as discouraging in showing absolutely poor performance
for vast work as it was encouraging in showing that something could be done, and
only a few projects like those of Sparck Jones and Salton were involved in
serious statistical indexing and evaluation testing by 1968.