Actual evaluation tests included Dale and Dale's21, designed to examine alternative clustering bases for establishing term associations, and similar ones conducted by Smart workers29-32, especially Lesk's40, at A. D. Little33,34, by Tague25, and by Dennis27. The Smart workers and Williams and Perriens41 also investigated weighting. The Smart tests and Tague's included comparisons between automatic and manual indexing. These tests were too few and too heterogeneous for systematic comparison under headings to be worthwhile. It is sufficient to note that, overall, they tended to show rather little difference in performance for the various statistical techniques studied. In comparisons with manual indexing, generally using simple extracted terms but sometimes, as in some Smart tests, thesaurus terms, the general conclusion was that performance is roughly comparable.

Unfortunately these tests were vitiated by even greater methodological defects than those associated with the manual indexing studies. Dale and Dale, for example, used only 4 requests, and the Smart tests very small document sets, sometimes containing fewer than 100 documents. Dennis' studies are an honourable exception since by the standards of the day they were on an enormous scale, particularly as far as the document sample was concerned. However the request set was typically small, only 6 in one experiment. Dennis' test was in many ways typical of the period in mixing valiant attempts at control in some directions with serious failures in others, to produce a rich but unevenly cooked whole.

The major non-evaluation tests included, for example, Damerau's of text extraction using 7 articles23, Stone and Rubinoff's of the collection vocabulary28, Borko's extended study of classes obtained by factor analysis24, and the A. D. Little study of term association techniques for a very large NASA document collection34. Both Damerau and Borko judged their automatic indexing results by comparison with manually selected or grouped words. The A. D. Little study in fact included a crude performance test for a single search, but evaluation was chiefly by simple inspection of the statistical association products.

A few interesting studies, for example by the Smart Project30 and by Melton26, were concerned with non-statistical automatic text analysis specifically designed to identify syntactic relations between words. However these tended to show results similar to those obtained with manual syntax, and the work could in any case not be carried far because of the unresolved general problems of linguistic analysis. The main emphasis in automatic indexing work was indeed on statistical approaches, but even here the retrieval system testing done was much less substantial than that done on manual indexing. This is not wholly surprising, since the methods had to be worked out before they could be tested.
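By way of illustration only, the following minimal sketch in Python shows the sort of calculation the term-association and weighting experiments just described involved: counting term co-occurrences across documents, scoring pairs with an association coefficient, and assigning a collection-frequency weight to each term. The toy collection, the Tanimoto (Jaccard) coefficient, and the threshold are illustrative assumptions, not a reconstruction of any particular test from the period.

```python
import math
from itertools import combinations

# Toy document-term incidence data (hypothetical, for illustration only).
documents = [
    {"retrieval", "index", "term"},
    {"retrieval", "term", "weight"},
    {"index", "classification", "term"},
    {"weight", "frequency", "term"},
]

# Document frequency: number of documents each term occurs in.
df = {}
for doc in documents:
    for term in doc:
        df[term] = df.get(term, 0) + 1

# Co-occurrence counts for each unordered term pair.
cooc = {}
for doc in documents:
    for a, b in combinations(sorted(doc), 2):
        cooc[(a, b)] = cooc.get((a, b), 0) + 1

def association(a, b):
    """Tanimoto association strength: |A and B| / |A or B|."""
    together = cooc.get((min(a, b), max(a, b)), 0)
    return together / (df[a] + df[b] - together)

# Pairs above a threshold would be grouped as statistically associated,
# e.g. to expand a request with related terms (threshold is an assumption).
for (a, b) in cooc:
    strength = association(a, b)
    if strength >= 0.5:
        print(f"{a} ~ {b}: {strength:.2f}")

# A simple collection-frequency weight of the kind the weighting
# experiments examined: rarer terms receive higher weights.
N = len(documents)
weights = {t: math.log(N / n) for t, n in df.items()}
print(weights)
```

On a toy collection like this, ubiquitous terms receive zero weight and only pairs that co-occur in most of their documents pass the association threshold; scaled to a real collection, the same counts are what made the NASA-sized A. D. Little computation so demanding for the hardware of the day.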
However the many difficulties encountered damped the enthusiasm of the early 1960s, particularly since the problems of devising and validating statistical methods were compounded by the problems of conducting information retrieval tests of any kind, which were increasingly recognized by research workers. Dennis' project was at least as discouraging in showing absolutely poor performance for vast effort as it was encouraging in showing that something could be done, and only a few projects, like those of Sparck Jones and Salton, were involved in serious statistical indexing and evaluation testing by 1968.