IRE Information Retrieval Experiment Retrieval system tests 1958-1978 chapter Karen Sparck Jones Butterworth & Company Karen Sparck Jones All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying and recording, without the written permission of the copyright holder, application for which should be addressed to the Publishers. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.
formulation breadth as a secondary variable. In general the language tests were concerned with post-coordinate systems, though it has to be recognized that many thesauri include compound terms and so allow a kind of hybrid pre- and post-coordinate indexing. Keen's ISILT test included pre-coordinate languages, and the printed subject indexes of EPSILON embody pre-coordination. The most ambitious test of a thorough pre-coordinate system is represented by Yates-Mercer's non-comparative evaluation of Farradane's relational indexing102. It is noteworthy that classifications figured much less largely in the tests of this decade than in those of the previous one.
The motivation for these tests was generally the straightforward one of simply comparing the languages concerned, or perhaps of evaluating natural language compared with a given controlled language. The underlying assumption tended to be either that the different languages behave much the same, or more particularly, that natural language is competitive, as in Salton's Medlars test, for example. Yates-Mercer's investigation is thus noteworthy in that it was explicitly aimed, in contrast, at justifying a very sophisticated relational approach. Of course in all these tests, as in those of the previous decade, the implicit assumption is that the indexing language used in a retrieval system is important.
In form these tests tended to follow by now standard practice, though with rather more emphasis on real user requests, with evaluation ordinarily by precision and recall or, for the larger document sets, relative recall. Aitchison et al.'s and Keen's tests used exhaustive recall, but the majority of the tests were restricted to recall relative to collection subsets. The collections used tended, as before, to be rather small: only Barker et al. used more than 100 requests. However they, Miller, Olive et al. and Cleverdon87 all used large document sets represented by regular service files. The numbers are not always given, but Miller, for example, searched some 210 000 documents.
With respect to the test results, again considering broadly comparable tests in terms of both objective and conduct, the specific findings as before show very different values for precision and recall, again not surprisingly for relative recall. Thus individual project ranges for precision were from 12 to 15 per cent for Miller (my calculation from his thesis68), from 27 to 51 per cent for Cleverdon (calculated by extrapolation101), from 46.3 to 89.9 per cent for Klingbiel and Rinker, and from 38.8 to 66.0 per cent for Barker et al.; for relative recall from 30-52 per cent for Cleverdon, 51-64 per cent for Miller, 49.2-73.3 per cent for Klingbiel and Rinker, to 56.4-95.7 per cent for Barker et al., with absolute recall ranging from approximately 4-28 per cent for Aitchison et al. to 57.9-92.3 per cent for Keen. For this group as a whole precision ranged from 12 per cent (Miller) to 66.0 per cent (Barker et al.), and absolute recall from about 3 per cent (Aitchison) to 100 per cent (Keen). Other measures, like the numbers of non-relevant documents retrieved used by Keen, ranged from medians of 8.6 to 24.4. These are figures for simple sets of retrieved documents.
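The measures quoted above can be illustrated with a minimal sketch. It assumes the conventional set-based definitions of precision and recall, and it takes relative recall to mean recall computed only over an assessed subset of the collection; the individual tests discussed here may have defined the subset differently. The function names and the sample document identifiers are hypothetical and chosen purely for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


def relative_recall(retrieved, relevant, assessed_subset):
    """Recall restricted to an assessed subset of the collection.

    Assumption: relevant documents outside the subset are unknown,
    so both numerator and denominator are limited to the subset.
    """
    subset = set(assessed_subset)
    return recall(set(retrieved) & subset, set(relevant) & subset)


# Hypothetical request: five documents retrieved, four relevant overall,
# but only documents 1 to 6 were assessed for relevance.
retrieved = [1, 2, 3, 4, 5]
relevant = [2, 4, 6, 8]
assessed = [1, 2, 3, 4, 5, 6]

p = precision(retrieved, relevant)                   # 2 of 5 retrieved are relevant
r = recall(retrieved, relevant)                      # 2 of 4 relevant are retrieved
rr = relative_recall(retrieved, relevant, assessed)  # 2 of 3 known relevant
```

In this sketch relative recall exceeds absolute recall because the relevant document lying outside the assessed subset is not counted, which is one reason why figures for the two measures are not directly comparable.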
A significant feature of the work of this period was the use of ranked output, for which performance representations may be obtained, with the document cutoff methods used by Aitchison et al., for example, or the recall cutoff techniques used by the Smart Project. Graph comparison presents many problems; it may therefore simply be noted that, for the same calculation methods, large and presumably significant differences between graphs appear in Aitchison et al., for instance. Thus in