IRE Information Retrieval Experiment Gedanken experimentation: An alternative to traditional system testing? chapter William S. Cooper Butterworth & Company Karen Sparck Jones

experimental one, gedanken experimentation is an approach which offers considerable hope of supplementing, and perhaps in many cases rendering less necessary, classical retrieval testing.

11.1 Theory and experiment in information retrieval

If one were pressed to describe the central 'theory' underlying document retrieval, it would be hard to do much more than list the obvious conceptual elements of the retrieval situation. A typical list would note that there must be a collection of documents or records of some kind; a population of potential searchers; that to provide them with search assistance it seems necessary to isolate certain search properties of the documents (the 'descriptors' or 'index terms') and of the searchers' information needs (usually specified in the form of 'requests' or 'queries'); that rules for matching information need properties against document properties (the 'match function' or 'retrieval strategy') are also needed; and so forth. Although some might be willing to dignify such an account with the name 'theory', it is really not so much a theory of retrieval as a review of the problem setting with suggested terminology for discussing it.
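The conceptual elements just listed can be made concrete with a small sketch. The following Python fragment is purely illustrative and is not drawn from the chapter: the names `Document`, `Request`, `match_score` and `retrieve` are hypothetical, and the match function shown (a simple count of shared terms) is only one assumed choice among many possible retrieval strategies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """A member of the collection, described by its index terms."""
    doc_id: str
    descriptors: frozenset  # the 'descriptors' or 'index terms'


@dataclass(frozen=True)
class Request:
    """A searcher's information need, expressed as a set of terms."""
    terms: frozenset


def match_score(request: Request, document: Document) -> int:
    """Assumed match function: the number of terms shared by
    the request and the document's descriptors."""
    return len(request.terms & document.descriptors)


def retrieve(request: Request, collection: list) -> list:
    """Return the ids of documents sharing at least one term with the
    request, best match first (ties broken by document id)."""
    scored = [(match_score(request, doc), doc) for doc in collection]
    hits = [(score, doc) for score, doc in scored if score > 0]
    hits.sort(key=lambda pair: (-pair[0], pair[1].doc_id))
    return [doc.doc_id for _, doc in hits]


# A toy collection and request, invented for illustration only.
collection = [
    Document("d1", frozenset({"indexing", "syntax"})),
    Document("d2", frozenset({"boolean", "queries"})),
    Document("d3", frozenset({"indexing", "evaluation", "queries"})),
]
request = Request(frozenset({"indexing", "queries"}))
result = retrieve(request, collection)
```

The sketch does no more than restate the problem setting in executable form; nothing in it says which match function ought to be preferred.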
Occasionally a powerful bit of real theory might surface, as for instance the theory of syntax in a scheme for automatic indexing, or Boolean logic in the specification of certain request languages, but these have to do with special kinds of retrieval systems or their components and do not constitute an overall theory of retrieval. In fact, in the search for a general theory it is hard to do much better than to give some elaboration of the vague rule that a system should retrieve for the user those documents most likely to satisfy him. As scientific theories go this truism is not very impressive, but it is the only wisp of general theory we have. What was said of a recent political candidate can be said of document retrieval theory: Deep down inside it's shallow.

Perhaps partly in recognition of this paucity of theory, many researchers have turned to experimentation, and especially laboratory experimentation. As might be expected, the classical experimental approach has been fairly theory-independent, consisting essentially in the trying out of various competing retrieval schemes (including indexing methods, etc.) to see which seem to work best. The methodology involved has been ably documented in other chapters of this book, and so need not be reviewed here except to note that the difficulties to be met in drawing useful conclusions from a retrieval experiment of classical design have turned out to be much more numerous and serious than had been expected.
There are sampling and other statistical difficulties; difficulties in generalizing results obtained in just one or a few test collections; difficulties in generalizing the needs of the test user population or, in the absence of a real user population, difficulties in assuring the realism of manufactured requests; difficulties arising from the variability and sensitivity to test conditions of the judgements of document relevance or usefulness; difficulties in extrapolating results to real situations where something about the system or the environment is bound to be different; and difficulties arising from the interaction of various available features of the retrieval rules under test which, if at all numerous, cannot as a practical