Information Retrieval Experiment
Edited by Karen Sparck Jones
An experiment: search strategy variations in SDI profiles
Lynn Evans
Butterworth & Company
All rights reserved. No part of this publication may be reproduced
or transmitted in any form or by any means, including photocopying
and recording, without the written permission of the copyright holder,
application for which should be addressed to the Publishers. Such
written permission must also be obtained before any part of this
publication is stored in a retrieval system of any nature.
automatic indexing it is now being seen more clearly how interdependent are
indexing and searching methods.
At various points in the above description some shortcomings of the
original experiment have been mentioned. It may be useful to conclude by
gathering together and discussing these defects and also those questions
which were raised but remained unresolved. It is hoped that some activities
were performed adequately, but inevitably these are of less interest and will
only be mentioned briefly.
Those parts of the investigation which are considered to have been sound
include: a very adequate document collection; a meaningful range of search
strategies; a realistic profile compilation method involving standard tasks
which allowed an accurate measure of the effort required from the
information scientist on the different search strategies; a valid procedure for
collecting relevance assessments; and the recruitment of the user group and
the mechanics of the experiment in general.
Less satisfactory areas include: the rather low number of queries; retrieval
performance evaluation by the boolean comparison method; the absence of
automatic term-weighting; the lightweight nature of the cost data; and the
significance of the experimental results.
Concerning the number of queries, it is now considered (although nowhere
proved) that perhaps twice the number of queries would have been more
convincing, or at least a number large enough that the results of a few
individual queries do not obtrude on the overall results. In our experiment
this effect was exemplified by the differences observed when calculating by
the two averaging methods, numbers and ratios. With a greater number of
queries it would also have been possible to ignore those queries for which
there were too few or, less importantly, too many relevant items in the
collection. It is not clear what the implications of such a practice are but
certainly the results would thereby be more reproducible. As has already
been mentioned too many recall/precision ratios of the order 0/1, 1/1, etc.,
are not really acceptable. The problem could have been eased indirectly if a
more drastic approach had been taken originally with some of the user
interest statements. Those that clearly comprised more than one question
could have been treated separately. This would have resulted in `cleaner'
profiles of which fewer were overlong, some profile performances would
probably have been subject to less extraneous influences, and the number of
queries would have been larger. Although a token number of the user
statements were in fact split up more could have been and the experiment
would have been better for it. At the time the view taken was that as little as
possible should be done to change the conditions from that of `real life' and,
since these were statements very like those received from users of an
operational system, the less tampering the better. This view is now deemed
to have been misguided: to have done what is now suggested would not have
affected the validity of the test in any way.
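The sensitivity of the overall results to the choice of averaging method can be sketched as follows. The figures below are purely illustrative, not the experiment's own data: averaging "by ratios" takes the mean of the per-query precision ratios, so a query with very few relevant or retrieved items (a 0/1 or 1/1 ratio) carries the same weight as any other; averaging "by numbers" pools the counts first, so such queries carry little weight.

```python
# Illustrative sketch (hypothetical per-query figures, not the
# experiment's data): precision averaged "by ratios" versus
# "by numbers" over a small query set.

queries = [
    # (relevant retrieved, total retrieved) for each query
    (8, 20),
    (5, 25),
    (1, 1),   # a 1/1 query: its ratio of 1.0 dominates the by-ratios mean
    (0, 1),   # a 0/1 query: its ratio of 0.0 likewise distorts it
]

# Averaging by ratios: each query contributes equally,
# however few documents are involved.
by_ratios = sum(r / n for r, n in queries) / len(queries)

# Averaging by numbers: pool the counts first, so queries with
# few documents have little influence on the result.
by_numbers = sum(r for r, _ in queries) / sum(n for _, n in queries)

print(f"by ratios:  {by_ratios:.3f}")   # 0.400
print(f"by numbers: {by_numbers:.3f}")  # 0.298
```

With few queries the two figures diverge noticeably; with a larger query set the extreme ratios would be diluted and the two averages would move closer together, which is the effect referred to above.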
The most disappointing outcome of the whole experiment was the failure
to develop an acceptable method for comparing an optimum boolean strategy
with any strategy producing a ranked output. A few simple examples quickly
show the inappropriateness of using the boolean output itself as the basis for
comparison. Very little can be offered in the way of a solution even now and