methods for finding the relevant documents could have been used. In the first method, full relevance judgments could have been made on over one million documents for each topic, resulting in over 100 million judgments. This was clearly impossible. As a second approach, a random sample of the documents could have been taken, with relevance judgments done on that sample only. The problem with this approach is that a random sample large enough to find on the order of 200 relevant documents per topic is a very large random sample, and is likely to result in insufficient relevance judgments. The third method, the one used in TREC, was to make relevance judgments on the sample of documents selected by the various participating systems. This method is known as the pooling method, and has been used successfully in creating other collections [Sparck Jones & van Rijsbergen 1975]. The sample was constructed by taking the top 100 documents retrieved by each system for a given topic and merging them into a pool for relevance assessment. This is a valid sampling method since all the systems used ranked retrieval methods, with those documents most likely to be relevant returned first.

Pooling proved to be an effective method. There was little overlap among the 31 systems in their retrieved documents, although considerably more overlap than in TREC-1.

Table 2. Overlap of Submitted Results

                                                 TREC-2             TREC-1
                                              Max    Actual      Max     Actual
  Unique Documents per Topic
  (Adhoc, 40 runs, 23 groups)                 4000   1106.0      3300    1278.86
  Unique Documents per Topic
  (Routing, 40 runs, 24 groups)               4000   1465.6      2200    1066.86

Table 2 shows the overlap statistics. The first overlap statistics are for the adhoc topics (test topics against training documents, disks 1 and 2), and the second statistics are for the routing topics (training topics against test documents, disk 3 only). For example, out of a maximum of 4000 possible unique documents (40 runs times 100 documents), over one-fourth of the documents were actually unique. This means that the different systems were finding different documents as likely relevant documents for a topic. Whereas this might be expected (and indeed has been shown to occur [Katzer et al. 1982]) from widely differing systems, these overlaps were often between two runs for the same system. One reason for the lack of overlap is the very large number of documents that contain many of the same terms as the relevant documents, but the major reason is the very different sets of terms in the constructed queries. This lack of overlap should improve the coverage of the relevance set, and verifies the use of the pooling methodology to produce the sample.

The merged list of results was then shown to the human assessors. Each topic was judged by a single assessor to ensure the best consistency of judgment. Varying numbers of documents were judged relevant to the topics. For the TREC-2 adhoc topics (topics 101-150), the median number of relevant documents per topic is 201, down from 277 for topics 51-100 (as used for adhoc topics in TREC-1). Only 11 topics have more than 300 relevant documents, with only 2 topics having more than 500 relevant documents. These topics were deliberately made narrower than topics 51-100 because of a concern that topics with more than 300 relevant documents are likely to have incomplete relevance assessments.
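To make the pool construction and the overlap figures in Table 2 concrete, the sketch below shows one way the computation could be implemented. This is not the software actually used in TREC; the function names, the run_results structure, and the pool depth constant are illustrative assumptions, with the depth of 100 and the 40-run maximum taken from the description above.

    # Hedged sketch of the pooling procedure, for one topic.
    # run_results: dict mapping a run identifier to that run's ranked list of
    # document identifiers, most likely relevant first (an assumed structure).
    POOL_DEPTH = 100  # top documents taken from each run, as described above

    def build_pool(run_results):
        """Merge the top POOL_DEPTH documents of every run into one pool."""
        pool = set()
        for ranked_docs in run_results.values():
            pool.update(ranked_docs[:POOL_DEPTH])  # duplicates collapse in the set
        return pool

    def overlap_statistics(run_results):
        """Return (maximum possible unique documents, actual unique documents)."""
        max_unique = POOL_DEPTH * len(run_results)  # e.g. 40 runs x 100 docs = 4000
        actual_unique = len(build_pool(run_results))
        return max_unique, actual_unique

Under these assumptions, the pool returned by build_pool is what a single assessor would judge for the topic, and actual_unique, averaged over topics, corresponds to the "Actual" column of Table 2: the further it falls below max_unique, the more the runs overlap.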
4. Evaluation

An important element of TREC was to provide a common evaluation forum. Standard recall/precision and recall/fallout figures were calculated for each TREC system and these are presented in Appendix A. A chart with additional data about each system is shown in Appendix B. This chart consolidates information provided by the systems that describes features and system timing, and allows some primitive comparison of the amount of effort needed to produce the results.

4.1 Definition of Recall/Precision and Recall/Fallout Curves

Figure 2 shows typical recall/precision curves. The x axis plots the recall values at fixed levels of recall, where

    Recall = (number of relevant items retrieved) / (total number of relevant items in collection)

The y axis plots the average precision values at those given recall values, where precision is calculated by

    Precision = (number of relevant items retrieved) / (total number of items retrieved)

These curves represent averages over the 50 topics. The averaging method was developed many years ago [Salton & McGill 1983] and is well accepted by the information retrieval community. The curves show system performance across the full range of retrieval, i.e., at the early stage of retrieval where the highly-ranked documents give
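As an illustration of these definitions, the sketch below computes recall and precision down a single ranked list, interpolates precision at a set of fixed recall levels, and averages the resulting curves over topics. It is not the evaluation program used to produce Appendix A; the eleven recall levels, the interpolation rule, and all names are assumptions that follow common practice rather than details given in the text above.

    # Hedged sketch of a recall/precision curve calculation.
    RECALL_LEVELS = [i / 10.0 for i in range(11)]  # assumed fixed levels 0.0 ... 1.0

    def precision_at_recall_levels(ranked_docs, relevant_docs):
        """Interpolated precision at each fixed recall level for one topic."""
        total_relevant = len(relevant_docs)
        if total_relevant == 0:
            return [0.0] * len(RECALL_LEVELS)
        points = []  # (recall, precision) observed at each relevant document found
        found = 0
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant_docs:
                found += 1
                points.append((found / total_relevant, found / rank))
        # Interpolation: precision at level r is the best precision at any recall >= r.
        return [max((p for r, p in points if r >= level), default=0.0)
                for level in RECALL_LEVELS]

    def average_curve(per_topic_results):
        """Average the per-topic curves, e.g. over the 50 TREC-2 topics."""
        curves = [precision_at_recall_levels(docs, rels)
                  for docs, rels in per_topic_results]
        return [sum(values) / len(curves) for values in zip(*curves)]

Plotting the averaged values against RECALL_LEVELS gives a curve of the kind shown in Figure 2, under the stated assumptions about interpolation and averaging.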