methods for finding the relevant documents could have been used. In the first method, full relevance judgments could have been made on over one million documents for each topic, resulting in over 100 million judgments. This was clearly impossible. As a second approach, a random sample of the documents could have been taken, with relevance judgments done on that sample only. The problem with this approach is that a random sample large enough to find on the order of 200 relevant documents per topic is a very large random sample, and is likely to result in insufficient relevance judgments. The third method, the one used in TREC, was to make relevance judgments on the sample of documents selected by the various participating systems. This method is known as the pooling method, and has been used successfully in creating other collections [Sparck Jones & van Rijsbergen 1975]. The sample was constructed by taking the top 100 documents retrieved by each system for a given topic and merging them into a pool for relevance assessment. This is a valid sampling method since all the systems used ranked retrieval methods, with those documents most likely to be relevant returned first.

Pooling proved to be an effective method. There was little overlap among the 31 systems in their retrieved documents, although considerably more overlap than in TREC-1.

Table 2. Overlap of Submitted Results

                                                 TREC-2             TREC-1
                                              Max    Actual      Max     Actual
  Unique Documents per Topic
  (Adhoc, 40 runs, 23 groups)                 4000   1106.0      3300    1278.86
  Unique Documents per Topic
  (Routing, 40 runs, 24 groups)               4000   1465.6      2200    1066.86

Table 2 shows the overlap statistics. The first overlap statistics are for the adhoc topics (test topics against training documents, disks 1 and 2), and the second statistics are for the routing topics (training topics against test documents, disk 3 only). For example, out of a maximum of 4000 possible unique documents (40 runs times 100 documents), over one-fourth of the documents were actually unique. This means that the different systems were finding different documents as likely relevant documents for a topic. Whereas this might be expected (and indeed has been shown to occur [Katzer et al. 1982]) from widely differing systems, these overlaps were often between two runs for the same system. One reason for the lack of overlap is the very large number of documents that contain many of the same terms as the relevant documents, but the major reason is the very different sets of terms in the constructed queries. This lack of overlap should improve the coverage of the relevance set, and verifies the use of the pooling methodology to produce the sample.

The merged list of results was then shown to the human assessors. Each topic was judged by a single assessor to ensure the best consistency of judgment. Varying numbers of documents were judged relevant to the topics. For the TREC-2 adhoc topics (topics 101-150), the median number of relevant documents per topic is 201, down from 277 for topics 51-100 (as used for adhoc topics in TREC-1). Only 11 topics have more than 300 relevant documents, with only 2 topics having more than 500 relevant documents. These topics were deliberately made narrower than topics 51-100 because of a concern that topics with more than 300 relevant documents are likely to have incomplete relevance assessments.
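To make the pool construction and the overlap figures in Table 2 concrete, the sketch below shows one way the computation could be implemented. This is not the software actually used in TREC; the function names, the run_results structure, and the pool depth constant are illustrative assumptions, with the depth of 100 and the 40-run maximum taken from the description above.

    # Hedged sketch of the pooling procedure, for one topic.
    # run_results: dict mapping a run identifier to that run's ranked list of
    # document identifiers, most likely relevant first (an assumed structure).
    POOL_DEPTH = 100  # top documents taken from each run, as described above

    def build_pool(run_results):
        """Merge the top POOL_DEPTH documents of every run into one pool."""
        pool = set()
        for ranked_docs in run_results.values():
            pool.update(ranked_docs[:POOL_DEPTH])  # duplicates collapse in the set
        return pool

    def overlap_statistics(run_results):
        """Return (maximum possible unique documents, actual unique documents)."""
        max_unique = POOL_DEPTH * len(run_results)  # e.g. 40 runs x 100 docs = 4000
        actual_unique = len(build_pool(run_results))
        return max_unique, actual_unique

Under these assumptions, the pool returned by build_pool is what a single assessor would judge for the topic, and actual_unique, averaged over topics, corresponds to the "Actual" column of Table 2: the further it falls below max_unique, the more the runs overlap.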
4. Evaluation

An important element of TREC was to provide a common evaluation forum. Standard recall/precision and recall/fallout figures were calculated for each TREC system and these are presented in Appendix A. A chart with additional data about each system is shown in Appendix B. This chart consolidates information provided by the systems that describes features and system timing, and allows some primitive comparison of the amount of effort needed to produce the results.

4.1 Definition of Recall/Precision and Recall/Fallout Curves

Figure 2 shows typical recall/precision curves. The x axis plots the recall values at fixed levels of recall, where

    Recall = (number of relevant items retrieved) / (total number of relevant items in collection)

The y axis plots the average precision values at those given recall values, where precision is calculated by

    Precision = (number of relevant items retrieved) / (total number of items retrieved)

These curves represent averages over the 50 topics. The averaging method was developed many years ago [Salton & McGill 1983] and is well accepted by the information retrieval community. The curves show system performance across the full range of retrieval, i.e., at the early stage of retrieval where the highly-ranked documents give
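As an illustration of these definitions, the sketch below computes recall and precision down a single ranked list, interpolates precision at a set of fixed recall levels, and averages the resulting curves over topics. It is not the evaluation program used to produce Appendix A; the eleven recall levels, the interpolation rule, and all names are assumptions that follow common practice rather than details given in the text above.

    # Hedged sketch of a recall/precision curve calculation.
    RECALL_LEVELS = [i / 10.0 for i in range(11)]  # assumed fixed levels 0.0 ... 1.0

    def precision_at_recall_levels(ranked_docs, relevant_docs):
        """Interpolated precision at each fixed recall level for one topic."""
        total_relevant = len(relevant_docs)
        if total_relevant == 0:
            return [0.0] * len(RECALL_LEVELS)
        points = []  # (recall, precision) observed at each relevant document found
        found = 0
        for rank, doc in enumerate(ranked_docs, start=1):
            if doc in relevant_docs:
                found += 1
                points.append((found / total_relevant, found / rank))
        # Interpolation: precision at level r is the best precision at any recall >= r.
        return [max((p for r, p in points if r >= level), default=0.0)
                for level in RECALL_LEVELS]

    def average_curve(per_topic_results):
        """Average the per-topic curves, e.g. over the 50 TREC-2 topics."""
        curves = [precision_at_recall_levels(docs, rels)
                  for docs, rels in per_topic_results]
        return [sum(values) / len(curves) for values in zip(*curves)]

Plotting the averaged values against RECALL_LEVELS gives a curve of the kind shown in Figure 2, under the stated assumptions about interpolation and averaging.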