"hit" versus the probability of false alarm, or ale probability of a "false drop". The x axis plots the probability of false alarm, calculated as follows Probability of false alarm = number of nonrelevant items retrieved total number of nonrelevant items in collection The y axis plots the probability of detection, calculated as Probability of detection = number of relevant items retrieved total number of relevant items in collection Note that the probability of detection is the same as recall, and the probability of false alaIn is the same as fal- lout, an older measure in information retrieval (Salton & McGill 1983). These measures are for a single topic, but averages can be computed similarly to the recall-level averages by using probability of detection at fixed false alarm rates. The tables in Appendix A show both this average ROC curve and the same curve plotted on probability scales (Swets 1969). 5. Preliminary Results 5.1 Introduction The results of the ThEC-l conference should be viewed only as a preliminary baseline for what can be expected from systems working with large test collections. There are several reasons for this. First, the dead- lines for results were very tight, and most groups had minimal time for experiments. As discussed earlier, the huge scale-up in the size of the document collection required major work from all groups in rebuilding their sys- tems. Much of this work was simply a system engineering task: finding reasonable data structures to use, get- ting indexing routines to be efficient enough to finish indexing the data, finding enough storage to handle the large inverted files and other structures, etc. The second reason these results are preliminary is that groups were working blindly as to what constitutes a relevant document. There were no reliable relevance judgments for training, and the use of the long topics was completely new. This means that results were heavily influenced by an almost random selection of what parts of the topic to use. Groups also had to make often primitive adjustments to basic algorithms in order to get results, with litfie evidence of how well these adjustments were working. The large scale of the whole evalua- tion precluded any tuning without some relevance judgments, and the relevance judgments that were provided were generally sparse and sometimes inaccurate. These problems particularly affected those systems that needed training for routing. Many of the papers in the proceedings show some new results from work done in the short amount of time between the conference and the due date of the papers (less than 2 months). Some of the improvements are very significant, and the improvements seen in the ~PSThR results (where the results are a second-try at this task) are large. It can be expected that the results seen at the second TREC conference will be much better1 and also more indicative of how well a method works. Because these results are preliminary, they should he compared very carefully. Some very broad conclusions can be drawn, but no methods should be conclusively judged inferior or superior at this point. 5.2 Adhoc Results The adhoc evaluation used new topics (51-100) against the two disks of documents (Dl + D2). There were 33 sets of results for adhoc evaluation in TREC, with 20 of them based on runs for the full data set. Of these, 13 used automatic construction of queries, 6 used manual construction, and 1 used feedback. 
Figure 5 shows the recall/precision curves for the three TREC-1 runs with the highest 11-point averages using automatic construction of queries. These curves were all based on the use of the Cornell SMART system, but with important variations. The "fuhrp1" results came from using the training data to find parameter weights (see the Fuhr & Buckley paper), the "crnlp1" results came from doing local and global term weighting without training data (see the Buckley, Salton & Allan paper), and the "siems1" results came from using term expansion with terms from "Wordnet" (see the Voorhees & Hou paper).
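The 11-point averages used to rank these runs can be illustrated with a minimal sketch, assuming the usual interpolation convention (precision at each of the eleven recall levels 0.0, 0.1, ..., 1.0, with interpolated precision taken as the maximum precision at or beyond that recall level). The function and variable names below are hypothetical and not part of any TREC evaluation software.

```python
def eleven_point_average(relevance_flags, total_relevant):
    """11-point interpolated average precision for a single topic.

    relevance_flags -- list of booleans, one per retrieved document in
                       ranked order, True where the document is relevant
    total_relevant  -- number of relevant documents for the topic
    """
    if total_relevant == 0:
        return 0.0
    # Recall and precision after each retrieved document.
    points, relevant_so_far = [], 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_so_far += 1
        points.append((relevant_so_far / total_relevant,  # recall
                       relevant_so_far / rank))           # precision
    # Interpolated precision at recall 0.0, 0.1, ..., 1.0: the maximum
    # precision at any recall point at or above the given level.
    precisions = []
    for level in (i / 10 for i in range(11)):
        at_or_above = [p for r, p in points if r >= level]
        precisions.append(max(at_or_above) if at_or_above else 0.0)
    return sum(precisions) / len(precisions)
```

A run's reported 11-point average would then be the mean of this per-topic value over all topics in the evaluation.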