NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)
Retrieval Experiments with a Large Collection using PIRCS
K. Kwok, L. Papadopoulos, K. Kwan
National Institute of Standards and Technology
Donna K. Harman

Best operational versus achieved recall/precision at five document cut-off levels (the 30-document cut-off discussed below is the third column):

Ad hoc
   Best Oper. Precision:   .984   .944   .899   .689   .469
   Precision:              .624   .523   .480   .364   .281
   % of best:               63%    55%    53%    53%    60%

Routing
   Best Oper. Recall:      .083   .241   .387   .777   .915
   Recall:                 .048   .145   .228   .443   .576
   % of best:               58%    60%    59%    57%    63%

   Best Oper. Precision:   1.00   .995   .945   .781   .551
   Precision:              .608   .597   .521   .398   .294
   % of best:               61%    60%    55%    51%    53%

For ad hoc retrieval at the 30-document cut-off, for example, this means that on average we retrieve about 20% (.204) of all the relevant documents, but nearly one out of every two (.480) retrieved documents is relevant. However, because of the number of known relevant documents for each query, the best operational recall and precision at this cut-off are only .456 and .899. Our .204 recall and .480 precision therefore achieve 45% and 53% of these best values (this arithmetic is made explicit in the first sketch at the end of the section). On the whole we have achieved between 44-64% of the best operational recall and 53-63% of the best operational precision for ad hoc retrieval, and between 57-63% and 51-61% respectively for routing. These are quite respectable figures.

(d) Based on our strategy and experimental data, techniques that work for small collections also work for this large WSJ collection. For example, using the 11-pt Avg precision values, PIRCS2 performs better than PIRCS1 in both the ad hoc (0.322 vs 0.311, +3.6%) and routing (0.369 vs 0.343, +7.6%) environments. Thus, adding soft-boolean retrieval as a third retrieval method helps, but it requires forming the boolean queries manually. The advantage of PIRCS2 over PIRCS1 can also be seen by tabulating the number of queries for which PIRCS1 performs better than, about equal to, or worse than PIRCS2 on the 11-pt Avg measure, where 'equal' means values within 5%:

                AD HOC              ROUTING
                PIRCS1 vs PIRCS2    PIRCS1 vs PIRCS2
   better:           5                   3
   equal:            9                   6
   worse:           11                  16

Both PIRCS3 and PIRCS4 make use of feedback based on evaluating the first ten retrieved sub-documents of PIRCS1, but PIRCS4 employs query expansion as well. These methods are automatic. Comparing PIRCS4 with PIRCS3 in ad hoc feedback retrieval shows that query expansion is better than no expansion (0.305 vs 0.282, +8.1%). Note that only 23 queries are used in this feedback comparison because topics #66 and #73 have no relevant documents among the first ten retrieved. PIRCS3 and PIRCS4 have also been re-evaluated using the frozen-rank method, viz. the ten documents used for feedback are given the same ranks as in PIRCS1, while the other retrieved documents follow (see the second sketch at the end of the section). In addition, the two queries without relevant documents in the first ten are put back carrying their PIRCS1 results. In this way PIRCS1, PIRCS3f and PIRCS4f are directly comparable. It can be seen that feedback by itself works (0.3407 vs 0.3107, +9.7%) and that query expansion improves results further (0.3634 vs 0.3107, +17%). There are on average about 5.5 (126/23) relevant documents among the first ten retrieved.

(e) To illustrate the power of feedback, we have tabulated, for each pair of methods, the number of queries that perform better, about equal, or worse using the 11-pt Avg measure:

                PIRCS1 vs PIRCS3f    PIRCS3 vs PIRCS4f
   better:           1                    5
   equal:            5                    3
   worse:           19                   17
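The cut-off level figures quoted above reduce to simple ratios. The following is a minimal sketch of that arithmetic; the function names, the set-based document representation, and the closed-form ceiling in best_operational_at_cutoff (retrieving all known relevant documents first) are illustrative assumptions, not code from the PIRCS system or the official TREC evaluation.

```python
def recall_precision_at_cutoff(ranked_docs, relevant_docs, cutoff):
    """Recall and precision computed over the top `cutoff` retrieved documents."""
    retrieved = ranked_docs[:cutoff]
    hits = sum(1 for doc in retrieved if doc in relevant_docs)
    recall = hits / len(relevant_docs) if relevant_docs else 0.0
    precision = hits / cutoff
    return recall, precision


def best_operational_at_cutoff(num_relevant, cutoff):
    """Assumed per-query ceiling: an ideal run retrieves relevant documents first.
    Averaging these per-query ceilings gives values like the 'Best Oper.' rows."""
    hits = min(num_relevant, cutoff)
    return hits / num_relevant, hits / cutoff


def percent_of_best(achieved, best_operational):
    """Achieved value expressed as a percentage of the best operational value."""
    return 100.0 * achieved / best_operational


# Ad hoc retrieval at the 30-document cut-off (values from the text):
print(round(percent_of_best(0.204, 0.456)))   # recall:    ~45% of best
print(round(percent_of_best(0.480, 0.899)))   # precision: ~53% of best
```

The percentages in the "% of best" rows of the table are this same ratio applied column by column.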
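The frozen-rank re-evaluation described in (d) and the 11-pt Avg measure used throughout these comparisons can be sketched as follows. This is an illustrative reading of the procedure as described in the text, using hypothetical document identifiers and list-based run representations; it is not code from PIRCS.

```python
def freeze_feedback_ranks(original_run, feedback_run, n_frozen=10):
    """Frozen-rank evaluation: the first `n_frozen` documents keep their ranks
    from the original (PIRCS1) run; the remaining documents follow in the order
    produced by the feedback run, with the frozen documents skipped."""
    frozen = original_run[:n_frozen]
    frozen_set = set(frozen)
    rest = [doc for doc in feedback_run if doc not in frozen_set]
    return frozen + rest


def eleven_point_avg_precision(run, relevant):
    """11-pt Avg: interpolated precision averaged over recall 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(run, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, precision)
    # Interpolated precision at level r = max precision at any recall >= r.
    levels = [i / 10 for i in range(11)]
    return sum(max((p for r, p in points if r >= level), default=0.0)
               for level in levels) / len(levels)


# A query with no relevant documents in the first ten retrieved (topics #66 and #73
# in the text) would simply carry its original run unchanged, e.g.:
# run_3f = freeze_feedback_ranks(pircs1_run, pircs3_run) if feedback_used else pircs1_run
```

Feedback runs re-ranked this way share the same top ten documents as PIRCS1, which is why PIRCS1, PIRCS3f and PIRCS4f can be compared directly, as noted in the text.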