data for documents with relatively high predicted probabilities of relevance. In an attempt to achieve this emphasis, the regression analysis was carried out in two steps. A logistic regression was first performed on a stratified sample of about one-third of the then-available TIPSTER relevance judgements to obtain a preliminary ranking rule. This preliminary equation was then applied as a screening rule to the entire available body of judged pairs, and for each query all but the highest-ranked 500 documents were discarded. The resulting set of 50,000 judged query-document pairs (100 topics × 500 top-ranked documents per topic) served as the learning sample for the final regression equation displayed above as Eq. (3). (A sketch of this two-step procedure is given below.)

Application to Query-Specific Learning Data

To this point it has been assumed that relevance judgement data is available for a sample of queries typical of the future queries for which retrieval is to be performed, but not for those very queries themselves. We consider next the contrasting situation in which the available learning data is a set of past relevance judgements for the very queries for which new documents are to be retrieved. Such data is often available for a routing or SDI request for which relevance feedback has been collected in the past, or for a retrospective feedback search that has been under way long enough for some feedback to have been obtained.

For a query so equipped with its very own learning sample, it is possible to gather separate data about each individual term in the query. Such data reflects the term's special retrieval characteristics in the context of that query. For instance, for a query term $T_m$ we may count the number $n(T_m, R)$ of documents in the sample that contain the term and are also relevant to the query, the number $n(T_m, \bar{R})$ of documents that contain the term but are not relevant to the query, and so forth. The odds that a future document will turn out to be relevant to the query, given that it contains the term, can be estimated crudely as

$$O(R \mid T_m) \approx \frac{n(T_m, R)}{n(T_m, \bar{R})}.$$

To refine this estimate, a Bayesian trick adapted from Good (1965) is useful. One simply adds the two figures used in expressing the prior odds of relevance, i.e. $n(R)$ and $n(\bar{R})$, into the numerator and denominator respectively, with an arbitrary weight $p$, as follows:

$$O(R \mid T_m) \approx \frac{n(T_m, R) + p\,n(R)}{n(T_m, \bar{R}) + p\,n(\bar{R})}.$$

The smaller the value assigned to $p$, the more influence the sample evidence will have in moving the estimate away from the prior odds. The adjustment causes large samples to have a greater effect than small ones, as seems natural. It also forestalls certain absurdities implicit in the unadjusted formula, for instance the possibility of calculating infinite odds estimates when $n(T_m, \bar{R}) = 0$.

Suppose now that a query contains $Q$ distinct terms $T_1, \ldots, T_M, T_{M+1}, \ldots, T_Q$, numbered in such a way that the first $M$ of them are match terms that occur in the document against which the query is to be compared. The fundamental equation can be applied by taking $N$ to be $Q$ and interpreting the retrieval clues $A_1, \ldots, A_N$ as the presence or absence of particular query terms in the document.
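By way of illustration of the two-step fitting procedure described at the start of this section, the following is a minimal sketch only: the clue matrix, topic identifiers, per-query cut-off of 50, plain random subsampling in place of stratification, and the use of scikit-learn are all our stand-ins, not the paper's apparatus (the paper's clues are those of Eq. (3), its cut-off was 500, and its subsample was stratified).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical learning data: one row of clue values per judged
# query-document pair, a binary relevance judgement, and a topic id.
X = rng.normal(size=(5000, 3))            # stand-in composite clues
y = (rng.random(5000) < 0.1).astype(int)  # stand-in relevance judgements
qid = rng.integers(0, 100, size=5000)     # stand-in topic identifiers

# Step 1: preliminary ranking rule fitted on roughly one-third of the
# judged pairs (a plain random subsample here; the paper stratified).
subsample = rng.random(5000) < 1 / 3
preliminary = LogisticRegression().fit(X[subsample], y[subsample])

# Step 2: apply the preliminary rule as a screening rule over the full
# body of judged pairs, keeping only the top-ranked documents per query.
scores = preliminary.decision_function(X)
keep = np.zeros(5000, dtype=bool)
for q in np.unique(qid):
    idx = np.where(qid == q)[0]
    keep[idx[np.argsort(-scores[idx])[:50]]] = True

# The screened pairs serve as the learning sample for the final regression.
final = LogisticRegression().fit(X[keep], y[keep])
```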
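The adjusted estimator is simple to compute. Here is a minimal sketch; the function name, the weight $p = 0.1$, and the example counts are hypothetical choices of ours, not values from the paper:

```python
def adjusted_odds(n_term_rel, n_term_nonrel, n_rel, n_nonrel, p=0.1):
    """Estimate O(R | T_m) with the Bayesian adjustment of Good (1965):
    the raw counts are blended with the prior-odds counts n(R) and
    n(R-bar), each weighted by p."""
    return (n_term_rel + p * n_rel) / (n_term_nonrel + p * n_nonrel)

# A term seen once, only in a relevant document: the raw estimate 1/0
# would be infinite, but the adjusted one remains finite and stays
# close to the prior odds n(R)/n(R-bar) = 20/180.
print(adjusted_odds(1, 0, n_rel=20, n_nonrel=180))  # 3/18 ~= 0.17
```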
The pertinent data can be organized as shown in Table II (next page). Thus we are led to postulate, for query-specific learning data, a retrieval equation of the form

$$\log O(R \mid A_1, \ldots, A_Q) = c_0 + c_1 f(Q)\,[\Phi_1 + \Phi_2 - \Phi_3] \qquad (4)$$

where

$$\Phi_1 = \sum_{m=1}^{M} \log \frac{n(T_m, R) + p\,n(R)}{n(T_m, \bar{R}) + p\,n(\bar{R})},$$

$$\Phi_2 = \sum_{m=M+1}^{Q} \log \frac{n(\bar{T}_m, R) + p\,n(R)}{n(\bar{T}_m, \bar{R}) + p\,n(\bar{R})},$$

$$\Phi_3 = (Q - 1) \log \frac{n(R)}{n(\bar{R})},$$

and where $f(Q)$ is some restraining function intended, as before, to subdue the inflationary effect of the Assumption of Linked Dependence. The coefficients $c_0, c_1$ are found by a logistic regression over a sample of query-document pairs involving many queries. (A computational sketch of Eq. (4) is given at the end of this section.)

Combining Query-Specific and Query-Nonspecific Data

If a query-specific set of relevance judgements is available for the query to be processed, a larger
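Referring back to Eq. (4): the sketch below shows how the score might be computed from query-specific count tables. The count values, the weight $p$, the coefficients $c_0, c_1$, and the choice $f(Q) = 1/Q$ are hypothetical placeholders; the paper fits $c_0, c_1$ by regression and does not commit to a particular $f$ in this passage.

```python
import math

def eq4_score(match_counts, nonmatch_counts, n_rel, n_nonrel,
              p=0.1, c0=0.0, c1=1.0, f=lambda q: 1.0 / q):
    """Query-specific log-odds score in the form of Eq. (4).

    match_counts    -- (n(T_m,R), n(T_m,R-bar)) for the M match terms
    nonmatch_counts -- (n(T-bar_m,R), n(T-bar_m,R-bar)) for the Q-M
                       query terms absent from the document
    """
    def adjusted_log_odds(a, b):
        # Good (1965) adjustment: keeps every logarithm finite
        # even when a raw count is zero.
        return math.log((a + p * n_rel) / (b + p * n_nonrel))

    q = len(match_counts) + len(nonmatch_counts)
    phi1 = sum(adjusted_log_odds(a, b) for a, b in match_counts)
    phi2 = sum(adjusted_log_odds(a, b) for a, b in nonmatch_counts)
    phi3 = (q - 1) * math.log(n_rel / n_nonrel)
    return c0 + c1 * f(q) * (phi1 + phi2 - phi3)

# Hypothetical feedback sample: 20 relevant, 180 non-relevant documents;
# a query of Q = 3 terms, M = 2 of which occur in the document.
score = eq4_score(match_counts=[(8, 2), (5, 10)],
                  nonmatch_counts=[(12, 170)],
                  n_rel=20, n_nonrel=180)
print(score)
```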