NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC-2)
Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression
W. Cooper
A. Chen
F. Gey
National Institute of Standards and Technology
D. K. Harman
data for documents with relatively high predicted probabilities of relevance. In an attempt to achieve this emphasis, the regression analysis was carried out in two steps. A logistic regression was first performed on a stratified sample of about one-third of the then-available TIPSTER relevance judgements to obtain a preliminary ranking rule. This preliminary equation was then applied as a screening rule to the entire available body of judged pairs, and for each query all but the highest-ranked 500 documents were discarded. The resulting set of 50,000 judged query-document pairs (100 topics x 500 top-ranked docs per topic) served as the learning sample data for the final regression equation displayed above as Eq. (3).
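The two-step procedure above can be sketched in code. This is a minimal illustration, not the authors' implementation: the data layout (dictionaries with a `"query"` key) and the preliminary scoring function are hypothetical stand-ins for the first-stage regression fitted on the stratified subsample.

```python
# Sketch of the screening step: rank each query's judged pairs by a
# preliminary rule, keep the top k, and pool the survivors as the
# learning sample for the final regression.

def screen_top_k(judged_pairs, prelim_score, k=500):
    """For each query, keep only the k pairs ranked highest by the
    preliminary scoring rule; return the pooled learning sample."""
    by_query = {}
    for pair in judged_pairs:
        by_query.setdefault(pair["query"], []).append(pair)
    learning_sample = []
    for query, pairs in by_query.items():
        pairs.sort(key=prelim_score, reverse=True)
        learning_sample.extend(pairs[:k])
    return learning_sample
```

With 100 topics and k=500, this yields the 50,000-pair learning sample described in the text.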
Application to Query-Specific Learning Data
To this point it has been assumed that relevance
judgement data is available for a sample of queries typical
of the future queries for which retrieval is to be per-
formed, but not for those very queries themselves. We
consider next the contrasting situation in which the learn-
ing data that is available is a set of past relevance judge-
ments for the very queries for which new documents are
to be retrieved. Such data is often available for a routing
or SDI request for which relevance feedback has been col-
lected in the past, or for a retrospective feedback search
that has already been under way long enough for some
feedback to have been obtained.
For a query so equipped with its very own learning sample, it is possible to gather separate data about each individual term in the query. Such data reflects the term's special retrieval characteristics in the context of that query. For instance, for a query term $T_m$ we may count the number $n(T_m, R)$ of documents in the sample that contain the term and are also relevant to the query, the number $n(T_m, \bar{R})$ of documents that contain the term but are not relevant to the query, and so forth.
The odds that a future document will turn out to be relevant to the query, given that it contains the term, can be estimated crudely as

$$O(R \mid T_m) = \frac{n(T_m, R)}{n(T_m, \bar{R})}$$
To refine this estimate, a Bayesian trick adapted from Good (1965) is useful. One simply adds the two figures used in expressing the prior odds of relevance (i.e. $n(R)$ and $n(\bar{R})$) into the numerator and denominator respectively, with an arbitrary weight $\beta$, as follows:

$$O(R \mid T_m) = \frac{n(T_m, R) + \beta\, n(R)}{n(T_m, \bar{R}) + \beta\, n(\bar{R})}$$
The smaller the value assigned to $\beta$, the more influence the sample evidence will have in moving the estimate away from the prior odds. The adjustment causes large samples to have a greater effect than small ones, as seems natural. It also forestalls certain absurdities implicit in the unadjusted formula, for instance the possibility of calculating infinite odds estimates.
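The adjusted odds estimate can be written as a one-line function. This is a minimal sketch assuming the counts are taken from the query's own feedback sample; the parameter names are illustrative, not from the paper.

```python
# Smoothed estimate of O(R | T_m): term counts are shrunk toward the
# prior odds n(R)/n(R~) with weight beta, so the estimate is finite
# even when n(T_m, R~) is zero.

def smoothed_odds(n_term_rel, n_term_nonrel, n_rel, n_nonrel, beta=0.5):
    """(n(T_m,R) + beta*n(R)) / (n(T_m,R~) + beta*n(R~))."""
    return (n_term_rel + beta * n_rel) / (n_term_nonrel + beta * n_nonrel)
```

Note that with no term evidence at all (both term counts zero), the function returns exactly the prior odds, as the text's shrinkage argument implies.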
Suppose now that a query contains Q distinct terms $T_1, \ldots, T_M, T_{M+1}, \ldots, T_Q$, numbered in such a way that the first M of them are match terms that occur in the document against which the query is to be compared. The fundamental equation can be applied by taking N to be Q, and interpreting the retrieval clues $A_1, \ldots, A_N$ as the presence or absence of particular query terms in the document. The pertinent data can be organized as shown in Table II (next page).
Thus we are led to postulate, for query-specific learning data, a retrieval equation of the form

Equation (4):

$$\log O(R \mid A_1, \ldots, A_Q) = c_0 + c_1 f(Q) \left[ \Phi_1 + \Phi_2 - \Phi_3 \right]$$

where

$$\Phi_1 = \sum_{m=1}^{M} \log \frac{n(T_m, R) + \beta\, n(R)}{n(T_m, \bar{R}) + \beta\, n(\bar{R})}$$

$$\Phi_2 = \sum_{m=M+1}^{Q} \log \frac{n(\bar{T}_m, R) + \beta\, n(R)}{n(\bar{T}_m, \bar{R}) + \beta\, n(\bar{R})}$$

$$\Phi_3 = (Q - 1) \log \frac{n(R)}{n(\bar{R})}$$

and where f(Q) is some restraining function intended as before to subdue the inflationary effect of the Assumption of Linked Dependence. The coefficients $c_0$, $c_1$ are found by a logistic regression over a sample of query-document pairs involving many queries.
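The score of Eq. (4) can be sketched as follows. This is a hedged illustration, not the paper's implementation: `match` and `nonmatch` are assumed to hold per-term count pairs from the feedback sample, the coefficients `c0`, `c1` would in practice be fitted by logistic regression over many queries, and `f(Q) = Q ** -0.5` is only one illustrative choice of restraining function.

```python
from math import log

# Hedged sketch of the Eq. (4) log-odds score.
#   match:    [(n(T_m,R), n(T_m,R~)) for each query term present in the doc]
#   nonmatch: [(n(T~_m,R), n(T~_m,R~)) for each query term absent from it]
#   n_rel, n_nonrel: n(R), n(R~) for the query's feedback sample.

def eq4_log_odds(match, nonmatch, n_rel, n_nonrel, c0=0.0, c1=1.0, beta=0.5):
    Q = len(match) + len(nonmatch)
    phi1 = sum(log((a + beta * n_rel) / (b + beta * n_nonrel))
               for a, b in match)
    phi2 = sum(log((a + beta * n_rel) / (b + beta * n_nonrel))
               for a, b in nonmatch)
    phi3 = (Q - 1) * log(n_rel / n_nonrel)
    f_q = Q ** -0.5  # illustrative restraining function, not from the paper
    return c0 + c1 * f_q * (phi1 + phi2 - phi3)
```

Because the smoothed per-term odds enter additively in log space, raising a match term's relevant-document count raises the score, as one would expect.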
Combining Query-Specific and Query-Nonspecific Data
If a query-specific set of relevance judgements is
available for the query to be processed, a larger