Bootstrap Evaluation of Statistical Error in Measures of Information
Retrieval

On the one hand, bioinformatics generally regards the bootstrap as a simple,
dependable method for performing statistical analysis.  On the other hand,
statisticians usually regard inferences from the bootstrap as deeply
suspect, at least until proven otherwise.  Here at NCBI, the bootstrap has
often been used to establish the statistical significance of an
"improvement" to a retrieval algorithm.  A scientific program of iterative
incremental improvement of retrieval is now standard at NCBI and NLM, making
the statistical evaluation of algorithmic improvements correspondingly
important.   

I start this talk with a simple coin-tossing example to show that bootstrap
results cannot be interpreted in isolation.  In practice, they must rely on
a statistical sampling model, if only tacitly.  After reviewing some basics
of evaluating database retrieval, I offer statistical models for three
sampling processes involved in evaluating database retrieval.  These
sampling processes correspond to: (1) the records in a test set or database;
(2) the queries used to form a test set; and (3) the judges who evaluate the
test set.  I then state (in layman's terms) a theorem showing that the
bootstrap accurately estimates the statistical uncertainty inherent in
sampling the test set.
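
As a rough illustration of sampling process (2), the sketch below applies a
paired percentile bootstrap over queries to the difference in mean score
between two retrieval algorithms.  It is not the procedure used at NCBI; the
function name bootstrap_diff, its parameters, and the per-query scores are
hypothetical.  Resampling queries with replacement tacitly assumes that the
queries were drawn independently from one population, which is exactly the
kind of sampling model the bootstrap result depends on.

```python
import random


def bootstrap_diff(scores_a, scores_b, n_resamples=2000, seed=0):
    """Paired percentile bootstrap over queries (illustrative sketch).

    scores_a[i] and scores_b[i] are hypothetical per-query scores of two
    retrieval algorithms on the same query i.  Returns the observed mean
    difference and an approximate 95% percentile interval for it.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n

    # Assumed sampling model: queries are independent draws, so we
    # resample whole queries (paired differences) with replacement.
    means = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)

    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


if __name__ == "__main__":
    # Made-up scores for five queries, for demonstration only.
    a = [0.9, 0.8, 0.7, 0.95, 0.6]
    b = [0.85, 0.7, 0.75, 0.9, 0.5]
    print(bootstrap_diff(a, b))
```

An interval that excludes zero suggests a difference only under the assumed
query-sampling model; it says nothing about the uncertainty contributed by
the records or the judges.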

The sampling models force a reinterpretation of certain bootstrap findings.
For example, the unusual (I-)bootstrap used to evaluate PSI-BLAST variants
corresponds to a rather implausible sampling model.  Thus, under most
circumstances, the I-bootstrap should be deprecated.  The sampling models
also require that records be sampled into the test set or database
independently of one another.  While one can argue for independence in
non-redundant databases like RefSeq, it is rather unlikely for MEDLINE
(e.g., many authors write several papers on a subject).  Without
independence, the bootstrap likely overestimates the statistical
significance of differences in retrieval efficacy.  Perhaps most
interestingly, the analysis shows that the database ROC[n] estimates a
statistical quantity that contains the sampling frequencies of the database
records.  Because the sampling frequencies change over time in response to
scientific interest, no subtle difference between retrieval efficacies
should be considered permanent.  

------------------------------
John L. Spouge                |
NCBI, NLM, NIH                |
Building 38A, Room 6N 603     |
Lister Hill Center, NLM       |
Bethesda, Maryland 20894      |
Email: spouge@nih.gov         |
Phone: +1 (301) 402-9310      |
Fax:   +1 (301) 480-2288      |
------------------------------