Bootstrap Evaluation of Statistical Error in Measures of Information Retrieval

On one hand, bioinformatics generally regards the bootstrap as a simple, dependable method for performing statistical analysis. On the other hand, statisticians usually regard inferences from the bootstrap as deeply suspect, at least until proven otherwise. Here at NCBI, the bootstrap has often been used to establish the statistical significance of an "improvement" to a retrieval algorithm. A scientific program of iterative incremental improvement of retrieval is now standard at NCBI and NLM, making the statistical evaluation of algorithmic improvements correspondingly important.

I start this talk with a simple coin-tossing example to show that bootstrap results cannot be interpreted in isolation. In practice, they must rely on a statistical sampling model, if only tacitly. After reviewing some basics of evaluating database retrieval, I offer statistical models for three sampling processes involved in evaluating database retrieval. These sampling processes correspond to: (1) the records in a test set or database; (2) the queries used to form a test set; and (3) the judges who evaluate the test set. I then state (in layman's terms) a theorem showing that the bootstrap accurately estimates the statistical uncertainty inherent in sampling the test set.

The sampling models force a reinterpretation of certain bootstrap findings. For example, the unusual (I-)bootstrap used to evaluate PSI-BLAST variants corresponds to a rather implausible sampling model. Thus, under most circumstances, the I-bootstrap should be deprecated. The sampling models also require that records be sampled into the test set or database independently of one another. While one can argue for independence in non-redundant databases like RefSeq, it is rather unlikely for MEDLINE (e.g., many authors write several papers on a subject).
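As a rough illustration of the independence requirement, the sketch below simulates clustered records and compares two bootstrap standard errors for the mean score. Everything in it is an assumption made for illustration and is not taken from the talk: the simulated scores, the cluster structure (standing in for, say, several papers by one author), the function names, and the use of a cluster bootstrap as the comparison.

```python
import random
import statistics


def make_clustered_scores(n_clusters, per_cluster, rng):
    """Simulate scores where records in one cluster share a common effect."""
    clusters = []
    for _ in range(n_clusters):
        shared = rng.gauss(0.0, 1.0)  # effect common to the whole cluster
        clusters.append([shared + rng.gauss(0.0, 0.2) for _ in range(per_cluster)])
    return clusters


def mean(values):
    return sum(values) / len(values)


def naive_bootstrap_se(clusters, n_boot, rng):
    """Resample individual records, i.e. treat them as independent draws."""
    records = [x for cluster in clusters for x in cluster]
    means = [mean(rng.choices(records, k=len(records))) for _ in range(n_boot)]
    return statistics.stdev(means)


def cluster_bootstrap_se(clusters, n_boot, rng):
    """Resample whole clusters, keeping within-cluster dependence intact."""
    means = []
    for _ in range(n_boot):
        sample = rng.choices(clusters, k=len(clusters))
        means.append(mean([x for cluster in sample for x in cluster]))
    return statistics.stdev(means)


rng = random.Random(0)  # fixed seed so the run is repeatable
clusters = make_clustered_scores(n_clusters=50, per_cluster=10, rng=rng)
naive_se = naive_bootstrap_se(clusters, n_boot=1000, rng=rng)
cluster_se = cluster_bootstrap_se(clusters, n_boot=1000, rng=rng)
print(f"naive SE:   {naive_se:.4f}")
print(f"cluster SE: {cluster_se:.4f}")
```

In this toy setting the record-level resampling reports a noticeably smaller standard error than the cluster-level resampling, because it ignores the shared effect within each cluster.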
Without independence, the bootstrap likely overestimates the statistical significance of differences in retrieval efficacy. Perhaps most interestingly, the analysis shows that the database ROC[n] estimates a statistical quantity that contains the sampling frequencies of the database records. Because the sampling frequencies change over time in response to scientific interest, no subtle difference between retrieval efficacies should be considered permanent.

------------------------------
John L. Spouge
NCBI, NLM, NIH
Building 38A, Room 6N 603
Lister Hill Center, NLM
Bethesda, Maryland 20894
Email: spouge@nih.gov
Phone: +1 (301) 402-9310
Fax: +1 (301) 480-2288
------------------------------