NIST Special Publication 500-207: The First Text REtrieval Conference (TREC-1)

CLARIT TREC Design, Experiments, and Results

D. Evans, R. Lefferts, G. Grefenstette, S. Handerson, W. Hersh, A. Archbold

Donna K. Harman (ed.), National Institute of Standards and Technology

3 Overview of CLARIT-TREC Processing

There were three major phases of processing for the CLARIT-TREC retrieval experiments. Initially, the entire corpus, along with the topic statements, was parsed to extract candidate NPs via CLARIT NLP. In the special case of topics, the candidate NPs were manually reviewed and evaluated to produce weighted query terms. Second, the entire corpus (in noun phrase form) was passed through a quick, and somewhat rough, ranking procedure that was designed to nominate a large subset of documents for further analysis. This step is referred to as "partitioning". A "partitioning thesaurus", or list of weighted, representative terminology, was automatically created for each topic. In the final phase of processing, referred to as "querying", a "query vector" was produced for each topic. The query vector was used to retrieve (= rank) documents in the selected partition for the topic using a vector-space "similarity" metric. The details of these phases of processing are presented below, along with a discussion of different techniques used for "routing" and "ad-hoc" queries.

3.1 Design Philosophy: "Evoke" and "Discriminate"

In approaching the principal TREC task of returning 200 ranked documents for each topic, we used a two-stage processing strategy, illustrated in Figure 3. The first stage of processing was designed to identify candidate documents that seemed likely to contain information related to a topic. Of course, since the topic was represented as a set of weighted terms, this step involved scoring each document based on the set of terms.
Because this step involved scoring every document in the database against every topic, it was important to design the scoring procedure so that it was not computationally expensive. In fact, it was based on summing the value and number of "hits" between the topic's set of terms and the terms (NPs) in each document and was expected to result in an over-generated set of candidate documents. The highest-scoring documents were retained as a candidate "partition" of the database with respect to the topic.

The second stage was designed to find the subset of documents in each partition that best matched the topic. In theory, greater (= more discriminating) processing resources could be devoted to this second-stage task, as the total number of documents involved was small compared to the whole collection.

In practice, as illustrated in Figure 3, partitioning resulted in a set of 2,000 ranked documents. The top 200 documents from the partition were submitted to NIST as the CLARIT "A" set of results. Final querying or "discrimination" among the documents in each partition yielded another, more accurately ranked set of 200 ranked documents, which were submitted as the CLARIT "B" results.

3.2 Overview of the Task

As Figure 4 shows, different portions of the total TREC database were used for the "routing" and "ad-hoc" phases of the experiment. The routing task required "training" of the first fifty topics on the first set of data (represented as the darkened block in Figure 4). In the second step of processing, the partitioning and query vectors that were derived from step one were used to identify, first, 2000-document partitions in the second set of data (represented as a light block in Figure 4) and, second, the top-200 ranked documents in each partition.

The ad-hoc query task involved the whole database, but the CLARIT team actually used the first set of data for a preliminary retrieval of documents (based on partitioning).
A few (5-10) of the top 2[OCRerr]50 were chosen by quick manual inspection to supplement the query vector, and then a second automated round of partitioning over the total database was performed. The final top-200 ranked documents were ultimately derived from these second-pass, 2000-document partitions.
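
The two-stage "evoke" and "discriminate" strategy described in Section 3.1 can be illustrated with a minimal Python sketch. It is not the CLARIT implementation: the function names, the exact form of the first-stage score (summed weights of matching terms plus the number of distinct hits), and the use of cosine similarity over raw term counts as the vector-space metric are all assumptions made for illustration. Only the partition size (2,000) and result size (200) come from the text.

```python
import math
from collections import Counter


def partition_score(topic_terms, doc_terms):
    """Cheap first-stage score (assumed form): summed weight of every
    matching term occurrence plus the number of distinct terms hit."""
    counts = Counter(doc_terms)
    hits = [t for t in topic_terms if counts[t] > 0]
    return sum(topic_terms[t] * counts[t] for t in hits) + len(hits)


def cosine(query_vec, doc_terms):
    """Second-stage score (assumed metric): cosine similarity between the
    weighted query vector and the document's raw term-count vector."""
    doc_vec = Counter(doc_terms)
    dot = sum(w * doc_vec[t] for t, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(c * c for c in doc_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0


def evoke_and_discriminate(topic_terms, query_vec, corpus,
                           partition_size=2000, final_size=200):
    """Return (A, B): the top of the rough partition ranking, and the top
    of the partition after re-ranking with the query vector.

    corpus maps a document id to its list of extracted terms (NPs)."""
    # Stage 1 ("evoke"): score every document cheaply, keep a large partition.
    ranked = sorted(corpus,
                    key=lambda d: partition_score(topic_terms, corpus[d]),
                    reverse=True)
    partition = ranked[:partition_size]
    a_results = partition[:final_size]
    # Stage 2 ("discriminate"): re-rank only the partition.
    b_results = sorted(partition,
                       key=lambda d: cosine(query_vec, corpus[d]),
                       reverse=True)[:final_size]
    return a_results, b_results
```

On a toy corpus the two rankings can disagree: a long document with many hits tends to win the first-stage sum, while a short, focused document can win once the score is length-normalized. That is one way the second stage can produce a different ordering from the first.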