3 Overview of CLARIT-TREC Processing

There were three major phases of processing in the CLARIT-TREC retrieval experiments. First, the entire corpus, along with the topic statements, was parsed via CLARIT NLP to extract candidate NPs. In the special case of topics, the candidate NPs were manually reviewed and evaluated to produce weighted query terms. Second, the entire corpus (in noun-phrase form) was passed through a quick, somewhat rough ranking procedure designed to nominate a large subset of documents for further analysis. This step is referred to as "partitioning". A "partitioning thesaurus", or list of weighted, representative terminology, was automatically created for each topic. In the final phase of processing, referred to as "querying", a "query vector" was produced for each topic. The query vector was used to retrieve (= rank) documents in the selected partition for the topic using a vector-space "similarity" metric. The details of these phases are presented below, along with a discussion of the different techniques used for "routing" and "ad-hoc" queries.

3.1 Design Philosophy: "Evoke" and "Discriminate"

In approaching the principal TREC task of returning 200 ranked documents for each topic, we used a two-stage processing strategy, illustrated in Figure 3. The first stage was designed to identify candidate documents that seemed likely to contain information related to a topic. Since each topic was represented as a set of weighted terms, this step involved scoring each document against that set of terms. Because every document in the database had to be scored against every topic, it was important that the scoring procedure not be computationally expensive. In fact, it was based on summing the value and number of "hits" between the topic's set of terms and the terms (NPs) in each document, and it was expected to over-generate the set of candidate documents. The highest-scoring documents were retained as a candidate "partition" of the database with respect to the topic.

The second stage was designed to find the subset of documents in each partition that best matched the topic. In principle, greater (= more discriminating) processing resources could be devoted to this second-stage task, as the total number of documents involved was small compared to the whole collection. In practice, as illustrated in Figure 3, partitioning resulted in a set of 2,000 ranked documents. The top 200 documents from each partition were submitted to NIST as the CLARIT "A" set of results. Final querying, or "discrimination", among the documents in each partition yielded another, more accurately ranked set of 200 documents, which were submitted as the CLARIT "B" results.
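To make the division of labor concrete, here is a minimal sketch of the two-stage strategy. It is illustrative only: the paper presents no code, and the function names, data structures, and exact scoring formulas below are assumptions, not the actual CLARIT implementation.

```python
# Illustrative sketch of "evoke and discriminate" (all details assumed).
from collections import Counter
from math import sqrt

def partition_score(topic_terms, doc_terms):
    """Stage 1 ("evoke"): a cheap score summing the weight and the
    number of hits between topic terms and the document's NPs."""
    hits = [t for t in doc_terms if t in topic_terms]
    return sum(topic_terms[t] for t in hits) + len(hits)

def cosine(query_vector, doc_vector):
    """Stage 2 ("discriminate"): vector-space similarity."""
    dot = sum(w * doc_vector.get(t, 0.0) for t, w in query_vector.items())
    nq = sqrt(sum(w * w for w in query_vector.values()))
    nd = sqrt(sum(w * w for w in doc_vector.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def retrieve(topic_terms, query_vector, corpus, k_partition=2000, k_final=200):
    """corpus: dict mapping doc_id -> list of NP terms."""
    # Evoke: score every document cheaply; keep the top k_partition.
    partition = sorted(corpus.items(),
                       key=lambda kv: partition_score(topic_terms, kv[1]),
                       reverse=True)[:k_partition]
    # (The top 200 of this ranking corresponds to the CLARIT "A" results.)
    # Discriminate: re-rank only the partition with the costlier metric.
    vectors = {doc_id: Counter(terms) for doc_id, terms in partition}
    reranked = sorted(vectors,
                      key=lambda d: cosine(query_vector, vectors[d]),
                      reverse=True)
    return reranked[:k_final]  # corresponds to the CLARIT "B" results
```

The point of the sketch is the asymmetry: partition_score touches every document in the collection but does only hit counting, while the cosine computation is applied only to the 2,000-document partition, where the extra cost is affordable.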
3.2 Overview of the Task

As Figure 4 shows, different portions of the total TREC database were used for the "routing" and "ad-hoc" phases of the experiment. The routing task required "training" the first fifty topics on the first set of data (represented as the darkened block in Figure 4). In the second step of processing, the partitioning and query vectors derived from step one were used to identify, first, 2,000-document partitions in the second set of data (represented as the light block in Figure 4) and, second, the top-200 ranked documents in each partition.

The ad-hoc query task involved the whole database, but the CLARIT team actually used the first set of data for a preliminary retrieval of documents (based on partitioning). A few (5-10) of the top 250 were chosen by quick manual inspection to supplement the query vector, and then a second automated round of partitioning over the total database was performed. The final top-200 ranked documents ultimately derived from these second-pass, 2,000-document partitions.
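The ad-hoc flow can be expressed in the same terms. In the sketch below, augment_query, the flat per-term boost, and inspect_manually are assumptions introduced for illustration; the paper says only that terms from the manually chosen documents supplemented the query vector before a second automated round of partitioning.

```python
from collections import Counter

def augment_query(query_vector, selected_docs, boost=1.0):
    """Fold the NPs of a few manually chosen documents into the query
    vector (the flat per-term boost is an assumed, not reported, rule)."""
    augmented = Counter(query_vector)
    for terms in selected_docs:
        for term in set(terms):
            augmented[term] += boost
    return augmented

# Assumed overall ad-hoc flow, reusing retrieve() from the sketch above:
#   preliminary = retrieve(topic_terms, query_vector, first_data_set)
#   chosen = inspect_manually(preliminary[:250])   # pick 5-10 documents
#   query_vector = augment_query(query_vector, [corpus[d] for d in chosen])
#   final = retrieve(topic_terms, query_vector, whole_database)
```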