exact retrieval status values are computed as necessary. Assume that d7 is the first (i.e. d7 has the highest RSV0), d13 is the second, and d5 is the third document, as shown in Figure 1. In this case, RSV(q, d7) is computed first and RSV(q, d13) is computed next. According to Figure 1, RSV(q, d13) is the higher of the two exact retrieval status values that are known at this moment. Furthermore, all unknown exact retrieval status values are less than or equal to RSV0(q, d5) because of (3). Since RSV(q, d13) > RSV0(q, d5), we can conclude that d13 has the highest exact retrieval status value of all documents. In this example, only two exact retrieval status values had to be computed to determine the top ranked document. In practice, more exact retrieval status values have to be computed.

The evaluation algorithm is based on an access structure that specifies the following functions:

    σ : di ↦ σ(di)
    ν : di ↦ {(φj, aij) : φj ∈ di}
    nidf : φj ↦ nidf(φj)
    size : di ↦ size(di)

The functions σ and ν determine the signatures and the non-inverted document descriptions, respectively. The functions nidf and size provide scaling and normalization factors. The function values σ(di) and ν(di) are determined completely by the document di itself. The function values nidf(φj) and size(di), however, depend on the entire document collection. Fortunately, the variation of nidf(φj) is small if the data collection is sufficiently large. Thus, nidf(φj) has to be recomputed only if the domain of the data collection is shifting, and size(di) has to be recomputed if either nidf was recalculated or di was modified. In the next section, we will see how this basic query evaluation scheme can be refined.

3 Refinements

The basic evaluation algorithm described in Section 2 achieves an excellent update efficiency because the description of a single document is stored in a compact form which can be updated efficiently. However, the retrieval effectiveness as well as the retrieval efficiency need further refinements in the case of the TREC experiment, where extremely large queries are matched against a large document collection containing some large documents.

To improve the retrieval effectiveness, we adapted the basic evaluation algorithm to the best global weighting scheme achieved by the SMART system at TREC-1 [1]. The query features are "ltn" weighted, i.e. the query weights depend on the logarithm of the feature frequencies and on the inverse document frequencies, and they are not cosine normalized. The "lnc" weighted document features depend on the logarithm of the feature frequencies; they are cosine normalized, but they do not have a document frequency component. The detailed definitions of the feature weights are given in Table 1.

In order to achieve a good retrieval efficiency for the TREC collection, we have reduced the indexing vocabulary in such a way that, on the one hand, the retrieval efficiency is improved and, on the other hand, the retrieval effectiveness is not deteriorated very much. Our indexing vocabulary was reduced not only by applying a stop word list and a word reduction algorithm, but also by eliminating further indexing terms. From another project [3] and from the experiences of the SMART system at TREC-1 [1], we knew that the removal of indexing features having a very high document frequency does not deteriorate the retrieval effectiveness very much.
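To illustrate this kind of vocabulary reduction, the following minimal sketch (in Python) drops stop words and indexing features whose document frequency exceeds a fixed fraction of the collection size. The function name, the 10% threshold, and the toy data are assumptions made for illustration only; they are not the values used in our experiments.

    # Minimal sketch of the vocabulary reduction step (names and the
    # document frequency threshold are assumptions, not the values used
    # in our experiments).
    def reduce_vocabulary(doc_freq, num_docs, stop_words, max_df_fraction=0.1):
        # doc_freq maps each indexing feature to the number of documents
        # containing it.
        max_df = max_df_fraction * num_docs
        return {feature for feature, df in doc_freq.items()
                if feature not in stop_words and df <= max_df}

    # Toy example: "the" is removed by the stop word list; any feature
    # occurring in more than 10% of the 10,000 documents would be removed
    # as a high document frequency feature.
    vocab = reduce_vocabulary(
        doc_freq={"the": 9500, "retrieval": 800, "signature": 300},
        num_docs=10000,
        stop_words={"the"},
    )
    # vocab == {"retrieval", "signature"}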
The reason why this restriction has only a small effect on the retrieval effectiveness is the following. The elimination of the high document frequency features changes the retrieval status values very little because these features have a small inverse document frequency and hence contribute little to the retrieval status values. Such an elimination of the high document frequency features is similar to the application of a collection dependent stop word list in addition to a general stop word list. Thus, this restriction has only a small effect on the retrieval status values.

In what follows, we focus on the retrieval efficiency and why it is improved by this restriction. With a reduced indexing vocabulary,

1. fewer documents have a positive approximate retrieval status value,

2. fewer features φj contribute an error (āij − aij) to the approximate retrieval status values, and

3. fewer false matches (caused by collisions) occur when computing the approximate retrieval status values.

Both scanning the signatures and sorting the approximate retrieval status values become faster when fewer approximate retrieval status values are positive. Furthermore, the approximate retrieval status values become tighter upper bounds when fewer false matches occur and when the sum of the errors (āij − aij) is reduced. Having tighter upper bounds means that fewer exact retrieval status values have to be computed until the top ranked document is determined. The retrieval effectiveness and the retrieval efficiency achieved by means of the reduced indexing vocabulary are presented in the next section.
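Before turning to those results, the interplay between the approximate upper bounds and the exact retrieval status values can be made concrete with a short sketch of the evaluation loop of Section 2. It is written in Python; all names are our own, and approx_rsv and exact_rsv stand for the computation of the upper bound RSV0 of (3) and of the exact retrieval status value, respectively, neither of which is spelled out here.

    # Sketch of the basic evaluation loop. approx_rsv(q, d) must be an
    # upper bound on exact_rsv(q, d), cf. (3); both are placeholders here.
    def top_document(query, documents, approx_rsv, exact_rsv):
        # Rank all candidates by their approximate retrieval status values,
        # which are obtained from the signatures.
        candidates = sorted(documents,
                            key=lambda d: approx_rsv(query, d),
                            reverse=True)
        best_doc, best_rsv = None, float("-inf")
        for doc in candidates:
            # Every remaining document is bounded by this approximate value,
            # so the current best exact value can no longer be beaten.
            if best_rsv >= approx_rsv(query, doc):
                break
            # Compute the exact retrieval status value only on demand.
            rsv = exact_rsv(query, doc)
            if rsv > best_rsv:
                best_doc, best_rsv = doc, rsv
        return best_doc, best_rsv

In the example of Figure 1, this loop computes RSV(q, d7) and RSV(q, d13) and then stops, because RSV(q, d13) exceeds RSV0(q, d5), the upper bound on all remaining documents. The tighter the upper bounds, the earlier the loop terminates.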