exact retrieval status values are computed as necessary. Assume that d7 is the first (i.e. d7 has the highest RSV0), d13 is the second, and d5 is the third document, as shown in Figure 1. In this case, RSV(q, d7) is computed first and RSV(q, d13) is computed next. According to Figure 1, RSV(q, d13) is the higher of the two exact retrieval status values that are known at this moment. Furthermore, all unknown exact retrieval status values are less than or equal to RSV0(q, d5) because of (3). Since RSV(q, d13) > RSV0(q, d5), we can conclude that d13 has the highest exact retrieval status value of all documents. In this example, only two exact retrieval status values had to be computed to determine the top ranked document. In practice, more exact retrieval status values have to be computed.

The evaluation algorithm is based on an access structure that specifies the following functions:

    σ : di ↦ σ(di)
    ν : di ↦ {(φj, aij) : φj ∈ di}
    nidf : φj ↦ nidf(φj)
    size : di ↦ size(di)

The functions σ and ν determine the signatures and the non-inverted document descriptions, respectively. The functions nidf and size provide scaling and normalization factors. The function values σ(di) and ν(di) are determined completely by the document di itself. The function values nidf(φj) and size(di), however, depend on the entire document collection. Fortunately, the variation of nidf(φj) is small if the data collection is sufficiently large. Thus, nidf(φj) has to be recomputed only if the domain of the data collection is shifting, and size(di) has to be recomputed if either nidf was recalculated or di was modified. In the next section, we will see how this basic query evaluation scheme can be refined.

3 Refinements

The basic evaluation algorithm described in Section 2 achieves an excellent update efficiency because the description of a single document is stored in a compact form which can be updated efficiently. However, the retrieval effectiveness as well as the retrieval efficiency need further refinements in the case of the TREC experiment, where extremely large queries are matched against a large document collection containing some large documents.

To improve the retrieval effectiveness, we adapted the basic evaluation algorithm to the best global weighting scheme achieved by the SMART system at TREC-1 [1]. The query features are "ltn" weighted, i.e. the query weights depend on the logarithm of the feature frequencies and on the inverse document frequencies, and they are not cosine normalized. The "lnc" weighted document features depend on the logarithm of the feature frequencies; they are cosine normalized, but they do not have a document frequency component. The detailed definitions of the feature weights are given in Table 1.

In order to achieve a good retrieval efficiency for the TREC collection, we have reduced the indexing vocabulary in such a way that, on the one hand, the retrieval efficiency is improved and, on the other hand, the retrieval effectiveness is not deteriorated very much. Our indexing vocabulary was reduced not only by applying a stop word list and a word reduction algorithm, but also by eliminating further indexing terms. From another project [3] and from the experiences of the SMART system at TREC-1 [1], we knew that the removal of indexing features having a very high document frequency does not deteriorate the retrieval effectiveness very much.
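To illustrate this kind of vocabulary reduction, the following minimal sketch (in Python) drops stop words and indexing features whose document frequency exceeds a fixed fraction of the collection size. The function name, the 10% threshold, and the toy data are assumptions made for illustration only; they are not the values used in our experiments.

    # Minimal sketch of the vocabulary reduction step (names and the
    # document frequency threshold are assumptions, not the values used
    # in our experiments).
    def reduce_vocabulary(doc_freq, num_docs, stop_words, max_df_fraction=0.1):
        # doc_freq maps each indexing feature to the number of documents
        # containing it.
        max_df = max_df_fraction * num_docs
        return {feature for feature, df in doc_freq.items()
                if feature not in stop_words and df <= max_df}

    # Toy example: "the" is removed by the stop word list; any feature
    # occurring in more than 10% of the 10,000 documents would be removed
    # as a high document frequency feature.
    vocab = reduce_vocabulary(
        doc_freq={"the": 9500, "retrieval": 800, "signature": 300},
        num_docs=10000,
        stop_words={"the"},
    )
    # vocab == {"retrieval", "signature"}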
The reason why this restriction has only a small effect on the retrieval effectiveness is the following. The elimination of the high document frequency features changes the retrieval status values very little because these features have a small inverse document frequency and hence contribute little to the retrieval status values. Such an elimination of the high document frequency features is similar to the application of a collection dependent stop word list in addition to a general stop word list. Thus, this restriction has only a small effect on the retrieval status values.

In what follows, we focus on the retrieval efficiency and why it is improved by this restriction. With a reduced indexing vocabulary,

1. fewer documents have a positive approximate retrieval status value,

2. fewer features φj contribute an error (āij − aij) to the approximate retrieval status values, and

3. fewer false matches (caused by collisions) occur when computing the approximate retrieval status values.

Both scanning the signatures and sorting the approximate retrieval status values become faster when fewer approximate retrieval status values are positive. Furthermore, the approximate retrieval status values become tighter upper bounds when fewer false matches occur and when the sum of the errors (āij − aij) is reduced. Having tighter upper bounds means that fewer exact retrieval status values have to be computed until the top ranked document is determined. The retrieval effectiveness and the retrieval efficiency achieved by means of the reduced indexing vocabulary are presented in the next section.
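Before turning to those results, the interplay between the approximate upper bounds and the exact retrieval status values can be made concrete with a short sketch of the evaluation loop of Section 2. It is written in Python; all names are our own, and approx_rsv and exact_rsv stand for the computation of the upper bound RSV0 of (3) and of the exact retrieval status value, respectively, neither of which is spelled out here.

    # Sketch of the basic evaluation loop. approx_rsv(q, d) must be an
    # upper bound on exact_rsv(q, d), cf. (3); both are placeholders here.
    def top_document(query, documents, approx_rsv, exact_rsv):
        # Rank all candidates by their approximate retrieval status values,
        # which are obtained from the signatures.
        candidates = sorted(documents,
                            key=lambda d: approx_rsv(query, d),
                            reverse=True)
        best_doc, best_rsv = None, float("-inf")
        for doc in candidates:
            # Every remaining document is bounded by this approximate value,
            # so the current best exact value can no longer be beaten.
            if best_rsv >= approx_rsv(query, doc):
                break
            # Compute the exact retrieval status value only on demand.
            rsv = exact_rsv(query, doc)
            if rsv > best_rsv:
                best_doc, best_rsv = doc, rsv
        return best_doc, best_rsv

In the example of Figure 1, this loop computes RSV(q, d7) and RSV(q, d13) and then stops, because RSV(q, d13) exceeds RSV0(q, d5), the upper bound on all remaining documents. The tighter the upper bounds, the earlier the loop terminates.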