ranks. This effect was due to the positive coefficient of lognumterms, which gave indexing weights larger than 1 for the terms in these documents, thus yielding a high retrieval status value for these documents w.r.t. most queries. Obviously a linear indexing function is not well suited for coping with such extreme distributions of the elements of the relevance description. Besides the effect on the few very long documents, other documents are also affected by this distribution, for the following reason: many terms in the long documents will be assigned indexing weights larger than 1. The computation of the squared error that is to be minimized considers the quadratic difference (y - u(x))² between the indexing weight u(x) and the actual relevance decision y (0 or 1) in the learning sample. Even if y = 1, there is still the difference (u(x) - 1)² for u(x) > 1. Reducing this error, and thus the overall error, comes at the price of a larger error for other indexing weights.

There are three possible strategies for coping with this problem:

* Some experiments performed with other material have shown that overall indexing quality can be improved by excluding outliers from the learning sample ([Pfeifer 91]). Not only those pairs with weights lying on the "correct side" of the interval [0,1] (i.e. either y = 0 and u(x) < 0 or y = 1 and u(x) > 1) should be regarded as outliers; excluding those pairs with weights lying on the "wrong side" (i.e. either y = 0 and u(x) > 1 or y = 1 and u(x) < 0) also yields better results (see the sketch at the end of this section).

* Using logistic functions instead of linear functions overcomes the problem of indexing weights outside the interval [0,1]. However, we have some experimental evidence ([Pfeifer 90]) that even in this case the removal of outliers from the learning sample (where the outliers are defined w.r.t. a linear indexing function) improves the indexing quality.

* For the experiments described here, we switched from linear to polynomial functions, which also gave fairly good results.

The function finally used for indexing with single terms only is

    u(x) = 0.00042293
         + 0.00150083 · tf · logidf · imaxtf
         - 0.00150665 · tf · imaxtf
         + 0.00010465 · logidf
         - 0.00122627 · lognumterms · imaxtf

For indexing with phrases, we first had to derive the set of phrases to be considered. We took all pairs of adjacent non-stopwords that occurred at least 25 times in the D1 (training) document set. Then an indexing function common to single words and phrases was developed by introducing additional binary factors is_single (is_phrase) having the value 1 if the term is a single word (a phrase) and 0 otherwise. The parameters tf and logidf were defined for phrases in the same way as for single words, and in the computation of imaxtf and lognumterms for a document both phrases and single words were considered. This procedure produced the indexing function

    u(x) = 0.00034104
         + 0.00141097 · is_single · tf · logidf · imaxtf
         - 0.00119826 · is_single · tf · imaxtf
         + 0.00014122 · is_single · logidf
         - 0.00120413 · lognumterms · imaxtf
         + 0.00820515 · is_phrase · tf · logidf · imaxtf
         - 0.04188466 · is_phrase · tf · imaxtf
         + 0.00114585 · is_phrase · logidf
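To make the first strategy concrete, the following sketch fits a linear indexing function by least squares and then refits after removing outliers from the learning sample. This is a minimal illustration, not the system used in the experiments: the numpy formulation, the function name, and the assumption that the constant term is represented by a leading column of ones in the feature matrix are ours.

    import numpy as np

    def fit_with_outlier_removal(X, y):
        """Least-squares fit of a linear indexing function, refitted
        after removing outliers from the learning sample.

        X: one relevance description per row, with a leading column
           of ones for the constant term (an assumption here).
        y: the corresponding relevance decisions (0 or 1).
        """
        # First pass: ordinary least squares over the full sample.
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        u = X @ coef
        # Outliers are pairs whose weight falls outside [0, 1]: the
        # "correct side" (y = 0, u < 0 or y = 1, u > 1) together with
        # the "wrong side" (y = 0, u > 1 or y = 1, u < 0).  Combined,
        # these are simply all pairs with u < 0 or u > 1.
        keep = (u >= 0.0) & (u <= 1.0)
        # Second pass: refit on the reduced learning sample.
        coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        return coef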
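The single-term function above translates directly into code. The sketch below assumes the four elements of the relevance description (tf, logidf, imaxtf, lognumterms) have already been computed for a term-document pair as defined earlier in the paper; the function name is ours.

    def u_single(tf, logidf, imaxtf, lognumterms):
        """Polynomial indexing function for single terms, using the
        coefficients quoted above."""
        return (0.00042293
                + 0.00150083 * tf * logidf * imaxtf
                - 0.00150665 * tf * imaxtf
                + 0.00010465 * logidf
                - 0.00122627 * lognumterms * imaxtf)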
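The phrase vocabulary can be derived in a single counting pass. The sketch below treats each document as a list of tokens and pairs immediately adjacent tokens; whether stopwords are skipped over or break adjacency is not spelled out above, so the strict-adjacency reading here is an assumption, as are all of the names.

    from collections import Counter

    def candidate_phrases(documents, stopwords, min_count=25):
        """All pairs of adjacent non-stopwords occurring at least
        min_count times in the training (D1) document set."""
        counts = Counter()
        for tokens in documents:
            for a, b in zip(tokens, tokens[1:]):
                if a not in stopwords and b not in stopwords:
                    counts[a, b] += 1
        return {pair for pair, n in counts.items() if n >= min_count}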
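Finally, the common function for single words and phrases differs from the single-term case only in the two binary factors. Again a sketch with our own naming; is_phrase is taken to be the complement of is_single, as the definition above implies.

    def u_combined(is_single, tf, logidf, imaxtf, lognumterms):
        """Common polynomial indexing function for single words
        (is_single = 1) and phrases (is_single = 0), using the
        coefficients quoted above."""
        is_phrase = 1 - is_single
        return (0.00034104
                + 0.00141097 * is_single * tf * logidf * imaxtf
                - 0.00119826 * is_single * tf * imaxtf
                + 0.00014122 * is_single * logidf
                - 0.00120413 * lognumterms * imaxtf
                + 0.00820515 * is_phrase * tf * logidf * imaxtf
                - 0.04188466 * is_phrase * tf * imaxtf
                + 0.00114585 * is_phrase * logidf)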