takes into account the lexical, semantic and discourse sources of linguistic information in both documents and
queries. Secondly, it can serve as input to a filter which uses a more complex version of the original cut-off criterion
to determine how many documents should be further processed by the system's final modules.

For the Integrated Matcher to produce a combined ranking, each document's similarity value for a given Topic
Statement can be thought of as being composed of two elements. One element is the SFC similarity value and the
other is the similarity value that represents the combined proper noun, complex nominal, and text structure
similarities. Additionally, the system will have computed the regression formula, the mean, and the standard deviation of
the distribution of the SFC similarity values for the individual Topic Statement. Using these statistical values, the
system produces the cut-off criterion value. Since we know from the eighteen-month results that 74% of the
relevant documents had what we refer to as a k-value (then PN value; now PN, CN, TS values) and the remaining
26% of the relevant documents had no k-value, we use this information to predict what proportion of the predicted
relevant documents should come from which segment of the ranked documents for full recall. The combined ranking
can be envisioned as consisting of four segments, as shown in Figure 2.


Docs. having a k-value
& an SFC value                 |  Group 1
above the cut-off
                               ----- cut-off criterion SFC similarity value
Docs. having a k-value
& an SFC value                 |  Group 2
below the cut-off

Docs. having no k-value
& an SFC value                 |  Group 3
above the cut-off
                               ----- cut-off criterion SFC similarity value
Docs. having no k-value
& an SFC value                 |  Group 4
below the cut-off


              Fig. 2: Schematic of Segmented Ranks from SFC & Integrated Ranking (k-value)


Four groups are required to reflect the two-way distinction mentioned above. The first distinction is between those
documents which have a k-value, which should contain 74% of the relevant documents, and those documents without
a k-value, which should contribute 26% of the relevant documents. The second distinction is between those
documents whose SFC similarity value is above the predicted cut-off criterion and those whose SFC similarity value
is not.
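
The two-way distinction can be sketched in Python as follows. This is only an illustrative sketch, not the system's implementation: the names Doc and segment_ranking are hypothetical, a missing k-value is represented as None, and the ordering of documents within each group (by k-value and then SFC value where a k-value exists, otherwise by SFC value alone) is an assumption that the text does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Doc:
    doc_id: str
    sfc: float                 # SFC similarity value for the Topic Statement
    k: Optional[float] = None  # combined PN/CN/TS value; None means no k-value


def segment_ranking(docs: List[Doc], cutoff: float) -> Dict[int, List[Doc]]:
    """Split documents into the four groups of Figure 2."""
    groups: Dict[int, List[Doc]] = {1: [], 2: [], 3: [], 4: []}
    for d in docs:
        above = d.sfc > cutoff  # "above the cut-off"; everything else is "not"
        if d.k is not None:
            groups[1 if above else 2].append(d)
        else:
            groups[3 if above else 4].append(d)
    # Assumed within-group order: k-value first where it exists, then SFC.
    for g in (1, 2):
        groups[g].sort(key=lambda d: (d.k, d.sfc), reverse=True)
    for g in (3, 4):
        groups[g].sort(key=lambda d: d.sfc, reverse=True)
    return groups
```

Calling segment_ranking(docs, cutoff) with the cut-off criterion value predicted for a Topic Statement returns a dictionary keyed by group number 1 to 4.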

When the desired application is a cut-off criterion, the system produces the ranked list in response to a desired
recall level by concatenating the documents above the appropriate cut-off for that level of recall from Group 1, then
the documents above the appropriate cut-off for that level of recall from Group 3. However, since our test results show
that there is a potential 8% error in the predicted cut-off criterion for 100% recall, we use extrapolation to add the
appropriate proportion of the top-ranked documents from Group 2 to Group 1 before concatenating documents from
Group 3. These same values are used to produce the best end-to-end ranking of all the documents using the various
segments.
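
The concatenation order can be sketched as below. The extrapolation that determines how much of Group 2 to add is not spelled out here, so the sketch takes it as a hypothetical parameter, group2_fraction, interpreted (as an assumption) as the fraction of Group 2 to promote; the function name is likewise illustrative. Each input list is assumed to be already ranked, best document first.

```python
import math
from typing import List


def concatenate_for_recall(group1: List[str], group2: List[str],
                           group3: List[str],
                           group2_fraction: float) -> List[str]:
    """Build the output list: Group 1, top of Group 2, then Group 3.

    group2_fraction is a hypothetical stand-in for the extrapolated
    proportion of Group 2 that compensates for cut-off prediction error.
    """
    if not 0.0 <= group2_fraction <= 1.0:
        raise ValueError("group2_fraction must lie in [0, 1]")
    # Round up so that any non-zero fraction promotes at least one document.
    n_extra = math.ceil(group2_fraction * len(group2))
    return list(group1) + list(group2[:n_extra]) + list(group3)
```

With group2_fraction set to 0, the result reduces to the plain concatenation of Group 1 and Group 3 described first.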

Document ranks are produced by the Integrated Matcher, and the cut-off criterion is used either by an individual user
who requires a certain recall level for a particular information need or, as in the twenty-four-month TIPSTER test
situation, by the system to determine how many documents from the Integrated Matcher ranking will be passed on to
the final modules for further processing.

