[Figure 2: Approximation and Natural Language Processing. Development-time knowledge base compilation (lexicon, grammar, knowledge bases; statistical and manual) supports run-time control flow at three levels: Boolean retrieval tables for routing and retrieval, e.g. (establish OR established) AND venture; finite-state pattern matching over lexico-semantic regular expressions for extraction, e.g. (root establish) * venture; and natural language semantic interpretation, e.g. (c-establishing (r-effected (c-venture))).]

...cept of establishing can be expressed, and ensure that the words appear in a grammatically acceptable way in the input. This more detailed analysis can be crucial in data extraction tasks.

A critical point about this model is that even detailed knowledge about language often ends up contributing to the Boolean tables, so the Boolean expressions are considerably more complex than those that could easily be created by hand. Furthermore, the Boolean expressions are a relaxed form of more complex knowledge-based constructs. It is very difficult to predict the impact of simply relaxing constraints. For example, TREC topic 53 includes the phrase "leveraged buy-out". In a strict syntactic analysis, this phrase would have to appear exactly in that form, enforcing both proximity ("leveraged" adjacent to "buy-out") and order ("leveraged" before "buy-out"). But relaxing such constraints seldom has any severe effect on results: the Boolean expression (leveraged AND buy-out) in our system will recognize the two terms occurring anywhere in the same paragraph, which is actually likely to admit more texts about leveraged buy-outs than texts in which the words coincidentally appear together.

Even the phrase "prime rate", which matches some irrelevant texts in its Boolean form (including, for example, one or two texts about Japan that mention "prime minister" in the context of "growth rate"), also admits some additional relevant texts in which "prime" appears in the same paragraph as "rate". The well-known effects of word order and proximity on meaning, exemplified by the distinction between "blind Venetian" and "Venetian blind", do not seem to appear very frequently in real examples, or at least they may be less frequent than one might expect.

In TREC-2, we did not apply any constraints stricter than Booleans at run-time; in other words, we used only a Boolean retrieval engine (because in TREC-1 we found that the stricter constraints did not help). We also had to implement a module to relax some of the "hard" Boolean constraints for topics with a very small number of relevant documents, because of the nature of the performance metrics. However, we still used the knowledge that was developed for more detailed processing, including the phrases, semantic groupings and so forth. In addition, we added a more sophisticated ranking mechanism than we used in TREC-1, because ranking is very important in the evaluation. But the only retrieval engine in our TREC-2 system is a Boolean matcher.

2.1 The Boolean Matcher

The Boolean tables are efficiently organized so that a C program (which we now know as NLGREP) can match them against incoming texts at a rate of about 1 million words per minute. This program spends little time on anything other than marking where words in each document match terms in the table. In both routing and retrieval, the total number of terms used for a set of 50 topics is about 2000, or about 40 unique terms per topic. All other words are ignored entirely.
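To make the table-driven matching concrete, the following is a minimal sketch in C of a paragraph-level Boolean matcher. It is not the actual NLGREP program; the term table, the hand-coded condition for (leveraged AND buy-out), and all names are illustrative assumptions. The sketch only marks which topic terms occur in a paragraph and then evaluates a Boolean condition over those marks, which is the essence of the matching step described above.

/*
 * Minimal sketch of a table-driven Boolean matcher in the spirit of the
 * matcher described above (NOT the actual NLGREP program).  Each topic
 * supplies a small table of terms; the matcher marks which terms occur
 * in a paragraph, then evaluates a Boolean condition over those marks.
 * The term table and the condition below are illustrative assumptions.
 */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_TERMS 40                 /* roughly 40 unique terms per topic */

static const char *terms[MAX_TERMS] = { "leveraged", "buy-out", "buyout" };
static const int nterms = 3;
static int seen[MAX_TERMS];          /* seen[i] != 0 if terms[i] occurred */

/* Case-insensitive string equality (standard C only). */
static int eq_nocase(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Mark every topic term that appears anywhere in one paragraph. */
static void mark_terms(const char *paragraph)
{
    char buf[8192];
    char *tok;

    memset(seen, 0, sizeof seen);
    strncpy(buf, paragraph, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';

    for (tok = strtok(buf, " \t\n.,;:!?\"()"); tok != NULL;
         tok = strtok(NULL, " \t\n.,;:!?\"()")) {
        for (int i = 0; i < nterms; i++)
            if (eq_nocase(tok, terms[i]))
                seen[i] = 1;
    }
}

/* Boolean condition for the example topic: (leveraged AND buy-out),
 * with "buyout" treated as a spelling variant via OR. */
static int paragraph_matches(void)
{
    return seen[0] && (seen[1] || seen[2]);
}

int main(void)
{
    const char *para =
        "The firm completed a leveraged buy-out of its rival last year.";

    mark_terms(para);
    printf("match: %s\n", paragraph_matches() ? "yes" : "no");
    return 0;
}

In the actual system the conditions come from the compiled Boolean tables rather than a hand-written function, and matching runs over full document streams at high speed; the sketch only illustrates the mark-then-evaluate structure.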
For example, the following is a query for the topic "South African sanctions" (using the enhanced regular expression language of GE's pattern matcher [5]):