Bayesian Inference with Node Aggregation for
                                  Information Retrieval
                                       Brendan Del Favero
                                          Robert Fung
                               Institute for Decision Systems Research
                                 350 Cambridge Avenue, Suite 380
                                       Palo Alto, CA 94306
                                       idsr@netcom.com

 1     Introduction
 Information retrieval can be viewed as an evidential
 reasoning problem. Given a representation of a document
 (e.g., the presence or absence of selected words and
 phrases), and a representation of an information need (e.g.,
 topics of interest), the problem of information retrieval is to
 infer the degree to which the document matches the
 information need. Since probability theory is the classical
 choice for automating evidential reasoning, probabilistic
 approaches to information retrieval are natural and have
 had a long history, starting in the 1960s (Maron & Kuhns,
 1960).
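
 As a concrete illustration of this evidential-reasoning view, the
 following minimal sketch computes the probability that a document is
 relevant to a single topic, given which features are present or absent.
 It assumes the features are conditionally independent given relevance (a
 naive-Bayes simplification), which is not the method developed in this
 paper. The function name, feature names, and probability values are all
 hypothetical and chosen only for illustration.

```python
def posterior_relevance(prior, likelihoods, observed):
    """Return P(relevant | observed features) via Bayes' rule.

    Assumption (illustrative only): features are conditionally
    independent given whether the document is relevant.

    prior       -- P(relevant) before looking at the document
    likelihoods -- feature -> (P(present | relevant),
                               P(present | not relevant))
    observed    -- feature -> True if the feature occurs in the document
    """
    def joint(relevant):
        # Start from the prior of the hypothesis being scored.
        p = prior if relevant else 1.0 - prior
        for feature, present in observed.items():
            p_present = likelihoods[feature][0 if relevant else 1]
            # Multiply in the probability of what was actually observed.
            p *= p_present if present else 1.0 - p_present
        return p

    rel, not_rel = joint(True), joint(False)
    # Normalize over the two hypotheses.
    return rel / (rel + not_rel)


# Hypothetical numbers for a hypothetical "mergers" topic.
likelihoods = {"merger": (0.8, 0.1), "acquisition": (0.6, 0.05)}
observed = {"merger": True, "acquisition": False}
score = posterior_relevance(0.1, likelihoods, observed)
```

 Documents can then be ranked by this score. With no observed features
 the function returns the prior unchanged, which is a quick check on the
 normalization.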

 In this paper we describe research that adapts and applies
 Bayesian networks, a new technology for probabilistic
 representation and inference, to information retrieval. The
 technology has substantial advantages over older
 technologies, including an intuitive representation and a set
 of efficient inference algorithms. We discuss the Bayesian
 network technology and probabilistic information retrieval
 in Section 2 of this paper.

 Our research is directed at developing a probabilistic
 information retrieval architecture that:
  * is oriented towards assisting users who have stable
      information needs in routing (i.e., sorting through)
      large amounts of time-sensitive material,
  * gives users an intuitive language with which to specify
      their information needs,
  * requires modest computational resources (i.e., memory
      and CPU speed), and
  * can integrate relevance feedback and training data with
      users' judgements to incrementally improve retrieval
      performance.

 Towards these goals, we have developed a system that
 allows a user to specify: multiple topics of interest (i.e.,
 information needs), qualitative and quantitative
 relationships between the topics, document features that
 relate to the topics, and quantitative relationships between
 these features and the topics. The system runs on a
 Macintosh II computer and can use training data to estimate
 any of the quantitative values in the system. We discuss the
 particular methods we developed and used in our system in
 Section 3.



We participated in the exploratory group (Category B) of
the 1993 Text Retrieval Conference (TREC-2), sponsored
by the National Institute of Standards and Technology
(NIST). As a participant in the exploratory group, we were
tasked with working with a subset of the TREC-2 training
and test data. Our training data consisted of Wall Street
Journal (WSJ) articles and our test data consisted of San
Jose Mercury News (SJMN) articles. We chose a subset of
10 topics out of the 50 TREC-2 routing topics to best
illustrate the methods and concepts we developed. The
choice of the 10 topics was reported to the TREC
coordinators prior to our training runs and, of course, prior
to our receipt of the test data. We generated routing queries
for each of the 10 chosen topics, trained against the WSJ
training set to improve our queries, and tested these queries
against the SJMN articles in the test data set.

Our system was developed entirely within the duration of
the TREC-2 project (January 1993 to June 1993), including the
document handling, feature extraction, inference, and
reporting capabilities. Our TREC-2 team consisted of the
two authors. We describe the experimental set-up in
Section 4 and the results of our test run in Section 5.

We are very encouraged by the test results and have many
ideas for future research, which we discuss in Section 6.


2     Background
In this section we describe the Bayesian network
technology and outline the previous efforts in probabilistic
information retrieval.

2.1 Bayesian Networks

While probability theory provides a suitable theoretical
foundation for evidential reasoning, a technology based on
probability theory that is computationally tractable and that
includes an effective methodology for acquiring the needed
probabilistic information has been lacking. Recent
developments in Bayesian networks have provided these
features. As the name suggests, the technology is based on
a network representation of probabilistic information
(Howard & Matheson, 1981; Pearl, 1988).

A Bayesian network represents beliefs and knowledge
about a particular class of situations. The use of Bayesian