Bayesian Inference with Node Aggregation for Information Retrieval

Brendan Del Favero
Robert Fung
Institute for Decision Systems Research
350 Cambridge Avenue, Suite 380
Palo Alto, CA 94306
idsr@netcom.com

1 Introduction

Information retrieval can be viewed as an evidential reasoning problem. Given a representation of a document (e.g., the presence or absence of selected words and phrases), and a representation of an information need (e.g., topics of interest), the problem of information retrieval is to infer the degree to which the document matches the information need. Since probability theory is the classical choice for automating evidential reasoning, probabilistic approaches to information retrieval are natural and have had a long history, starting in the 1960's (Maron & Kuhns, 1960).

In this paper we describe research that adapts and applies Bayesian networks, a new technology for probabilistic representation and inference, to information retrieval. The technology has substantial advantages over older technologies including an intuitive representation and a set of efficient inference algorithms. We discuss the Bayesian network technology and probabilistic information retrieval in Section 2 of this paper.

Our research is directed at developing a probabilistic information retrieval architecture that:

* is oriented towards assisting users that have stable information needs in routing (i.e., sorting through) large amounts of time-sensitive material,
* gives users an intuitive language with which to specify their information needs,
* requires modest computational resources (i.e., memory and CPU speed), and
* can integrate relevance feedback and training data with users' judgements to incrementally improve retrieval performance.
Towards these goals, we have developed a system that allows a user to specify: multiple topics of interest (i.e., information needs), qualitative and quantitative relationships between the topics, document features that relate to the topics, and quantitative relationships between these features and the topics. The system runs on a Macintosh II computer and can use training data to estimate any of the quantitative values in the system. We discuss the particular methods we developed and used in our system in Section 3.

We participated in the exploratory group (Category B) of the 1993 Text Retrieval Conference (TREC-2), sponsored by the National Institute of Standards and Technology (NIST). As a participant in the exploratory group, we were tasked with working with a subset of the TREC-2 training and test data. Our training data consisted of Wall Street Journal (WSJ) articles and our test data consisted of San Jose Mercury News (SJMN) articles.

We chose a subset of 10 topics out of the 50 TREC-2 routing topics to best illustrate the methods and concepts we developed. The choice of the 10 topics was reported to the TREC coordinators prior to our training runs and, of course, prior to our receipt of the test data. We generated routing queries for each of the 10 chosen topics, trained against the WSJ training set to improve our queries, and tested these queries against the SJMN articles in the test data set.

Our system was developed entirely within the duration of the TREC-2 project (January 93 to June 93) including the document handling, feature extraction, inference, and reporting capabilities. Our TREC-2 effort consisted of the two authors. We describe the experimental set-up in Section 4 and the result of our test run in Section 5. We are very encouraged by the test results and have many ideas for future research, which we discuss in Section 6.
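
As a rough illustration of the evidential-reasoning view of retrieval described in this introduction, the sketch below scores one document against one topic from the presence or absence of a few features. It is not the system discussed in Section 3: it assumes the features are conditionally independent given relevance (a naive Bayes simplification), and the function name, feature names, and probability values are all hypothetical.

```python
def posterior_relevance(prior, likelihoods, observed):
    """Return P(relevant | observed features) for a single topic.

    Assumption (not from the paper): features are conditionally
    independent given relevance, i.e. a naive Bayes model.

    prior       -- P(document is relevant to the topic)
    likelihoods -- {feature: (P(present | relevant), P(present | not relevant))}
    observed    -- {feature: True if present in the document, else False}
    """
    p_rel, p_not = prior, 1.0 - prior
    for feature, present in observed.items():
        p_given_rel, p_given_not = likelihoods[feature]
        # Multiply in the likelihood of what was actually observed.
        p_rel *= p_given_rel if present else 1.0 - p_given_rel
        p_not *= p_given_not if present else 1.0 - p_given_not
    # Normalize over the two hypotheses (relevant / not relevant).
    return p_rel / (p_rel + p_not)


# Hypothetical numbers chosen only for illustration.
likelihoods = {"merger": (0.8, 0.2), "acquisition": (0.6, 0.1)}
score = posterior_relevance(0.5, likelihoods, {"merger": True, "acquisition": False})
```

In a routing setting, a score like this could be compared against a threshold or used to rank incoming documents; the likelihood values are the kind of quantity that training data would be used to estimate.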
2 Background

In this section we describe the Bayesian network technology and outline the previous efforts in probabilistic information retrieval.

2.1 Bayesian Networks

While probability theory provides a suitable theoretical foundation for evidential reasoning, a technology based on probability theory that is computationally tractable and that includes an effective methodology for acquiring the needed probabilistic information has been lacking. Recent developments in Bayesian networks have provided these features. As the name suggests, the technology is based on a network representation of probabilistic information (Howard & Matheson, 1981; Pearl, 1988).

A Bayesian network represents beliefs and knowledge about a particular class of situations. The use of Bayesian