Abstract: Most engines used for searching information resources via the Internet employ the Boolean Retrieval Model. Two main drawbacks of this model are that users have difficulty to precisely formulate their concept (or, topic) of interest using Boolean logic and the resulting output is not ranked. We propose to address both these problems by employing a Concept-based Retrieval Model, where a concept is defined by a set of production rules and the rule-base is represented as a rule-base tree. Features of a prototype developed at USL, referred to as the Concept-Set Structuring System (CS
The Internet has become a popular medium for the exchange of information and ideas among various groups of individuals. One group in particular that has become increasingly dependent upon the Internet for the timely exchange of research ideas and results is the scientific community. For instance, it is not uncommon for a scientist to regularly retrieve, through the Internet, both pre-print and printed scientific reports and papers. The Internet's increased role in the dissemination of scientific, as well as non-scientific, information has heightened the need for the design and implementation of efficient and effective Internet search engines.
The design of most commercially developed Internet search engines is based on the Boolean Retrieval Model [1,2]. A system that adheres to this model supports the formulation of search requests by combining individual search terms with the Boolean operators AND, OR and NOT. The occurrence of a Boolean operator within a search request is typically processed through the application of the corresponding AND, OR or NOT set operator. For example, a Boolean retrieval system would process the search request, "data AND mining", by applying the intersection operator to the sets A and B, where A and B are comprised of documents from the system's database that contain the index terms "data" and "mining", respectively. In other words, a document is retrieved if and only if it belongs to the intersection of sets A and B. A characteristic of Boolean retrieval systems is that the rank order of retrieved documents is arbitrary since a document either satisfies, or does not satisfy, a user's search request.
A Boolean retrieval system, although simple to design and implement, has a number of well documented shortcomings [2]. These shortcomings include the inability to allow a user to assign weights of importance to terms within a search request or a document, and the inability to rank a list of retrieved documents based on the documents' estimated degrees of usefulness to the user. Unfortunately, these shortcomings are amplified in the context of the Internet since the volume of searchable information is extremely large. The consequence, in general, of using a Boolean retrieval system to search the Internet in order to satisfy a specific information need is that the results are either too narrow or too broad.
In response to these noted shortcomings, various extensions have been suggested to the Boolean Retrieval Model [2]. Of special interest to the current work is the utilization of a user or expert defined knowledge base. The application of expert system technology was first explored in the context of an information retrieval system referred to as the RUBRIC [3]. The novelty of RUBRIC is its ability to support a user-defined rule-base that is used in formulating requests at the users' conceptual level. A rule-base is a mechanism for representing a set of concepts and the relationships among the concepts. The rule-base allows concept-based retrieval whereby a query expressed at the conceptual level is translated into a series of Boolean expressions that can subsequently be processed by one or more underlying search engines. This enhancement is aimed at alleviating the previously stated shortcomings of the Boolean Retrieval Model. In this paper, we describe the design and implementation of a prototype system for concept-based retrieval that is aimed at improving the performance of search engines accessible over the Internet. Specifically, we have developed a prototype Concept-Set Structuring System, referred to as CS
The key feature of concept-based retrieval is its support for search requests that are formulated by means of concepts structured as a rule-base tree. The significance of this feature is that concepts of interest are formulated using a top-down refinement strategy. In a top-down strategy, the first step is to express a given request as a single concept. The stated concept is intended to represent the search request at a very abstract level. The next step is to refine the initial concept by decomposing it into a set of component parts that are related through either the AND or OR logical operator. The individual components may take the form of a new concept defined at a different abstraction level, a text expression, or a single index term. In each case, a weight value is assigned to the individual concept-component pairs that are formed during the decomposition process. The assigned weight value represents the user's belief in the degree to which a given component characterizes the related concept. The process of decomposing concepts into component parts is repeated until a level of abstraction is reached in which every terminal node in the constructed rule-base tree represents an index term or a text expression.
Figure-1 shows the rule-base tree that has been constructed for the user defined concept "Human-Health-Science" where the leaf nodes are index terms and are enclosed within double quotations, the internal nodes are concepts, and the weight values are whole numbers displayed along the edges connecting concepts and components. The "Human-Health-Science" concept is first decomposed into two component parts, "Human" and "Health-Science", which are related to the initial concept by means of the AND operator. In this particular case, both the "Human" and "Health-Science" components represent concepts, but are defined at an abstraction level below the "Human-Health-Science" concept. In turn, the two component concepts, like the initial concept, have also been decomposed into component parts. As shown in Figure-1, the components of the "Human" concept are given as the index terms "Man", "Woman" and "Human"; and, the components of the concept "Health-Science" are given as the concepts "Health-Effects", "Biological-Effects", "Risk-Assessment" and "Molecular and Genomic Science". The decomposed components of the "Human" concept represent index terms, and thus are not decomposed further. However, the components of the "Health-Science" concept represent concepts themselves, and thus are decomposed into component parts. The decomposition, or refinement, process continues until every terminal node represents an index term. In the current example, this condition is reached following the decomposition of the concepts "Health-Effects", "Biological-Effects", "Risk-Assessment" and "Molecular and Genomic Science".
Figure-1 "Human-health-Science" rule-based tree.
The evaluation of a user-defined concept requires an analysis of the given concept's rule-base tree. In the case of RUBRIC the evaluation process is based upon a "run-time", bottom-up analysis of a rule-base tree. The term "run-time" is used here to denote the fact that the analysis of a rule-base tree occurs only during its application against a document database. In other words, there is no static analysis of a rule-base tree, for efficiency reasons, prior to its application against a document database.
The actual run-time evaluation of a rule-base tree begins with the matching of each document D in the given database against the index terms that occur in the given tree. Given a document D, if an index term that occurs in D is present in the rule-base tree, then the term is assigned a weight value of one; otherwise, it is assigned a weight value of zero. The assigned weight values are subsequently propagated upward in the rule-base tree. In general, the propagation of such values is governed through the application of the following two rules. In the case of an AND relationship the assigned weight value is determined by the expression min {component_wt
To illustrate the run-time evaluation of a rule-base tree consider the tree in Figure-1 and a document D whose index terms include, "human", "Genes", "DNA", and "Risk Assessment". Figure-2 shows, in parenthesis, the weight value assigned to each component in the rule-base tree as a result of the upward propagation of the term weights. The process begins with the assignment of weight ‘1’ to the leaf nodes corresponding to the index terms included in D. The method of weight propagation defined in the previous paragraph is applied to determine the importance of document D to the intermediate nodes. The weight value assigned to the root concept, or to the user's concept of interest, represents the current document's Retrieval Status Value (RSV). For example, the RSV of document D in the above example is 0.8. In general, a RSV is an estimate of a document's anticipated usefulness to the user as determined by the retrieval system [1]. The use of RSVs allows RUBRIC to rank documents in a non-increasing order of anticipated usefulness to the user.
Figure-2 Initial propagation of weight values in Fig. 1
In the next section, we describe how concept-based retrieval has been implemented in our prototype system, CS
The goal of the Concept-Set Structuring System (CS
A. Development of a Rule-Base Tree
The CS
Figure-3 CS
B. Evaluation of a Rule-Base Tree
The CS
Figure-4 Initial propagation of index terms in Fig.1
The next step is to propagate upward the previously constructed logical expressions. Specifically, the constructed expressions corresponding to components combined by the AND operator are propagated upward as a single search expression, while the expressions of components combined by the OR operator are propagated upward as multiple search expressions. In the latter case, each operand represents a distinct search expression and has its own assigned weight value. For example, the result of the upward propagation of the initial expressions constructed in Figure-4 is shown in Figure-5. In general, the propagation of logical expressions continues until the concept within the given hierarchy that represents the user's current information needs is itself represented as a set of weighted expressions. The final constructed set of expressions represents the MTS of the given concept. The MTS corresponding to the concept "Human-Health-Science" in Figure-1 is shown in Figure-6.
Figure-5 Propagation of search expressions constructed in Fig.4
Figure-6. Minimal Term Set for the concept "Human-Health-Science"
Each MTS that is constructed by CS
C. Initiation of a Retrieval Search
The selection of the "retrieval" control button causes the CS
After receiving all of the results from the evaluated MTS expressions, the CS
D. Interface with Boolean Search Engines
The interface between the CS
The CS
In this section we compare the results produced by the CS
A rule-base tree was generated using the CS
This paper has reported on the development the Concept-Set Structuring Subsystem (CS
We have several research projects planned with respect to concept-based retrieval and the CS
Figure-9 Results produced by the CS
Acknowledgments:
This work is supported in part by a grant from the U.S. Department of Energy (under Grant No. DE-FG02-97ER1220). We acknowledge Rick Fanning and Bob Donohue for their development of the cgi-script that allows the CS
Author Biography:
Fenghua Lu is a computer science M.S. student at the Center for Advanced Computer Studies (CACS) at the University of Southwestern Louisiana (USL). Tom Johnsten is a visiting assistant professor in the computer science department at USL. Vijay Raghavan is a professor of computer science at CACS. The research interests of Fenghua, Tom and Vijay are in database management, data mining, information retrieval and Internet computing. Dennis Traylor is an information scientist at the Department of Energy’s Office of Scientific and Technical Information.
[1] G. Salton and M. McGill, An Introduction to Modern Information Retrieval,
New York, NY: McGraw-Hill, 1983.
[2] K. Jones and P. Willett, "Introduction to Chapter Five," in Readings in Information
Retrieval (K. Jones and P. Willett, eds.), San Francisco, CA:Morgan Kaufmann,
pp.257-263, 1997.
[3] B. McCune, R. Tong, J. Dean and D. Shapiro, "RUBRIC: A System for Rule-Based
Information Retrieval," IEEE Transactions on Software Engineering, vol. 11:2,
pp. 939-944, 1985.
[4] "U.S. Department of Energy: Information -Bridge System,"
http://www.osti.gov/bridge/home.html.
[5] A. Alsaffar, J. Deogun, V. Raghavan and H. Sever, "Concept Based
Retrieval By Minimal Term Sets," International Symposium on Methodologies for
Intelligent Systems, Warsaw Poland, June , 1999.
[6] "U.S. Department of Transportation: National Transportation Library System,"
http://www.bts.gov/NTL/.
[7] "U.S. Department of Energy: Environmental Science Network,"
http://apollo.osti.gov:2001/.
InForum '99 Home Page |
Proceedings
|