Querying Heterogeneous Molecular Biology Databases*

Victor M. Markowitz, I-Min A. Chen, Anthony Kosky, and Ernest Szeto

Information and Computing Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720

Molecular biology data are scattered among multiple data repositories, including molecular biology databases (MBDs). Although they contain related data, these repositories are often isolated and are characterized by various degrees of heterogeneity: they usually represent different views (schemas) of the molecular biology domain and are implemented using different database management systems (DBMSs). Comprehensive studies of biological data often involve examining data across heterogeneous databases.

Solutions currently promoted for querying data across heterogeneous MBDs involve constructing MBD federations or data warehouses, such as the Genome Topographer (Cold Spring Harbor Laboratory) and the Integrated Genomic Database (German Cancer Research Institute). These solutions entail constructing a global view of a collection of MBDs: definitions of the component MBDs are expressed in a common language, and discrepancies between these definitions are resolved before they are integrated into the global view. For data warehouses, data from the MBDs must also be loaded into a central data repository. The main problem with MBD federations and data warehouses is the complexity of constructing global views. Data warehouses have the additional problems of falling out of synchronization with evolving component MBDs and of potentially extremely large physical sizes.

Querying heterogeneous MBDs can be achieved without constructing MBD federations or data warehouses by organizing MBDs in a loose multidatabase system. We have developed a multidatabase query strategy for MBDs implemented using relational DBMSs, in the context of the Object-Protocol Model (OPM) data management tools [1]. For MBDs that have not been developed using OPM, OPM views of the MBDs are first constructed using an OPM retrofitting tool. Existing OPM tools then provide facilities for examining MBD schemas and for browsing and querying individual MBDs associated with OPM views.

Our multidatabase query strategy is based on an MBD dictionary that contains information on MBDs, including their OPM views, DBMS implementation, and links to other MBDs. A multidatabase query tool processes queries over heterogeneous MBDs associated with OPM views by (1) decomposing these queries into subqueries for individual MBDs, (2) using existing OPM query tools to process the subqueries, and (3) assembling the subquery results into multidatabase query results. Our query strategy assumes that users understand the structure and semantics of the MBDs they query. In a related project, we plan to develop an MBD Library containing comprehensive documentation on MBDs, with facilities that will assist users in expressing multidatabase queries.
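The three processing steps above can be illustrated with a minimal Python sketch. It is not the OPM implementation: the database names (`genes_db`, `seqs_db`), their attributes, the toy rows, the dictionary layout, and all function names are hypothetical, and in-memory filtering stands in for the existing OPM query tools. It assumes equality conditions only and a single dictionary link between two MBDs.

```python
# Hypothetical MBD dictionary: per-database DBMS label and links to other MBDs.
# A link maps a local attribute to (other database, attribute in that database).
MBD_DICTIONARY = {
    "genes_db": {"dbms": "relational-A",
                 "links": {"seqs_db": ("gene_id", "gene_ref")}},
    "seqs_db": {"dbms": "relational-B", "links": {}},
}

# Toy rows standing in for the contents of each component database.
TOY_DATA = {
    "genes_db": [
        {"gene_id": "G1", "symbol": "ABC1", "chromosome": "7"},
        {"gene_id": "G2", "symbol": "XYZ2", "chromosome": "7"},
        {"gene_id": "G3", "symbol": "DEF3", "chromosome": "X"},
    ],
    "seqs_db": [
        {"seq_id": "S1", "gene_ref": "G1", "length": 1200},
        {"seq_id": "S2", "gene_ref": "G2", "length": 300},
        {"seq_id": "S3", "gene_ref": "G3", "length": 900},
    ],
}


def decompose(conditions, mbds):
    """Step 1: group (mbd, attribute, value) conditions into one subquery per MBD."""
    subqueries = {mbd: [] for mbd in mbds}
    for mbd, attr, value in conditions:
        subqueries[mbd].append((attr, value))
    return subqueries


def run_subquery(mbd, conds):
    """Step 2: stand-in for a single-database query tool (filters toy rows)."""
    return [row for row in TOY_DATA[mbd]
            if all(row[attr] == value for attr, value in conds)]


def assemble(left_rows, right_rows, left_attr, right_attr):
    """Step 3: combine subquery results along the dictionary link."""
    index = {}
    for row in right_rows:
        index.setdefault(row[right_attr], []).append(row)
    return [{**left, **right}
            for left in left_rows
            for right in index.get(left[left_attr], [])]


def multidatabase_query(conditions, source, target):
    """Run a two-database query using the link recorded in the dictionary."""
    left_attr, right_attr = MBD_DICTIONARY[source]["links"][target]
    subqueries = decompose(conditions, [source, target])
    left_rows = run_subquery(source, subqueries[source])
    right_rows = run_subquery(target, subqueries[target])
    return assemble(left_rows, right_rows, left_attr, right_attr)


# Usage: sequences for all genes on chromosome 7.
result = multidatabase_query([("genes_db", "chromosome", "7")],
                             "genes_db", "seqs_db")
```

In this sketch the user must already know that `gene_ref` in one database refers to `gene_id` in the other, which mirrors the assumption that users understand the structure and semantics of the MBDs they query.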

Work is underway on applying the multidatabase query strategy outlined above to support queries over the new versions of the Genome Data Base (GDB), the Genome Sequence Data Base (GSDB), and the Protein Data Bank (PDB).

*Supported by a grant from the Director, Office of Energy Research, Office of Health and Environmental Research, of the U.S. Department of Energy under Contract DE-AC03-76SF00098.

[1] Chen, I.A., and Markowitz, V.M., An Overview of the Object-Protocol Model (OPM) and OPM Data Management Tools, Information Systems, Vol. 20, No. 5 (July 1995), pp. 393-418.


Abstracts scanned from text submitted for January 1996 DOE Human Genome Program Contractor-Grantee Workshop.
