IMG: Integrated Microbial Genomes

IMG Lineage

IMG joins a large collection of commercial and academic biological data management systems [3]. IMG has benefited from the experience gained in the past decade in developing biological data management systems and from the research in the area of microbial genome data analysis.

Data Management

From a data management perspective, IMG integrates data from multiple biological data sources. Integration of biological data resources has been studied extensively over the years because these resources keep proliferating and because biological data exploration inherently requires access to multiple resources. Data integration solutions fall roughly into two classes: data warehouses based on extract-transform-load (ETL) and database federations based on resource wrappers [16]. While the data federation methodology is more appealing and has been the subject of extensive research, the data warehouse approach has proven to be better suited for dealing with inherently imprecise biological data that requires substantial manual data curation [2].

IMG follows the data warehouse methodology and overall architecture that has proven to be successful in building academic systems such as GUS [2] and commercial systems such as Genesis [7] and Resolver [14]. Similar to Genesis and Resolver, IMG has been developed using the Oracle database management system for hosting the underlying data warehouse, while developing custom ETL and data cleansing tools to address problems specific to the microbial genome data domain. IMG has incipient OLAP (on-line analytical processing) elements to deal with microbial genome data dimensionality [8] that will be extended as the system evolves and its data content grows.

The IMG database has been developed using the OPM toolkit [6] which allows describing database structures in terms of classes of objects and provides support for dealing effectively with rapid schema (data model) evolution. Furthermore, IMG has benefited from the experience gained in defining biological database schemas using this toolkit, including the schemas for the GDB Human Genome Database [4] and the gene annotation component of Genesis.

Data Content and Analysis

IMG includes data from public genome data resources such as EBI's Genome Reviews [5] and NCBI's RefSeq [13]. IMG is based on the fundamental principle, most recently expressed in the FIG manifesto [12], that the value of genome analysis increases with the number of genomes available as a context for comparative analysis. Consequently, one of the key questions for IMG was finding a primary source of public microbial genomes that includes most, if not all, of the public complete microbial genomes. An additional requirement concerns annotations, which must be amenable to integration with additional annotations available in other public data sources. EBI's Genome Reviews satisfies these requirements and therefore serves as IMG's main source for public microbial genome data.

The comparative analysis capabilities in IMG are based on techniques that follow observed biological evolutionary phenomena regarding functional coupling of genes [1, 9]. While IMG's overall analysis flow (see Data Analysis) is original, some of IMG's user interface capabilities have been inspired by other systems, including those mentioned above. For example, IMG's organism browser was inspired by MBGD's taxonomy browser, while its Gene Cart and Gene Page were inspired by Gene Logic's Ascenta system. Some of IMG's functionality is similar to analogous functionality in several microbial genome data analysis systems such as WIT [10], ERGO [11], SEED [15], MBGD [17], PUMA2 [18], and VIMSS's Microbes Online [19]. For example, all these systems provide some form of Gene Pages, BLAST, and keyword searches, which are also common to other biological data resources.

IMG also has a number of unique analytical capabilities. For instance, instead of restricting users to a predefined collection of metabolic pathways compiled from the literature and mostly comprising model organisms, IMG lets users define their own pathways and functional categories by employing Gene, COG, Enzyme and Pfam Analysis Carts regardless of existing annotations. Such user-defined pathways can be further analyzed using a variety of tools, such as COG, Enzyme and Pfam Profiles, Search by Phylogenetic Profile, Abundance Profiles, and the multi-genome alignment tool VISTA (see UsingIMG). These tools were developed specifically to enable the analysis of genomes that are poorly characterized, are phylogenetically distant from model organisms, and cannot be analyzed efficiently using traditional pathway databases.
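The idea behind a phylogenetic profile search can be illustrated with a minimal sketch. It is not IMG's implementation: the function name, the genome labels, and the COG identifiers below are all hypothetical, and the sketch assumes that each genome has already been reduced to the set of function identifiers found in it. It selects the functions present in every genome of one group and absent from every genome of another.

```python
from typing import Dict, Set


def phylogenetic_profile(
    genome_functions: Dict[str, Set[str]],
    present_in: Set[str],
    absent_from: Set[str],
) -> Set[str]:
    """Return function IDs found in every genome of `present_in`
    and in no genome of `absent_from`."""
    if not present_in:
        raise ValueError("present_in must name at least one genome")
    # Functions shared by all genomes that must contain them.
    shared = set.intersection(*(genome_functions[g] for g in present_in))
    # Functions seen in any genome that must lack them.
    excluded = set().union(*(genome_functions[g] for g in absent_from))
    return shared - excluded


# Toy data (hypothetical genomes and identifiers, for illustration only).
genome_functions = {
    "genome_A": {"COG0001", "COG0002", "COG0003"},
    "genome_B": {"COG0001", "COG0003", "COG0004"},
    "genome_C": {"COG0003", "COG0005"},
}

result = phylogenetic_profile(
    genome_functions,
    present_in={"genome_A", "genome_B"},
    absent_from={"genome_C"},
)
print(sorted(result))
```

Because the sketch works on plain sets of identifiers, the same function could be applied to COG, Enzyme, or Pfam assignments without change.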

Ultimately, the value of comparative analysis depends on data quality [12]. In IMG, accuracy of annotations is ensured using validation and correction procedures (see Data Cleaning), while annotation coherence involves detecting and resolving semantic discrepancies caused by different annotation methods, nomenclatures, etc. As IMG evolves, substantial effort will be invested in improving the coherence of annotations across all its genomes.
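One simple form of coherence checking can be sketched as follows. This is an illustrative assumption rather than a description of IMG's procedures: the function name, the record layout (gene ID, function ID, product name), and the sample records are hypothetical. The sketch flags function identifiers whose genes carry more than one distinct product name after trivial normalization of case and whitespace.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def find_name_discrepancies(
    annotations: Iterable[Tuple[str, str, str]],
) -> Dict[str, Set[str]]:
    """Given (gene_id, function_id, product_name) records, return the
    function IDs associated with more than one normalized product name."""
    names_by_function = defaultdict(set)
    for _gene_id, function_id, product_name in annotations:
        # Normalize case and collapse runs of whitespace.
        normalized = " ".join(product_name.lower().split())
        names_by_function[function_id].add(normalized)
    return {f: n for f, n in names_by_function.items() if len(n) > 1}


# Hypothetical records, for illustration only.
records = [
    ("gene1", "FUNC_1", "Hexokinase"),
    ("gene2", "FUNC_1", "hexokinase  "),
    ("gene3", "FUNC_1", "ATP-dependent hexokinase"),
    ("gene4", "FUNC_2", "DNA ligase"),
    ("gene5", "FUNC_2", "DNA  Ligase"),
]

discrepancies = find_name_discrepancies(records)
print(discrepancies)
```

Names that differ only in case or spacing are treated as identical, so only genuinely different names are reported; resolving them would still require curation.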

References

1. Bowers, P.M., Pellegrini, M., Thompson, M.J., Fierro, J., Yeates, T.O., and Eisenberg, D. 2004. Prolinks: A Database of Protein Functional Linkages Derived from Coevolution, Genome Biology 5.

2. Davidson, S.B., Crabtree, J., Bunk, B., Schug, J., Tannen, V., and Stoeckert, C. K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources, IBM Systems Journal, 40, 512-531, 2001.

3. Galperin, M.Y. The Molecular Biology Database Collection: 2005 Update, Nucleic Acids Research, 33, Database Issue, D5-D24, 2005.

4. GDB Human Genome Database. See also: GDB Schema Documentation.

5. Kersey, P., et al. Integr8 and Genome Reviews: Integrated Views of Complete Genomes and Proteomes, Nucleic Acids Research 33, D297-D302, 2005. See Genome Reviews.

6. Markowitz, V.M., Chen, I.A., Kosky, A.S., and Szeto, E. Object-Protocol Model Data Management Tools '97. In Bioinformatics: Databases and Systems, Stan Letovsky (ed), Kluwer Academic Publishers, pp. 187-199, 1999.

7. Markowitz, V.M., Campbell, J., Chen, I.A., Kosky, A., Palaniappan, K., and Topaloglou, T. Integration Challenges in Gene Expression Data Management. Chapter in Bioinformatics: Managing Scientific Data, Morgan Kaufmann Publishers (Elsevier Science), 2003, pp. 277-301. See also Genesis.

8. Markowitz, V.M., Korzeniewski, F., Palaniappan, K., Szeto, E., Ivanova, N., and Kyrpides, N. The Integrated Microbial Genomes (IMG) System: A Case Study in Biological Data Management, to appear in Proc. of the 31st Int. Conference on Very Large Data Bases (VLDB), 2005.

9. Osterman, A., and Overbeek, R. 2003. Missing Genes in Metabolic Pathways: A Comparative Genomic Approach, Current Opinion in Chemical Biology, 7: 238-251.

10. Overbeek, R., et al. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction, Nucleic Acids Research, 28(1):123-125, 2000.

11. Overbeek, R., et al. The ERGO Genome Analysis and Discovery System. Nucleic Acids Research 31, 164-171, 2003. See also ERGO.

12. Overbeek, R. The Project to Annotate the First 1000 Sequenced Genomes, Develop Detailed Metabolic Reconstructions, and Construct the Corresponding Stoichiometric Matrices.

13. Pruitt, K.D., Tatusova, T., and Maglott, D.R. NCBI Reference Sequence (RefSeq): A Curated Non-redundant Sequence Database of Genomes, Transcripts, and Proteins, Nucleic Acids Research 33, D501-D504, 2005. See RefSeq.

14. Rosetta Resolver System.

15. The SEED: an Annotation/Analysis Tool Provided by FIG.

16. Bioinformatics: Managing Scientific Data, Lacroix, Z. and Critchlow, T. (eds), Morgan Kaufmann Publishers (Elsevier Science), 2003.

17. Uchiyama, I. MBGD: Microbial Genome Database for Comparative Analysis, Nucleic Acids Research 31, 58-62, 2003.

18. PUMA 2.

19. Virtual Institute for Microbial Stress and Survival, Microbes Online.