IMG: Integrated Microbial Genomes

IMG Terms, Pathways, and Parts Lists

Rationale

Gene annotation in IMG characterizes genes in terms of their predicted functional roles, such as COG [3] annotations or EC numbers, which are further defined by their association with pathways or functional categories, such as COG functional categories or KEGG pathways [2]. These gene->function->pathway associations connect the phenotype of an organism to the genes encoded in its genome, thus serving as a framework for predicting the phenotype of an organism based on its genome sequence ("metabolic reconstruction"). Depending on how these associations are defined, interpretation of a genome sequence in terms of the metabolic capabilities of an organism can be more or less successful and can require more or less sophisticated tools. KEGG and other public pathway repositories (e. g., MetaCyc [1]) provide valuable sources of functional roles and associated functional categories for IMG. However, none of them provides a framework fully compatible with the concept of a functional profile, which is the basis of most comparative tools in IMG (see AboutIMG Data Analysis).

For instance, each KEGG map comprises a network of reactions catalyzed by enzymes encoded by the genomes and interconnected through shared compounds; KEGG maps are constructed on the basis of all available biochemical knowledge rather than organism-specific information. Thus KEGG maps provide an organism-independent overview of metabolism and may be applied to any sequenced genome. However, while reactions in KEGG maps are chemically related, they may serve very different physiological purposes, such as pathways for biosynthesis and degradation of the same compound; as a result, certain reactions within the same map rarely co-occur in the same organism under the same physiological conditions. Due to this structure of KEGG maps, their profiles are typically characterized by the presence of some, but never all, functions on every map in almost every genome, and a user has to decide whether the differences between the genomes are physiologically significant and whether they reflect differences in biosynthesis or degradation of certain compounds, or both. Thus the interpretation of KEGG map profiles requires either significant knowledge of biochemistry or the development of very sophisticated tools that would compare not only the presence or absence of enzymatic functions, but also their linkage within the map, such as the distribution of enzymes catalyzing consecutive reactions, analysis of branching points within the map, possible directions of metabolic fluxes, etc. In addition, such tools have to take into account the connections of each map with other maps, since some compounds serve as intermediates in multiple pathways and appear on multiple KEGG maps.

Another functional hierarchy widely used to infer the metabolic capabilities of newly sequenced genomes, COG Functional Categories and Pathways, groups together functions that are "related" according to common biochemical knowledge, although no evidence of this relatedness is recorded. As a result, many COG Pathways include only those COGs that participate in a certain biochemical pathway, without assigning them to specific pathway implementations, pathway branches, etc. Thus, COG Pathways are also organism-independent and can be applied to any genome. However, many COG Categories contain COGs that are rather loosely associated with a certain biological activity, and their presence in or absence from the COG pathway profile, if regarded as an indication of the presence or absence of a certain metabolic pathway, should also be interpreted with caution.

In contrast, MetaCyc consists of very concise pathways that are organized according to their physiological role in an organism. For instance, pathways for biosynthesis of some compound are grouped together and are always separated from pathways for degradation of the same compound. However, these pathways are also organism-specific, i.e. they describe metabolism of compounds as it occurs in certain model organisms, such as Escherichia coli, Bacillus subtilis and Pseudomonas aeruginosa. Therefore many MetaCyc pathways are represented by multiple versions that are essentially the same but differ by one or two reactions (e. g., multiple pathways for lysine and methionine biosynthesis). In many cases, due to the remarkable diversity and variability of microbial biochemistry and physiology, these pathways (as described in model organisms) are not directly applicable to other microbes, especially those phylogenetically distant from the few model bacteria. Although MetaCyc pathways are based on a different paradigm than the KEGG maps, their profiles in many cases look similar to the KEGG map profiles, with every genome having some, but not all, enzymatic functions for every pathway variant. As a result, their interpretation with regard to the presence or absence of pathways and their variants in an organism also requires careful manual inspection and good knowledge of biochemistry.

Consequently, our ability to assign functional roles to the genes in microbial genomes and to infer the physiology of newly sequenced microorganisms based on the sets of functions encoded in their genomes is limited by the structure of pathways and functional hierarchies implemented in public repositories. The set of Analysis Carts in IMG, including Gene, COG, Pfam and Enzyme Carts, helps to some extent to overcome these limitations by allowing users to combine individual functions, as represented by ortholog clusters, COGs, Pfams or genes annotated with the same EC number, to create their own groupings of functions, irrespective of their placement in the pathways and functional hierarchies provided by various pathway repositories.

While providing the necessary flexibility, this approach also has certain shortcomings, such as dependence on sequence similarity-based protein clusters (Pfam families, COG groups or ortholog clusters defined as clusters of bidirectional best hits). There is no one-to-one correspondence between functions and homology-based clusters: some protein families include proteins with somewhat similar but distinct enzymatic activities (e. g. pfam01066 -- CDP-alcohol phosphatidyltransferase) and some functions are encoded by genes with very little sequence similarity (e. g. the large subunit of archaeal DNA primase). As a result, interpretation of profiles of objects in the Analysis Carts may require good knowledge of biochemistry and extensive experience in genome analysis, especially when the distribution of higher level functional categories (e. g. presence or absence of multiple pathway variants) is analyzed. The only alternative to the similarity-based protein clusters has been represented in IMG by the Enzyme Cart, which allows analysis of the phylogenetic distribution of enzymatic functions irrespective of their implementation as the specific protein families. However, the Enzyme Cart does not capture any non-enzymatic functions; moreover, it is impossible to analyze the distribution of many enzymatic functions, because the Enzyme Nomenclature lags behind significantly.

Examples

To illustrate the requirements for a functional hierarchy or pathway repository designed to support accurate functional annotation of genes and confident reconstruction of the metabolic capabilities of organisms, let us consider an example. The example is an abstract view of the protoheme biosynthesis pathway as implemented in four different bacteria: Escherichia coli, Bacillus subtilis, Brucella melitensis and Rhodobacter sphaeroides. Note that the variability of protoheme biosynthesis is not limited to the four variants depicted below: there are other bacteria that utilize different combinations of 5-aminolevulinate biosynthesis from L-glutamate (reactions R1 and R2) and from glycine (reaction R3) with anaerobic or aerobic coproporphyrinogen oxidation (reactions R8 and R9, respectively); there is an additional reaction of anaerobic protoporphyrinogen oxidation (not shown here, alternative to R11); there is an alternative pathway of heme biosynthesis in Archaea which branches at the level of uroporphyrinogen III (compound H) ("ancient pathway", not shown here). Finally, some of the intermediates in this pathway, including uroporphyrinogen III (compound H) and protoporphyrin IX (compound L), are used in biosynthesis of other cofactors, such as siroheme, cobalamin, chlorophylls a and b and bacteriochlorophyll (not shown here). Thus, if we attempt to represent each variant of protoheme synthesis as a separate pathway, the number of alternative pathways (and the number of alternative pathways for biosynthesis of the other porphyrin cofactors mentioned above) will be so large that it will be nearly impossible to use them for genome comparison.

Protoheme Biosynthesis Variants

An alternative approach would be to combine all these variants of protoheme synthesis into one generalized map, which should probably also include all versions of the biosynthetic pathways for other porphyrin cofactors to avoid redundancy, and to analyze the presence or absence of elementary reactions. However, this approach does not take into account the fact that many versions of this pathway share common invariant blocks that are larger than a single reaction and are connected at the branching points (e. g., biosynthesis of 5-aminolevulinate from L-glutamate represented by reactions R1 and R2, and biosynthesis of uroporphyrinogen III from 5-aminolevulinate represented by reactions R4-R6). These invariant blocks are defined by the underlying chemical transformations of the "main" substrates and products (i. e. those compounds that do not appear in many reactions, where they serve as coenzymes and cofactors) and represent sequences of reactions that, according to our current biochemical knowledge, should always co-occur in an organism. Otherwise, some "hanging" metabolites would be left that are either produced but not utilized, or utilized but neither produced nor transported to/from the medium. Moreover, in many cases the backbone of the reaction (i. e. "main" substrates and products) remains the same, while the coenzyme changes (e. g., NAD-dependent oxidation vs. NADP-dependent). In other cases the entire reaction (i. e. chemical transformation of reactants as described in the reaction equation) remains the same, but the catalyst (and sometimes the reaction mechanism) may change (e. g., Mn-, Fe- or Cu/Zn-dependent superoxide dismutases). In both cases, these reactions can be included as part of invariant blocks of reactions either as alternative reactions or as the same reaction with alternative catalysts.
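The idea of "hanging" metabolites can be illustrated with a short sketch. It is not part of IMG; the reaction identifiers (Ra, Rb, Rc) and compound letters (A to D) are hypothetical placeholders rather than the labels used in the figures, and only "main" substrates and products are listed for each reaction.

```python
def hanging_metabolites(reactions, external):
    """Find compounds that break an otherwise closed block of reactions.

    reactions: dict of reaction id -> (set of main substrates, set of main products)
    external:  compounds assumed to enter or leave the block
               (taken up from the medium or shared with other pathways)
    Returns (produced_but_not_utilized, utilized_but_not_produced).
    """
    produced, consumed = set(), set()
    for substrates, products in reactions.values():
        consumed |= set(substrates)
        produced |= set(products)
    boundary = set(external)
    return produced - consumed - boundary, consumed - produced - boundary


# Hypothetical invariant block: A -> B -> C -> D
block = {
    "Ra": ({"A"}, {"B"}),
    "Rb": ({"B"}, {"C"}),
    "Rc": ({"C"}, {"D"}),
}
complete = hanging_metabolites(block, external={"A", "D"})

# The same block with the middle reaction missing
partial = {rid: rxn for rid, rxn in block.items() if rid != "Rb"}
broken = hanging_metabolites(partial, external={"A", "D"})

print(complete)  # no hanging metabolites
print(broken)    # B is produced but not utilized, C is utilized but not produced
```

With the middle reaction removed, compound B has no consumer and compound C has no producer, which is why the three reactions are expected to co-occur and can be treated as one block.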

Protoheme Biosynthesis Generalized

Such invariant reaction blocks are much more convenient objects for genome comparison than monolithic organism-specific pathways or large maps composed of individual reactions. They naturally fit into the concept of a functional profile, as their presence or absence can be established using a relatively simple set of rules. In addition, the information that certain enzymatic reactions are always expected to co-occur in any organism is of great biological importance, as it points to a much stronger functional linkage between certain genes, and this tighter functional association could be reflected in other genomic features, such as conserved chromosomal neighborhoods or similarity of evolutionary history. Furthermore, this information may be used for estimation of the completeness of metagenomic sequence data and for inference of the metabolic capabilities of organisms based on incomplete sequence data. Thus, for the purpose of genome comparison it seems more advantageous to compose organism-independent (generic, composite) pathways or maps not of individual reactions, but of such invariant blocks of reactions (IMG Pathways), as illustrated in the figure below.

Note that this framework can also be used to model protein-protein interaction (or non-metabolic) pathways, which represent a challenge to the traditional repositories of microbial pathways. For example, consider the process of bacterial DNA replication depicted below. Similar to metabolic pathways, it can be split into elementary "reactions", although in this case reactions may or may not involve any chemical transformations. Similar to metabolic pathways, a "catalyst" can be assigned to these "reactions", which may catalyze a traditional chemical conversion, such as DNA polymerization, or a conformational change of a certain macromolecule (e. g., DNA supercoiling or unwinding). It may also serve as a chaperone or "catalyze" the assembly of an active form of a certain enzyme or protein complex (e. g., loading of the replicative DNA helicase). Note that, similarly to metabolic pathways, alternative reactions and alternative catalysts may also exist in interaction pathways: for instance, due to differences in growth conditions (temperature, salinity or pH), assembly of the same protein complex may require different chaperones in different bacteria. Elementary non-metabolic "reactions" can be grouped together into invariant blocks that would be equivalent to metabolic pathways, and these invariant blocks can be further assembled into higher-level structures.

Bacterial DNA Replication

The example of the DNA replication pathway brings to our attention another important requirement for a "genome-oriented" pathway repository (as opposed to a "flux analysis-oriented" or "protein cluster-oriented" classification), namely the representation of functional roles and their connection to gene products. As discussed above, the nomenclature of functional roles cannot be based exclusively on protein clusters and families, since there is no one-to-one relation between protein clusters and functional roles. Similarly, the EC numbers assigned by the Enzyme Nomenclature would fail to capture any non-enzymatic functions, as well as the complexity of some multi-subunit enzymatic complexes. Thus, an optimal nomenclature of functional roles would be multi-tiered, whereby the lower-level functional roles are assigned directly to genes as gene (protein) products, while the higher-level functional roles are not associated directly with any genes, but rather depict protein complexes and proteins that were modified in some other way (e. g., by covalent or non-covalent attachment of a cofactor, proteolysis, etc.). Lower-level functional roles can be associated with the higher-level functional roles to represent a process of spontaneous assembly of a protein complex; if this process is not spontaneous, but driven by some other proteins or protein complexes, it can be represented as a pathway of assembly of protein complexes or protein modification. Some of the higher-level and lower-level functional roles corresponding to the mature enzymes can then be mapped onto the Enzyme Nomenclature. Finally, the lower-level functional roles should be associated directly with genes rather than with protein clusters or families of any kind (COGs, Pfams, TIGRfams, etc.), although the information about protein clusters and families should be taken into account while making such associations.

In summary, a functional hierarchy or pathway repository designed to support the efficient functional annotation of genes, genome comparison and confident reconstruction of the metabolic capabilities of the organisms should satisfy the following requirements:

  1. Functional classification should be organism-independent, i.e. rather than recording separate organism-specific pathways, it should combine all known versions of a pathway.
  2. Wherever possible it should record evidence of functional connection between genes; such evidence may include biochemical reactions catalyzed by the protein products of the genes, protein-protein interactions in which protein products participate, genetic evidence, etc.
  3. Pathways and/or functional categories should take into account the presence of certain invariant blocks of metabolic reactions and protein-protein interactions and such invariant blocks should represent the lower level of the functional hierarchy; functional categories of higher level corresponding to various combinations of the invariant blocks should be organized according to their physiological role.
  4. Functional roles in pathways and functional categories should not be equivalent to the EC numbers in the Enzyme Nomenclature or to the names of protein clusters and families; they should distinguish between the gene/protein products, as they are produced in the process of transcription/translation, and the protein complexes or modified proteins that have certain enzymatic or non-enzymatic functions.

IMG Terms, Pathways, Parts Lists, and Networks

We have addressed the requirements discussed above by introducing an IMG-specific collection of generic (protein cluster-independent) functions, called IMG Terms, and generic (organism-independent) functional hierarchies, called IMG Pathways, which are composed of IMG Reactions.

IMG Terms, Pathways, Parts Lists, and Networks

IMG Terms are defined by domain experts (JGI's Genome Biology Program scientists) as part of the process of recording IMG Pathways into the system. IMG Terms form a hierarchy, whereby the leaves of this hierarchy consist of functional roles for gene products (protein product descriptions) assigned to individual genes. These lower-level IMG Terms of the type "Gene Product" can be directly associated with reactions, whereby they become either "Catalysts" or "Reactants". Alternatively, they can be assigned as "children" to IMG Terms of the type "Protein Complex", thus indicating that they constitute subunits of a multi-subunit protein complex. IMG Terms of the type "Protein Complex" can become "children" of another "Protein Complex" Term, which specifies the formation of a larger protein complex from several subassemblies. One more type of IMG Term, "Modified Protein", is used to represent a covalently modified protein, such as a phosphorylated or uridylylated protein or an exported protein after the signal peptide has been cleaved. The "Modified Protein" Term can become a "Catalyst" in a reaction or can become a "child" of another IMG Term of the type "Modified Protein" or "Protein Complex", thereby describing the process of consecutive covalent modifications and protein complex formation (e. g., insertion of molybdenum cofactor and iron-sulfur clusters into the apoprotein of the catalytic subunit of the periplasmic nitrate reductase followed by the assembly of a mature enzyme). This hierarchy of IMG Terms captures many biologically important associations between genes and functions, such as the difference between single-subunit and multi-subunit implementations of the same enzyme that are assigned the same EC number but have drastically different physiological roles (e. g., multi-subunit replicative DNA polymerase and single-subunit low-fidelity DNA polymerase participating in mutagenic DNA repair), or the process of enzyme maturation and regulation of activity through covalent and non-covalent modifications. This hierarchical structure also allows protein products to be tracked back, through all chemical and physical transformations and modifications, to the original genes that are used in comparing term and pathway profiles. Lower-level IMG Terms ("Gene Product" type) are associated with the genes in IMG, thus providing a product description for the genes and CDSs.
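The term hierarchy described above can be sketched as a small tree structure. This is only an illustration and does not reflect the actual IMG schema: the class `Term`, the helper `gene_products` and all term names are hypothetical. The sketch also assumes that a "Modified Protein" term lists the term it was derived from as its child, which the text implies but does not state explicitly.

```python
from dataclasses import dataclass, field

# The three term types named in the text
TERM_TYPES = {"Gene Product", "Protein Complex", "Modified Protein"}


@dataclass
class Term:
    name: str
    term_type: str
    children: list = field(default_factory=list)

    def __post_init__(self):
        if self.term_type not in TERM_TYPES:
            raise ValueError(f"unknown term type: {self.term_type}")
        # Gene Product terms are the leaves of the hierarchy
        if self.term_type == "Gene Product" and self.children:
            raise ValueError("a Gene Product term cannot have children")


def gene_products(term):
    """Trace a term down to the Gene Product leaves it is built from."""
    if term.term_type == "Gene Product":
        return {term.name}
    found = set()
    for child in term.children:
        found |= gene_products(child)
    return found


# Hypothetical two-subunit enzyme whose first subunit receives a cofactor
# before the mature complex is assembled.
apo = Term("subunit A apoprotein", "Gene Product")
holo = Term("subunit A with cofactor", "Modified Protein", [apo])
sub_b = Term("subunit B", "Gene Product")
mature = Term("mature enzyme", "Protein Complex", [holo, sub_b])

print(sorted(gene_products(mature)))
```

Walking from the higher-level term down to its leaves is what allows a complex or a modified protein to be traced back to the genes that carry the corresponding "Gene Product" terms.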

An IMG Pathway consists of a sequence of related IMG Reactions that are linked through common compounds (metabolic pathways) or through common macromolecular complexes (non-metabolic or protein-protein interaction pathways), so that the end product of a reaction serves as the substrate for the succeeding reaction in the pathway. IMG Pathways correspond to the sets of reactions on the metabolic or protein-protein interaction networks connecting two branching points (i. e., the substrate of the first reaction can be produced in more than one reaction and the product of the last reaction can serve as a substrate in more than one reaction). Reactions in the pathways are numbered according to the order of chemical or physical transformation. IMG Pathways may include alternative reactions (those with the same "main" substrate and product, but different coenzymes) and alternative catalysts (i. e. enzymes with a different reaction mechanism, cofactor, subunit composition, etc. catalyzing the same reaction); alternative reactions are assigned the same number in the pathway. The presence or absence of an IMG Pathway in an organism is established based on the presence of all gene products associated with all reactions in the pathway: the pathway is considered to be present only when all of them are present. However, in the case of alternative reactions and alternative catalysts, the presence of all gene products associated with only one of the alternative reactions, or of all gene products associated with just one alternative catalyst, is sufficient.
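The presence rule described above can be written down as a short sketch. The data layout is an assumption made for illustration, not the IMG implementation: a pathway is a dict from step number to a list of alternative reactions, each reaction is a list of alternative catalysts, and each catalyst is the set of "Gene Product" terms it requires. All term names (t1, t2a, ...) are hypothetical.

```python
def catalyst_present(required_terms, genome_terms):
    # A catalyst counts as present only if all of its gene products are found.
    return set(required_terms) <= genome_terms


def reaction_present(alternative_catalysts, genome_terms):
    # One complete alternative catalyst is sufficient for a reaction.
    return any(catalyst_present(c, genome_terms) for c in alternative_catalysts)


def pathway_present(pathway, genome_terms):
    # Every numbered step must be covered; alternative reactions share a
    # step number, so one present reaction per step is sufficient.
    return all(
        any(reaction_present(rxn, genome_terms) for rxn in alternatives)
        for alternatives in pathway.values()
    )


pathway = {
    1: [[{"t1"}]],                      # single reaction, single catalyst
    2: [[{"t2a", "t2b"}], [{"t3"}]],    # two alternative reactions
    3: [[{"t4"}, {"t5"}]],              # one reaction, two alternative catalysts
}

genome_x = {"t1", "t3", "t5"}
genome_y = {"t1", "t2a", "t4"}          # step 2 is incomplete: t2b and t3 are missing

print(pathway_present(pathway, genome_x))
print(pathway_present(pathway, genome_y))
```

Because each pathway reduces to a single yes/no value per genome, the result fits directly into a presence/absence profile across genomes.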

IMG Pathways are linked together through common metabolites or macromolecular complexes to form IMG Networks. IMG Networks correspond to the fragments of a metabolic map that are known to perform a certain physiological role: for instance, either of the two pathways of L-homocysteine synthesis can be combined with either of the L-homocysteine methylation pathways to produce L-methionine. Thus, these pathways, connected via the common intermediate L-homocysteine, are combined into an IMG Network of "L-methionine synthesis". IMG Networks can be linked to higher-level IMG Networks based, again, on a common physiological role. For example, all networks for amino acid biosynthesis can be combined into a higher-level network, "Synthesis of L-amino acids and glycine", which in turn becomes part of another network, "Amino acid synthesis".
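How pathway variants combine within a network can be sketched as follows. The layout is hypothetical: a network is modeled as an ordered list of stages, each holding the alternative pathways that produce or consume the shared intermediate. The pathway names are placeholders, and the assumption of exactly two methylation variants is made only for illustration, since the text does not give their number. The rule used in `present_routes` (one present pathway per stage) is likewise one plausible reading, not a rule stated by IMG.

```python
from itertools import product


def routes(network):
    """Enumerate every combination of one pathway per stage."""
    return [tuple(combo) for combo in product(*network)]


def present_routes(network, present_pathways):
    """Keep only the routes whose pathways were all found in a genome."""
    return [
        route for route in routes(network)
        if all(pathway in present_pathways for pathway in route)
    ]


# Two stages joined through the common intermediate, L-homocysteine
methionine_synthesis = [
    ["homocysteine synthesis I", "homocysteine synthesis II"],
    ["homocysteine methylation I", "homocysteine methylation II"],
]

all_routes = routes(methionine_synthesis)
found = present_routes(
    methionine_synthesis,
    {"homocysteine synthesis II", "homocysteine methylation I"},
)

print(len(all_routes))
print(found)
```

Recording two pathways per stage describes four complete routes without storing each route as a separate organism-specific pathway.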

Not all meaningful biological groups of functions fit into the pathway concept, where an ordered sequence of reactions is associated with the appropriate IMG Terms. To handle these cases we have created IMG Parts Lists, which consist of a list of IMG Terms. In some biological processes, the sequence of events is unknown, or the steps may not necessarily occur in a particular order. An example of this is the IMG Parts List "Bacterial ribosome biogenesis" which contains several nucleases, acetyltransferases, and pseudouridine synthases. It is currently unknown if the order of modification is important for this process. In other cases it is useful to have a catalog of related functions that are not involved together in a pathway but can be grouped together in a meaningful way. For example the IMG Parts List "Sigma Factors" allows the user to profile all sigma factors in an organism or group of organisms.
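Since a Parts List is an unordered collection of IMG Terms, profiling it across genomes amounts to a set intersection. The sketch below is a minimal illustration under that assumption; the function name, term names and genome names are hypothetical and do not come from IMG.

```python
def parts_list_profile(parts_list, genomes):
    """Map each genome name to the parts-list terms found in that genome."""
    wanted = set(parts_list)
    return {
        name: sorted(wanted & set(terms))
        for name, terms in genomes.items()
    }


# Hypothetical parts list and genome annotations
sigma_factors = ["sigma factor A", "sigma factor B", "sigma factor C"]
genomes = {
    "genome_1": {"sigma factor A", "sigma factor C", "unrelated term"},
    "genome_2": {"sigma factor A"},
}

profile = parts_list_profile(sigma_factors, genomes)
print(profile)
```

Unlike a pathway, a parts list has no all-or-nothing presence rule in this sketch; the profile simply reports which members occur in each genome.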

References

[1] Caspi et al. 2006. MetaCyc: A Multiorganism Database of Metabolic Pathways and Enzymes. Nucleic Acids Research 34, D511-D516. http://metacyc.org/.

[2] Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. 2004. The KEGG Resource for Deciphering the Genome. Nucleic Acids Research 32, D277-D280.

[3] Tatusov, R.L., Koonin, E.V., and Lipman, D.J. 1997. A Genomic Perspective on Protein Families. Science 278, 631-637.