BMC Struct Biol. 2004; 4: 3.
Published online 2004 February 27. doi: 10.1186/1472-6807-4-3.
PMCID: PMC395836
COMe: the ontology of bioinorganic proteins
Kirill Degtyarenko1,* and Sergio Contrino1
1European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambs CB10 1SD, United Kingdom
*Corresponding author.
Kirill Degtyarenko: kirill/at/ebi.ac.uk; Sergio Contrino: contrino/at/ebi.ac.uk
Received September 18, 2003; Accepted February 27, 2004.
Abstract
Background
Many characterised proteins contain metal ions, small organic molecules or modified residues. In contrast, the huge amount of data generated by genome projects consists exclusively of sequences with almost no annotation. One of the goals of the structural genomics initiative is to provide representative three-dimensional (3-D) structures for as many protein/domain folds as possible to allow successful homology modelling. However, important functional features such as metal co-ordination or a type of prosthetic group are not always conserved in homologous proteins. So far, the problem of correct annotation of bioinorganic proteins has been largely ignored by the bioinformatics community and information on bioinorganic centres obtained by methods other than crystallography or NMR is only available in literature databases.
Results
COMe (Co-Ordination of Metals) represents the ontology for bioinorganic and other small molecule centres in complex proteins. COMe consists of three types of entities: 'bioinorganic motif' (BIM), 'molecule' (MOL), and 'complex proteins' (PRX), with each entity being assigned a unique identifier. A BIM consists of at least one centre (metal atom, inorganic cluster, organic molecule) and two or more endogenous and/or exogenous ligands. BIMs are represented as one-dimensional (1-D) strings and 2-D diagrams. A MOL entity represents a 'small molecule' which, when in complex with one or more polypeptides, forms a functional protein. The PRX entities refer to the functional proteins as well as to separate protein domains and subunits. The complex proteins in COMe are subdivided into three categories: (i) metalloproteins, (ii) organic prosthetic group proteins and (iii) modified amino acid proteins. The data are currently stored in both XML format and a relational database and are available at http://www.ebi.ac.uk/come/.
Conclusion
COMe provides the classification of proteins according to their 'bioinorganic' features and thus is orthogonal to other classification schemes, such as those based on sequence similarity, 3-D fold, enzyme activity, or biological process. The hierarchical organisation of the controlled vocabulary allows both for annotation and querying at different levels of granularity.
Background
Many characterised proteins contain metal ions, small organic molecules or modified residues. In contrast, the huge amount of data generated by genome projects consists exclusively of sequences with almost no annotation. One of the goals of the structural genomics initiative is to provide representative three-dimensional (3-D) structures for as many protein/domain folds as possible to allow successful homology modelling [1]. However, important functional features such as metal co-ordination or the type of prosthetic group are not always conserved in homologous proteins.
So far, the problem of correct annotation of bioinorganic proteins has been largely ignored by the bioinformatics community. The only comprehensive database of metal sites in proteins, Metalloprotein Database and Browser (MDB) [2], is automatically built from the structures available at the Protein Data Bank (PDB) [3]. Although crystallography is the single most informative method for studying protein structure, it has a number of limitations as far as bioinorganic chemists are concerned [4,5]. Many structures deposited at the PDB contain metal ions and molecules that are not present in native proteins. On the other hand, the information on bioinorganic/small molecule centres obtained by other (mostly spectroscopic) methods is not available to the scientific community in any form apart from literature databases.
An 'ontology' is a formal definition of concepts (such as entities and relationships) of a given area of knowledge, described in a standardised form [6]. It can be organised as a structured vocabulary in the form of a directed acyclic graph or a network in which each term may be a 'child' of one or more 'parents' [7]. In this paper, we describe COMe (Co-Ordination of Metals), the ontology for bioinorganic proteins and their features.
A previous version of this manuscript was made available before peer review at http://preprint.chemweb.com/biochem/0307002/.
Results
COMe version 4.01 contains data on 1280 'bioinorganic proteins', 470 'bioinorganic motifs' and 174 'molecules'. The data exist in two formats: as a collection of XML files and in a relational database. This relational database implementation is complete with a web-based interface and provides an easy way to navigate the ontology.
The data in COMe are gathered from the literature. Every COMe entry is manually edited and each ontological relationship is manually assigned; thus the pitfalls of automatically generated datasets are avoided (e.g. the centres containing non-native metals are not included). COMe does not aim to list all known bioinorganic proteins but rather to provide a controlled vocabulary and classification to allow better annotation of them in comprehensive databases. Representative examples (instances) of every protein family are included. Each instance has a cross-reference either to literature citation or to a publicly available database.
Data types
There are three types of entries in COMe: 'bioinorganic protein' (PRX), 'bioinorganic motif' (BIM), and 'molecule' (MOL). Here, a 'bioinorganic protein' is any complex protein, such as a metal-binding protein, an organic molecule-binding protein, a protein containing post-translational modifications, or a combination of any of these classes. Likewise, the original definition of 'bioinorganic motif' [8] is extended to include organic prosthetic groups and modified amino acids. A bioinorganic motif is now defined as a common structural feature shared by functionally related, but not necessarily homologous, proteins, and consisting of either
(i) metal atom(s) and first coordination shell ligands, linked to polypeptide-derived groups by covalent or ionic bonds;
(ii) organic molecule, linked to polypeptide-derived group(s) by covalent bond(s);
(iii) covalently modified amino acid residue(s), or
(iv) combination of any of the above.
As mentioned before, the data in COMe are derived from the literature. Thus, identification of ligands and binding mode in a BIM is based on the assessment of the authors and the curator, and not on distance thresholds as in automatically generated collections (see discussion of Example 4.1 below).
'Molecule' can form a permanent part of a complex protein, either directly (if no covalent or ionic bond is defined between the amino acid residue and the molecule; e.g. non-covalently bound FAD is a permanent part of a flavoprotein) or otherwise as a part of a BIM (e.g. covalently bound phycobilin is a permanent part of a phycobiliprotein).
Data structure
Any COMe entry (PRX, BIM or MOL) minimally includes an identifier (ID) and a term. Each entry is related to at least one other entry (see Entry relationships). In addition, external cross-references (Xref) to a number of other databases are provided (Table 1).
Table 1. On-line databases cross-referenced in COMe
Since COMe contains classes as well as individual entities, we take care to provide the most suitable cross-reference for a given level. For example, a protein homology family will be cross-referenced to the corresponding InterPro family [13]; a protein subunit to a Swiss-Prot [11] or TrEMBL [12] sequence (SPTR); a functional multisubunit enzyme to an EC number [14]; an instance of a metalloprotein in a particular state to a PDB entry [3], etc.
Protein
The protein (PRX) entity refers to the functional protein as well as to separate protein domains and subunits. A typical low-level PRX entity is shown in Table 2. A PRX entity minimally consists of an ID and a term. It may also include instances as well as MOL, BIM and other PRX entities. An instance always has a species attribute. The role of an instance is to provide external evidence that the protein in question exists in a particular organism. Currently, an instance has no separate ID, but external Xref(s) should be provided (whereas an Xref is not always available for the parent term). The other attributes of an instance are centre (for proteins containing more than one bioinorganic/small molecule centre) and state (e.g. "reduced", "CO-bound", etc.).
Table 2. Example of protein entry
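The entry layout described above can be pictured with a minimal Python sketch. It is not COMe's actual schema: the class and field names (Entry, Instance, Xref), the placeholder identifier PRX999999 and the example values are assumptions made for illustration. Only the constraints come from the text: every entry has an ID and a term, every instance has a species, and centre and state are optional. Restricting instances to PRX entries is likewise an assumption, made because instances are described only for proteins.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Xref:
    database: str    # e.g. "PDB" or "InterPro"
    accession: str


@dataclass
class Instance:
    species: str                      # always required for an instance
    xrefs: List[Xref] = field(default_factory=list)
    centre: Optional[str] = None      # only needed for multi-centre proteins
    state: Optional[str] = None       # e.g. "reduced", "CO-bound"


@dataclass
class Entry:
    id: str                           # e.g. "PRX000808", "BIM000270", "MOL000108"
    term: str
    xrefs: List[Xref] = field(default_factory=list)
    instances: List[Instance] = field(default_factory=list)

    @property
    def kind(self) -> str:
        # The three-letter prefix of the identifier gives the entry type.
        return self.id[:3]

    def __post_init__(self) -> None:
        if self.kind not in {"PRX", "BIM", "MOL"}:
            raise ValueError(f"unknown entry type in {self.id!r}")
        # Assumption: only protein entries carry instances.
        if self.instances and self.kind != "PRX":
            raise ValueError("instances are only attached to PRX entries")


# Hypothetical entry; the ID, term and species are placeholders.
example = Entry(
    id="PRX999999",
    term="hypothetical Fe2S2 protein",
    instances=[Instance(species="Homo sapiens", state="reduced")],
)
```

Keeping the instance as a nested record without its own identifier mirrors the statement that an instance currently has no separate ID.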
Molecule
MOL is an entity representing a 'small molecule' (as opposed to a macromolecule) or atom, which, in complex with one or more polypeptides, forms a functional protein. An example MOL entry is shown in Table 3. As a rule, there are no systematic names or chemical formulae. Instead, the terms in MOL entries are cross-referenced to chemical or bibliography databases (Table 1).
Table 3. Example of molecule entry
Bioinorganic motif
As mentioned before, a Bioinorganic Motif (BIM) can include both metallic and non-metallic centres. In COMe representation, every BIM consists of at least one centre and two or more ligands. The complete lists of centres, ligands and polyhedral symbols are given in the Additional Material.
Neither the existing coordination chemistry nomenclature [16] nor the IUPAC Recommendations on bioinorganic terms [17] provides a suitable guide for describing bioinorganic centres in proteins. We have developed a 1-D 'shorthand' representation for intrinsically 2-D structures such as BIMs. This qualitative representation is intuitively straightforward since it is based on similar descriptions employed by bioinorganic chemists in the literature [4]. A 1-D string has also the advantage of allowing quick text searches. In the rest of this section, we illustrate the use of this 'BIM language' with a number of examples (Table 4).
Table 4. Examples of bioinorganic motifs
In a BIM for a mononuclear centre, the central atom (metal) is written first, followed by the endogenous ligands (amino acid residues), and then the exogenous ligands, e.g. water. If the ligating atom needs to be indicated to avoid ambiguity, the symbol for this is separated from the ligand symbol by a dot, e.g. NE.His stands for the Nε atom of a His residue. This is a simplistic description that does not take into account the stereochemistry at the metal atom. The polyhedral symbol is not mandatory for it may be unknown. It also does not make sense for polynuclear metal centres (see below).
Example 4.1 in Table 4 shows a mononuclear centre found in the blue copper protein azurin. In this centre, one copper atom is coordinated by the Nδ atoms of two His residues, one mainchain oxygen derived from Gly, one Sδ atom of a Met residue and one Sγ atom of a Cys residue. The coordination geometry is trigonal bipyramidal (TBPY-5; the polyhedral symbols used are as in Table S3 in Additional file 3).
It can be observed that the Met and mainchain O ligands from the azurin copper centre are in the BIM, even if they are positioned further away than some conventional cut-off distance (e.g. 3 Å). This is because it is known from the literature that a protein family is characterised by a metal atom surrounded by specific ligands forming a recurrent structural motif, and so the groups referred to as 'ligands' are included in the BIM.
In Example 4.2, the zinc atom is tetrahedrally coordinated (T-4) by the Nδ atoms of two His residues and by two atoms of a Cys residue, the mainchain oxygen and Sγ (the k2 prefix designates didentate binding).
In dinuclear metal centres such as CuA copper centre (Example 4.3), each metal atom and its unique ligands are enclosed in braces and followed by the bridging ligand(s) designated by a μ prefix.
Some dinuclear metal proteins, notably diiron–carboxylate proteins, undergo a change of metal coordination by the carboxylate ligands upon oxidation/reduction (the so-called carboxylate shift) [18]. Therefore, two or more BIMs can be assigned to the same protein, as in methane monooxygenase hydroxylase (Examples 4.4 and 4.5 in Table 4). Note that the protonation is explicitly stated, i.e. OH2, OH- and O are different ligands in BIMs. This information is taken from the literature and not from the PDB.
There are no central atoms in polynuclear metal centres. Therefore, a cluster (such as the cubane iron-sulphur unit) takes the place of the central atom (Example 4.6).
In a BIM for a centre containing a metal–organic group complex, first comes the metal, then the organic group, the amino acid residues, and finally the exogenous ligands (e.g. CO). Examples 4.7 and 4.8 show BIMs for metal–tetrapyrrole (haem, chlorophyll, cobalamin) proteins, where Crn = corrin, por = porphyrin. Tetrapyrrole compounds are assumed always to be tetradentate. For pyranopterin-containing centres such as molybdenum cofactor [19], BIMs look like the one in Example 4.9 (dtpp = ene-dithiol pyranopterin).
Finally, for a purely organic prosthetic group such as FMN covalently attached to the polypeptide (Example 4.10), the same approach is used (except that the metal is absent and the organic group now takes the first place).
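Because a BIM is a plain 1-D string, ligand information can be pulled out with ordinary text tools. The following Python sketch illustrates this with a regular expression. It assumes, based only on the strings quoted in this paper (NE.His, (Fe2S2)(SG.Cys)2*, Zn(OE.Glu)4, Fe{k2-(OE,OE.Glu)}), that a ligand with an explicit ligating atom is written as an upper-case atom label, a dot and a three-letter residue code. The function names are hypothetical and this is not a full parser for the BIM language.

```python
import re

# Assumed token shape: ligating atom label, dot, three-letter residue code.
LIGAND = re.compile(r"\b([A-Z][A-Z0-9]*)\.([A-Z][a-z]{2})\b")


def ligand_atoms(bim: str) -> list:
    """Return (atom, residue) pairs for every 'ATOM.Res' token in a BIM string."""
    return LIGAND.findall(bim)


def has_residue(bim: str, residue: str) -> bool:
    """True if any explicitly labelled ligand in the BIM comes from the given residue."""
    return any(res == residue for _, res in ligand_atoms(bim))


print(ligand_atoms("(Fe2S2)(SG.Cys)2*"))        # [('SG', 'Cys')]
print(has_residue("Fe{k2-(OE,OE.Glu)}", "Glu"))  # True
```

Ligands written without a ligating atom, as well as centres, multiplicities and bridging prefixes, are deliberately ignored by this pattern.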
Entry relationships
The relationships between entities are not made explicit in XML, but can be deduced using a set of rules. The relational implementation, however, provides explicit relationships. Several examples in Table 5 show fragments of COMe ontology.
Table 5. Examples of entry relationships in COMe
IsA
Example 5.1 in Table 5 illustrates the IsA (child to parent) relationship (also known as the IsKindOf relationship). 'Fe2S2 protein' is a kind of 'iron-sulphur protein', which is a kind of 'iron protein', which is a kind of 'metalloprotein'. This relationship occurs between entities of the same class (PRX to PRX, BIM to BIM, MOL to MOL).
An important feature of this relationship is inheritance. For example, all proteins belonging to the 'Fe2S2 protein' class inherit the substructure (Fe2S2)(SG.Cys)2* (BIM000063).
An entity may have more than one parent. In the example (Figure 1), carbon monoxide oxidase inherits features from each subunit, viz. two different Fe2S2 clusters, molybdenum cofactor (Mo-pyranopterin complex) and FAD.
Figure 1. Fragment of ontology for carbon monoxide oxidase.
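Inheritance along IsA links, including the case of more than one parent, can be sketched in a few lines of Python. The chain from 'Fe2S2 protein' to 'metalloprotein' and the identifier BIM000063 are taken from the text. The multi-parent node, the parentless 'flavoprotein' node and the use of "FAD" as a plain feature label are hypothetical stand-ins, since the actual terms of Figure 1 are not reproduced here.

```python
# IsA edges: child -> list of parents.
IS_A = {
    "Fe2S2 protein": ["iron-sulphur protein"],
    "iron-sulphur protein": ["iron protein"],
    "iron protein": ["metalloprotein"],
    "metalloprotein": [],
    # Hypothetical nodes illustrating an entity with two parents.
    "flavoprotein": [],
    "example Fe2S2 flavoprotein": ["Fe2S2 protein", "flavoprotein"],
}

# Features attached directly to a term.
OWN_FEATURES = {
    "Fe2S2 protein": {"BIM000063"},
    "flavoprotein": {"FAD"},  # label used for illustration only
}


def ancestors(term: str) -> set:
    """All terms reachable from `term` by following IsA links upwards."""
    seen = set()
    stack = list(IS_A.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(IS_A.get(parent, []))
    return seen


def inherited_features(term: str) -> set:
    """Union of the term's own features and those of all its ancestors."""
    features = set()
    for t in {term} | ancestors(term):
        features |= OWN_FEATURES.get(t, set())
    return features
```

With these definitions, inherited_features("example Fe2S2 flavoprotein") collects features from both parents, which is the behaviour described for entities with more than one parent.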
Examples 5.2 and 5.3 show how IsA relationships are used to create ontologies of BIM and MOL entities. Note the asterisk (wildcard) in BIM000025 and MOL000041.
From Example 5.2, one can get the impression that every observed exogenous ligand can give rise to a separate BIM. However, this is not the case. Although the PDB contains numerous instances of enzyme-inhibitor complexes or substituted metalloproteins, these are not included in COMe. Since the entries are manually annotated, only biologically relevant motifs (e.g. cited or confirmed as such in the literature) are included.
IsPartOf
This relationship can occur between entities of the same class (BIM to BIM, MOL to MOL) or different classes (MOL to BIM, BIM to PRX). Example 5.4 in Table 5 illustrates the IsPartOf relationship (BIM to BIM).
In this example, each of the two mononuclear centres is part of the dinuclear metal centre. For the mononuclear centres, it is possible to indicate their coordination geometry. Note the different representation of mono- or didentate (bidentate) coordination modes: monodentate in Zn(OE.Glu)4 and didentate in Fe{k2-(OE,OE.Glu)}. On the other hand, BIM000352 is not just a sum of BIM000353 and BIM000354. Nothing in BIM000353 or BIM000354 indicates which ligands bridge the two metal atoms.
IsPartOf and IsA can be used alternatively. Multisubunit proteins illustrate the difference between the two approaches. The 'Cellular Component' part of Gene Ontology [7] contains macromolecular complexes, with the relationship 'subunit A IsPartOf complex C'. In COMe, multisubunit proteins follow the pattern 'complex C IsA subunit A'. The reasoning is that a complex inherits all the properties of its constituents, as in our example (Figure 1). To the bioinorganic chemist, it is made clear that carbon monoxide oxidase is a molybdenum iron–sulphur flavoprotein. In this respect, protein subunits are completely analogous to protein domains.
IsBoundTo
This special relationship occurs in the case MOL to PRX. It is used because the molecule which IsBoundTo a protein can be changed chemically and, strictly speaking, become a different entity. For instance (Example 5.5 in Table 5), one can say that the protein binds pyridoxal 5'-phosphate (MOL000108), but the resulting substructure has no aldehyde group and therefore is different from 'free' pyridoxal 5'-phosphate, and so MOL000108 is not part of PRX000808. However, BIM000270 IsPartOf PRX000808.
The same ligation pattern may apply to different prosthetic groups and vice versa. The combined use of both BIM and MOL entries to characterise such bioinorganic/small molecule centres and classify the bioinorganic proteins is illustrated in the following examples. In Example 5.6, bacterioferritins have different prosthetic groups but share the co-ordination mode of haem iron. Example 5.7 shows that the same prosthetic group can have different axial ligands.
It is important to stress that the data relationships in COMe are not automatically derived from any primary database. Each one is either a statement found in literature or a curator's judgement that, for example, the protein family PRX y is characterised by BIM x, with the curator assigning the logical relationship BIM x IsPartOf PRX y.
A summary of the relationships is presented in Table 6.
Table 6. Relationships in COMe
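The three relationships and the entity classes they connect can be summarised as a small validation table. The Python sketch below encodes only what is stated in the IsA, IsPartOf and IsBoundTo sections and the inverse names given in the Glossary; the names Rel, ALLOWED and is_allowed are hypothetical, and the contents of Table 6 itself may differ in detail.

```python
from enum import Enum


class Rel(Enum):
    # value = (relationship name, name of the inverse relationship)
    IS_A = ("IsA", "Includes")
    IS_PART_OF = ("IsPartOf", "Contains")
    IS_BOUND_TO = ("IsBoundTo", "Binds")


# (subject class, object class) pairs permitted for each relationship,
# as described in the text.
ALLOWED = {
    Rel.IS_A: {("PRX", "PRX"), ("BIM", "BIM"), ("MOL", "MOL")},
    Rel.IS_PART_OF: {("BIM", "BIM"), ("MOL", "MOL"), ("MOL", "BIM"), ("BIM", "PRX")},
    Rel.IS_BOUND_TO: {("MOL", "PRX")},
}


def entry_class(entry_id: str) -> str:
    """Entity class is the three-letter prefix of a COMe identifier."""
    return entry_id[:3]


def is_allowed(subject_id: str, rel: Rel, object_id: str) -> bool:
    """Check whether 'subject rel object' connects permitted entity classes."""
    return (entry_class(subject_id), entry_class(object_id)) in ALLOWED[rel]


print(is_allowed("BIM000270", Rel.IS_PART_OF, "PRX000808"))   # True
print(is_allowed("MOL000108", Rel.IS_PART_OF, "PRX000808"))   # False
print(is_allowed("MOL000108", Rel.IS_BOUND_TO, "PRX000808"))  # True
```

The second call returns False for the same reason given in Example 5.5: the bound pyridoxal 5'-phosphate is linked to the protein through IsBoundTo rather than IsPartOf.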
Search tools
COMe has a web-based query interface that utilises Java Servlets technology [20]. Several basic queries are available: by COMe identifier, by text (both case insensitive and case sensitive), and by external reference identifier. The textual searches are also used to build a set of predefined queries of general interest (by residue, by a restricted vocabulary of keywords, by chemical element, by vitamin and by enzyme cofactor). In addition, it is possible to query the database for all the possible paths through a particular entry.
Graphical representations
The 1-D 'shorthand' representation of a BIM is not always unambiguous. Work is in progress on providing every BIM and MOL with a 2-D diagram.
An active graphical map of all paths through every entry of the ontology is also available.
Conclusions
There is little interaction between genomics and bioinformatics, on the one hand, and bioinorganic chemistry, on the other. Bioinorganic protein chemistry deals with at least three types of objects: metalloproteins and other complex proteins; naturally occurring 'small molecules' which can have different functional roles (e.g. prosthetic group, substrate, inhibitor); and bioinorganic models, which are artificial mimics of protein active sites. Neither 'small molecules' nor bioinorganic models occupy a central place in bioinformatics, while in the absence of experimental evidence the features of complex proteins are assigned on the basis of sequence similarity. The situation is further exacerbated by the absence of a definitive terminology shared by scientists in these fields.
COMe (Co-Ordination of Metals) provides a manually edited ontology for bioinorganic proteins and their features. The main groups of proteins in COMe are (i) metalloproteins, (ii) organic prosthetic group proteins and (iii) modified amino acid proteins. The classification of proteins according to these features is orthogonal to other classification schemes, such as those based on sequence similarity [13], 3-D fold [21], enzyme activity [14], or biological process [7]. The organisation of the controlled vocabulary allows both for annotation and querying at different levels of granularity. The controlled vocabulary can be used for structural and functional annotation of proteins, e.g. in sequence databases. The data are currently stored in both XML format and a relational database and are available at http://www.ebi.ac.uk/come/.
An intuitive nomenclature for 1-D representation of a 2-D bioinorganic motif (BIM) has been developed. This 'shorthand' representation of a BIM is not always unambiguous (for example, no stereochemistry data at the metal centre is included), but it is useful for quick searches. In future, the nomenclature could be extended, e.g. an explicit definition of every ligand at every position of a coordination polyhedron can be given while the 'shorthand' description of the BIM could be generated 'on the fly'.
Methods
The main source of data in COMe is the literature. Every COMe entry is manually edited as a separate XML file and ontological relationships are assigned via references to other XML files. The relational implementation of COMe is built on the original XML version. The conversion utilises a SAX parser [22] and some loading scripts. The parser also builds a table of relationships between pairs of COMe entries. COMe has the typical ontological structure of a Directed Acyclic Graph (DAG) [7]. In practice this means that each node of the ontology (apart from the root) has one or more parents. This structure is represented in COMe with a table containing all the possible paths (from the root, the 'complex protein' entity, to all the leaves) in the ontology. This table is filled by a program that reads the table of relationships between the pairs of entities generated by the parser.
First, all the leaves (nodes without children) are selected, then the DAG is explored starting from each leaf and ascending to the root. If there is branching in the graph, the partial 'common' path is stored and the program will in turn explore all the branches, and so on. This structure allows a quick retrieval of the path information. It is also used to create an active graphical map for each path in the DAG allowing easy navigation through COMe. The maps are built with the GraphViz package [23].
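The leaf-to-root traversal described above can be sketched as follows. This is an illustrative Python reconstruction of the procedure as stated in the text, not the program used to fill the COMe path table; the function name all_paths and the dictionary representation of the relationship table (child mapped to its list of parents) are assumptions.

```python
def all_paths(parents: dict) -> list:
    """Enumerate every root-to-leaf path in a DAG given as child -> [parents]."""
    # A leaf is a node that never appears as anybody's parent.
    has_children = {p for ps in parents.values() for p in ps}
    leaves = sorted(n for n in parents if n not in has_children)

    paths = []

    def climb(node: str, partial: list) -> None:
        # `partial` is the common path walked so far, leaf first.
        partial = partial + [node]
        ups = parents.get(node, [])
        if not ups:
            # Reached the root: store the path root first.
            paths.append(partial[::-1])
            return
        # On branching, explore each parent in turn from the shared partial path.
        for up in ups:
            climb(up, partial)

    for leaf in leaves:
        climb(leaf, [])
    return paths


# Fragment based on terms from the text, rooted at 'complex protein'.
ontology = {
    "complex protein": [],
    "metalloprotein": ["complex protein"],
    "iron protein": ["metalloprotein"],
    "iron-sulphur protein": ["iron protein"],
    "Fe2S2 protein": ["iron-sulphur protein"],
}

for path in all_paths(ontology):
    print(" > ".join(path))
```

Precomputing and storing every path trades storage for lookup speed, which matches the stated aim of quick retrieval of path information.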
List of abbreviations
1-D, one-dimensional
2-D, two-dimensional
3-D, three-dimensional
BIM, bioinorganic motif
DAG, Directed Acyclic Graph
MDB, Metalloprotein Database and Browser
NMR, nuclear magnetic resonance
PDB, Protein Data Bank
SPTR, Swiss-Prot/TrEMBL database
XML, Extensible Markup Language
Xref, cross-reference
Authors' contributions
KD: concept, research and data curation. SC: all software and database implementation. Both authors read and approved the final manuscript.
Glossary
Bridging ligand, atom or chemical group linking two or more different metal atoms in polynuclear centres; indicated by the symbol μ.
Corrin, a macrocycle containing four pyrrole rings. It differs from porphyrin in that one of the single carbon bridges is replaced by a direct C–C bond. Naturally occurring complexes of corrin derivatives (corrinoids) with cobalt include cobalamin (vitamin B12).
Coordination geometry, arrangement of the ligands around the central atom.
Coordination shell (first coordination shell), the collective name for the ligands surrounding the central atom(s).
Didentate (bidentate), containing two binding sites for a single metal atom.
Diiron-carboxylate proteins, a group of proteins characterised by a dinuclear iron centre bridged by carboxylate group(s) of Asp or Glu and oxide/hydroxide group(s).
Dinuclear (binuclear), containing two metal atoms within a single coordination shell.
Directed Acyclic Graph (DAG), a graph with one-way edges where no path starts and ends at the same node.
Endogenous, polypeptide-derived.
Enzyme, a protein catalyst.
Exogenous, not derived from a polypeptide.
Haem, an iron-porphyrin complex.
Homology, common evolutionary ancestry.
IsA (IsKindOf), semantic relationship of subsumption. If term A IsA term B, then A has a more specific meaning. The inverse relationship is Includes.
IsBoundTo, relationship between 'small molecule' A and macromolecule B in functional complex. Since both molecules may change chemically upon complex formation, IsBoundTo is not identical with IsPartOf. The inverse relationship is Binds.
IsPartOf, part/whole semantic relationship. The inverse relationship is Contains.
Ligand (in coordination chemistry), one of the atoms or chemical groups bound to the metal atom, usually by the donation of a lone-pair of electrons.
Molybdenum cofactor (Moco), the metal (Mo or W) complex of pyranopterin. Moco functions as the prosthetic group of a number of oxidoreductases.
Monodentate, containing a single binding site for a metal atom.
Mononuclear, containing one metal atom within a coordination shell.
Polydentate, containing two or more binding sites for a single metal atom.
Polynuclear, containing two or more metal atoms within a single coordination shell.
Porphyrin, a macrocycle containing four pyrrole rings each linked by single carbon atom bridges. Naturally occurring porphyrins form tight complexes with metal ions, such as Fe (haems), Mg (chlorophylls) and Ni (F430).
Prosthetic group, a non-polypeptide compound that conveys specific biological function to a protein. Single metal ions, inorganic compounds, organic compounds and metal-organic complexes all may function as prosthetic groups.
Supplementary Material
Additional File 1
Table S1. Ligands in bioinorganic motifs.
Additional File 2
Table S2. Inorganic and organic centres in bioinorganic motifs.
Additional File 3
Table S3. Coordination polyhedra.
Acknowledgements
We thank Gillian Adams, Marcus Ennis and John S. Garavelli for their helpful comments and suggestions on the manuscript.
References