Bioinformatics Primer

An Introduction to the NCI-Nature Pathway Interaction Database

Pathway Interaction Database (09 November 2006) | doi:10.1038/PID.2006.001

Designed for biologists and bioinformaticians, the Pathway Interaction Database provides high-quality information about signalling pathways in human cells. It also provides a set of user-friendly tools to allow the pathways to be explored, visualized, and mined.

Signalling pathway diagrams published in textbooks, research articles, and on an increasing number of web sites are useful for organizing knowledge and increasing our understanding of biological processes. Despite their simplicity and abstraction, their ubiquity and persistence attest to their value. Nevertheless, their limitations are obvious. The boundaries of individual pathways are often defined arbitrarily, and there is often redundancy among predefined pathways as sequences of connected interactions may be common to multiple processes. Furthermore, a diagram may represent individual pathway steps ambiguously, failing to differentiate necessary from sufficient causes or inputs from outputs. Perhaps most critically, the pictorial representation may not be accompanied by a computable representation that allows users to search for network connectivity.

Despite these limitations, collections of pathway diagrams represent sizeable bodies of knowledge. The National Cancer Institute (NCI) has embarked on a project to devise a simple, unambiguous, computable representation of pathway data and to translate existing collections of pathway data into this representation. The vision was not to create a single, authoritative repository of pathway data, but to create a database that would support novel approaches to analyzing pathways and, ultimately, facilitate the identification of candidate molecular targets for cancer therapies. The NCI also sought to create a data representation that could handle the whole range of cellular processes (metabolic, signalling, and regulatory processes), support the dynamic composition of pathway networks from an underlying database of interactions, and support queries based on connectivity. For example, the representation should make it possible to ask questions such as: What downstream interactions could be affected, directly or indirectly, by a mutation in a particular protein or by a change in the abundance of a particular protein? How many parallel, independent pathways are known to lead to the same event (e.g. activation of a particular protein)? What anomalies (mutation, increased expression, decreased expression) might theoretically result in a failure of the DNA repair mechanism, and might these same anomalies disrupt other processes? The result of this project is the Pathway Interaction Database (PID), a free resource intended to accelerate the pace of discovery and facilitate the understanding of cellular networks.

Data Representation

In the Pathway Interaction Database a biochemical process is modeled as a directed graph with labeled nodes and edges. A node represents a molecule or a process. An edge connects a molecule to a process and represents the role of the molecule in the process. An interaction consists of one process node and its adjacent molecule nodes. A pathway is a set of interactions; a single interaction is a minimal pathway. Not every set of interactions will form a single, connected graph. Furthermore, while the formal pathway object is defined as a set of interactions, it is clearly not the case that every set of interactions models a biologically meaningful entity. The data model provides for predefined sets of interactions corresponding to established pathways.

Molecule types

There are four molecule types: RNA, protein, complex, and compound (used broadly to include anything that is not an RNA, protein, or complex). The database has a unique internal identifier for each protein that is a participant in an interaction or complex. Thus each protein isoform has a unique identifier, as does each cleaved subunit, which is explicitly related to its precursor. Covalently modified forms of a protein, on the other hand, share the same internal molecule identifier, the modifications being represented by other means. The database also accommodates the definition of molecule families. A family is a descriptive convenience; it may be used to group different isoforms from a single gene or protein products of different genes with similar function. A molecule may be assigned any number of names and external identifiers such as Entrez Gene or UniProt identifiers.

Interaction types

There are four interaction types: reaction, binding, transcription, and translocation. ‘Reaction’ comprises protein–small molecule associations, including enzymatic processes. ‘Binding’ comprises protein–protein interactions, including protein modification and complex association/dissociation. ‘Transcription’ broadly encompasses all transcription/translation events, including nuclear export. An mRNA product is explicitly defined only when it has specific regulatory roles. ‘Translocation’ includes events resulting in a change in cellular location. All these interaction types can be reversible or irreversible. In addition to these four basic interaction types, there are also macroprocesses: unanalyzed multi-step events encapsulated in a single node and annotated with any term from the Gene Ontology (GO) Biological Process vocabulary (http://www.geneontology.org). Pathway diagrams found in the Pathway Interaction Database often contain a mix of individual interactions and unanalyzed macroprocesses in the same diagram.

Edges

There are four edge types, defining the role of a molecule in an interaction: input, agent, inhibitor, and output. Input, agent, and inhibitor are incoming edges, while output is an outgoing edge. An input is transformed by the process, whereas agents and inhibitors are regulatory participants in the process that are not themselves transformed.
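The node, edge, and interaction definitions above can be sketched in a few lines of Python. This is only an illustrative model, not the PID schema: the class names, field names, integer identifiers, and the example molecules ("kinase", "substrate") are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class MoleculeType(Enum):
    RNA = "RNA"
    PROTEIN = "protein"
    COMPLEX = "complex"
    COMPOUND = "compound"


class InteractionType(Enum):
    REACTION = "reaction"
    BINDING = "binding"
    TRANSCRIPTION = "transcription"
    TRANSLOCATION = "translocation"


class Role(Enum):
    INPUT = "input"
    AGENT = "agent"
    INHIBITOR = "inhibitor"
    OUTPUT = "output"

    @property
    def is_incoming(self) -> bool:
        # Input, agent, and inhibitor point into the process node;
        # only output points away from it.
        return self is not Role.OUTPUT


@dataclass(frozen=True)
class Molecule:
    molecule_id: int  # hypothetical internal identifier
    name: str
    mol_type: MoleculeType


@dataclass(frozen=True)
class Edge:
    molecule: Molecule
    role: Role


@dataclass
class Interaction:
    """One process node together with its adjacent molecule nodes."""
    interaction_id: int  # hypothetical identifier
    kind: InteractionType
    edges: list = field(default_factory=list)

    def molecules_with_role(self, role: Role) -> list:
        return [e.molecule for e in self.edges if e.role is role]


# Toy example: a kinase (agent) modifies a substrate (input and output).
kinase = Molecule(1, "kinase", MoleculeType.PROTEIN)
substrate = Molecule(2, "substrate", MoleculeType.PROTEIN)
modification = Interaction(
    interaction_id=1,
    kind=InteractionType.BINDING,
    edges=[
        Edge(substrate, Role.INPUT),
        Edge(kinase, Role.AGENT),
        Edge(substrate, Role.OUTPUT),
    ],
)
```

In this sketch a pathway would simply be a collection of `Interaction` objects, which matches the statement that a single interaction is a minimal pathway.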

Labels

Labels provide additional information about molecules, processes and edges. All labels are optional. Molecule labels specify location and activity state. Location labels are drawn from the GO Cellular Component vocabulary. Activity state labels on molecules can be abstract or concrete. Abstract activity labels are drawn from a set of abstract terms including ‘inactive’, ‘active’ and an open-ended sequence of terms representing subtypes of ‘active’ (‘active1’, ‘active2’, etc.). Concrete activity state labels specify physical post-translational covalent modifications on specific residues. A process may have one or more condition labels, drawn from the GO Biological Process vocabulary, to specify biological prerequisites for the process (e.g. ‘response to oxidative stress’, GO:0006979). An edge may have one or more function labels, drawn from the GO Molecular Function vocabulary. A function label assigns a specific role to an edge (e.g. ‘casein kinase activity’, GO:0004680).

Complexes and interactions

A complex consists of several components. Each component has an internal molecule identifier and may have location and activity state labels. The complex as a whole may also have location and (abstract) activity state labels. Two complexes have the same internal molecule identifier only if they have the same number of components, and if for each component in one complex there is an equivalent component in the other complex (see below for a definition of equivalent components).

A set of interactions defined in the database as a pathway can itself be used as a process node in a higher-level pathway. Thus, it is possible to define subnetworks that are involved in multiple pathways. For example, a subnetwork representing the MAPKKK cascade can be inserted as a unit into the Kit pathway since the activation of Kit results in activation of the MAPKKK cascade.

Each interaction can be annotated with one or more literature citations and one or more evidence codes. Evidence codes include the GO codes:

IC (Inferred by Curator)

IDA (Inferred from Direct Assay)

IGI (Inferred from Genetic Interaction)

IMP (Inferred from Mutant Phenotype)

IPI (Inferred from Physical Interaction)

RCA (Inferred from Reviewed Computational Analysis)

plus the following:

IAE (Inferred from Array Experiments)

IFC (Inferred from Functional Complementation)

IOS (Inferred from Other Species)

RGE (Inferred from Reporter Gene Expression).

Finally, each pathway and each interaction is associated with a data source. At present, the database has two sources of data: BioCarta pathways and NCI-Nature Curated pathways (see Contents of the database).

Equivalence

An important goal of the PID is to unambiguously indicate (as a query output) when two molecules or interactions are the same and when they are different. In the graphic representation of a pathway there will be exactly one node for each particular molecular species. This goal presupposes an explicit definition of equivalence. Two molecule nodes are equivalent if they have the same internal molecule identifier and if they have the same set of labels. Similarly, two complex components are equivalent if they have the same internal molecule identifier and the same set of labels. Two edges are equivalent if they have the same basic edge type, the same set of labels and if their attached molecules are equivalent. Two interactions are equivalent if they have the same basic process type, the same set of condition labels, the same number of edges and if for each edge in one interaction there is an equivalent edge in the other interaction. Two pathways are equivalent if for each interaction in one pathway there is an equivalent interaction in the other pathway.
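These equivalence rules translate fairly directly into code. The sketch below is a minimal illustration under stated assumptions: the class and field names are invented, labels are plain strings, edge matching is treated as a multiset comparison, and pathway equivalence is checked in both directions, which is one reading of the definition above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MoleculeNode:
    molecule_id: int                 # internal molecule identifier
    labels: frozenset = frozenset()  # location and activity-state labels


@dataclass(frozen=True)
class Edge:
    role: str                        # 'input', 'agent', 'inhibitor' or 'output'
    molecule: MoleculeNode
    labels: frozenset = frozenset()  # function labels


@dataclass(frozen=True)
class Interaction:
    process_type: str
    edges: tuple
    conditions: frozenset = frozenset()


# Frozen dataclasses compare field by field, so molecule-node and edge
# equivalence as defined in the text is simply `==` on these objects.

def interactions_equivalent(a: Interaction, b: Interaction) -> bool:
    # Same process type, same condition labels, and edges that pair up
    # one-to-one (which also forces the same number of edges).
    return (
        a.process_type == b.process_type
        and a.conditions == b.conditions
        and Counter(a.edges) == Counter(b.edges)
    )


def pathways_equivalent(p, q) -> bool:
    def covered(xs, ys):
        return all(any(interactions_equivalent(x, y) for y in ys) for x in xs)

    # Assumption: the definition is applied symmetrically.
    return covered(p, q) and covered(q, p)


inactive = MoleculeNode(1, frozenset({"inactive"}))
active = MoleculeNode(1, frozenset({"active"}))
agent = MoleculeNode(2, frozenset({"active"}))

first = Interaction("binding", (Edge("input", inactive), Edge("agent", agent), Edge("output", active)))
reordered = Interaction("binding", (Edge("agent", agent), Edge("output", active), Edge("input", inactive)))
relabeled = Interaction("binding", (Edge("input", inactive), Edge("agent", agent), Edge("output", inactive)))
```

Here `first` and `reordered` are equivalent because edge order does not matter, whereas `relabeled` differs because its output carries a different activity label, even though the internal molecule identifier is the same.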

An interaction is considered to require all of its agents and inputs, to be inhibited by any of its inhibitors, and to produce all of its outputs. Therefore, if there are known to be two alternative means by which a given input is modified to produce a given output, then these are represented in two separate interactions; they are not combined into a single interaction. This is shown in Figure 1, where there are two alternative paths by which inactive CHEK1 is modified to produce active CHEK1+. If interactions 102384 and 100606 were combined in a single interaction the interpretation would be that both active ATR+ and active ATM+ would be required to modify CHEK1 to CHEK1+.
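The difference between two separate interactions and one combined interaction can be made concrete with a simple Boolean reading of the rule above. This is an illustrative sketch only: the dictionary layout and the `can_proceed` helper are assumptions, and the database itself stores interactions rather than simulating them.

```python
def can_proceed(interaction, available):
    """Boolean reading of the rule: all inputs and agents must be present,
    and no inhibitor may be present."""
    required = interaction["inputs"] | interaction["agents"]
    blocked = interaction["inhibitors"] & available
    return required <= available and not blocked


# Two alternative routes to CHEK1+, modeled as two separate interactions.
separate = [
    {"inputs": {"CHEK1"}, "agents": {"ATR+"}, "inhibitors": set(), "outputs": {"CHEK1+"}},
    {"inputs": {"CHEK1"}, "agents": {"ATM+"}, "inhibitors": set(), "outputs": {"CHEK1+"}},
]

# The same participants merged into one interaction.
combined = {
    "inputs": {"CHEK1"},
    "agents": {"ATR+", "ATM+"},
    "inhibitors": set(),
    "outputs": {"CHEK1+"},
}

# Suppose only active ATR is available.
available = {"CHEK1", "ATR+"}

separate_ok = any(can_proceed(i, available) for i in separate)
combined_ok = can_proceed(combined, available)
```

With only ATR+ available, `separate_ok` is true and `combined_ok` is false, which is why the two routes must be kept as distinct interactions.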

Current Contents of the Database

At present (October, 2006), the Pathway Interaction Database contains data from two sources. The older source consists of pathways posted on the BioCarta web site (http://www.biocarta.com) prior to June 2004. The original BioCarta pathways were artistic renderings; the protein constituents were hyperlinked to corresponding gene identifiers but the connections between gene products were indicated only graphically. Thus, it was necessary to manually encode the implied connectivity in order to enter the pathways and interactions into the database. Curation of the imported BioCarta data was very limited as gene products were identified only to the level of the corresponding gene — no physical post-translational modifications, citations to the literature or evidence codes were recorded, and there was no formal review process.

The newer data source is the ongoing curation of pathway information by Nature Publishing Group under contract to the National Cancer Institute. The ‘NCI-Nature Curated’ data currently comprises more than 30 pathways, 1200 proteins, 500 complexes and 1300 interactions. Interactions in the NCI-Nature Curated data are annotated with citations and evidence codes, and capture relevant post-translational modifications of proteins. Each pathway is verified by one or more experts in the field. All data currently in the database is human, but the database is capable of storing data for other organisms. Currently the data is typically derived from published focused studies and small-scale experiments; however, there is no reason why the database could not include data derived from high-throughput protein–protein interaction scans. Interactions can be joined into larger networks if they share identical molecules regardless of the source of the data or the general nature of the interaction (enzymatic reaction, signalling, transcription, translocation). However, since the proteins in the BioCarta data source are defined only at the gene level and most of the proteins in the newer curated data are defined at the protein level, the web query interface limits a query to either one of the data sources. The BioCarta data in PID consists of approximately 250 pathways, 4000 proteins, 800 complexes and 3000 interactions.

Searching the PID

Each search against the PID specifies a set of interactions and an output format. The basic set of interactions is defined by giving the names of one or more predefined pathways and/or the names of one or more molecules and/or the names of one or more macroprocesses. The basic set of interactions includes each interaction that belongs to one of the specified predefined pathways, or that involves one of the molecules (either directly as a simple molecule or as part of a complex that is used in the interaction), or whose type is one of the macroprocesses, or that has a condition whose type is one of the macroprocesses. This basic set of interactions may then be augmented by including interactions that are immediate graph predecessors or immediate graph successors in the global network defined by the database. An interaction A is a predecessor of interaction B if a molecular species output from A is input to B; successor is defined conversely. Finally, the basic set of interactions may be filtered by removing interactions that do not belong to the specified data source (NCI-Nature Curated or BioCarta) and by removing interactions that have not been annotated with specified evidence codes.
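The predecessor and successor augmentation can be sketched as follows. The data layout, the function names, and the toy network are assumptions for illustration; in particular, the sketch takes "input to B" literally and does not treat agents or inhibitors as inputs.

```python
def is_predecessor(a, b):
    """A is a predecessor of B if a molecular species output from A is input to B."""
    return bool(a["outputs"] & b["inputs"])


def augment(basic_ids, network, upstream=False, downstream=False):
    """Add immediate graph predecessors and/or successors to a basic set."""
    result = set(basic_ids)
    for candidate_id, candidate in network.items():
        for base_id in basic_ids:
            base = network[base_id]
            if upstream and is_predecessor(candidate, base):
                result.add(candidate_id)
            if downstream and is_predecessor(base, candidate):
                result.add(candidate_id)
    return result


# Toy global network: a linear chain A -> B -> C -> D.
network = {
    1: {"inputs": {"A"}, "outputs": {"B"}},
    2: {"inputs": {"B"}, "outputs": {"C"}},
    3: {"inputs": {"C"}, "outputs": {"D"}},
}

with_upstream = augment({2}, network, upstream=True)
with_both = augment({2}, network, upstream=True, downstream=True)
```

Only immediate neighbors are added: starting from interaction 1 with `downstream=True` yields interactions 1 and 2, but not 3.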

Search results can be obtained in either text or graphic formats. Text output is either BioPAX, Level 2 (http://www.biopax.org) or a simpler native PID XML format. BioPAX has the advantage of being a standard format. In addition, there is a BioPAX plugin for the freely available Cytoscape pathway analysis and visualization tool (http://www.cytoscape.org). Graphic output is either SVG or GIF. The application that generates the graphs checks the set of interactions for equivalence, removes duplicate interactions, and then creates a separate graph node for each unique molecular species, interaction and macroprocess. The result may be a single connected graph or a set of disjoint graphs. A set of disjoint graphs is ordered from the largest (containing the largest number of interactions) to the smallest. The final graphic output is generated by the freely available GraphViz program (http://www.graphviz.org). Each interaction and each molecule in the graphic output is hyperlinked to a text page of information. The interaction page contains the basic information about interaction type, the roles of participating molecules and the predefined pathway(s) in which the interaction is involved. In addition, the interaction page provides evidence codes and literature citations. Equivalent interactions in the database are cross-referenced from the interaction page. The molecule page contains information about the species used in this interaction (the identity of the molecule together with location and activity state labels). An additional hyperlinked text page contains information about the roles of the molecule in other interactions and complexes in the database. If the molecule is a complex, the text page provides the composition of the complex. If there are other complexes in the database with the same components, these are also identified.
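Splitting a result into disjoint graphs and ordering them by size is a standard connected-components computation. The following is a minimal sketch assuming each interaction is reduced to the set of molecular species it touches; the function name and input format are hypothetical, and this is not the code used by the PID graphing application.

```python
from collections import defaultdict


def disjoint_graphs(interactions):
    """Group interactions that share a molecular species, largest group first.

    `interactions` maps an interaction id to the set of species it involves.
    """
    by_species = defaultdict(set)
    for interaction_id, species in interactions.items():
        for s in species:
            by_species[s].add(interaction_id)

    seen = set()
    components = []
    for start in interactions:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            current = stack.pop()
            component.add(current)
            for s in interactions[current]:
                for neighbor in by_species[s]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
        components.append(component)

    # Order from the largest number of interactions to the smallest.
    return sorted(components, key=len, reverse=True)


toy = {
    10: {"X", "Y"},
    1: {"A", "B"},
    2: {"B", "C"},
}
graphs = disjoint_graphs(toy)
```

In the toy data, interactions 1 and 2 share species B and so form the larger graph, which is listed before the single-interaction graph containing interaction 10.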

Practical examples of queries

Searching predefined pathways

The PID allows browsing of interactions across the boundaries of predefined pathways. If a user is interested in ceramide signalling, for instance, searching the collection of predefined pathways with the keyword ‘ceramide’ will retrieve the large predefined ‘Ceramide Signalling Pathway’ consisting of 56 interactions (Figure 2). Using the hyperlinks in the graphic representation of this pathway, one can select a protein or complex of interest and easily navigate to lists or graphic displays of all interactions in the global network involving that protein or complex. For example, following the hyperlink on the BAD protein shows a role for the same protein in the BAD/PKC delta/Alpha-Synuclein complex and in several other predefined pathways (Figure 3).

Molecule queries

Instead of searching a particular predefined pathway, one could search the database with a molecule of interest. The molecule search of ‘ceramide’ retrieves a smaller network of 19 interactions, all involving ceramide, across two predefined pathways (‘Ceramide Signalling Pathway’ and ‘FAS Signalling Pathway (CD95)’). Searching on the highly connected protein HRAS in the NCI-Nature Curated data retrieves 18 interactions across four pathways (‘Fc Epsilon Receptor I Signalling in Mast Cells’, ‘EPO Signalling Pathway’, ‘Signalling events mediated by stem cell factor receptor (c-Kit)’, and ‘Signalling pathways activated by Hepatocyte Growth Factor Receptor (c-Met)’). In the BioCarta data, HRAS is found in more than 40 interactions across more than 30 pathways.

Upstream and downstream connections

The PID supports queries based on network connectivity, such as: If protein X is mutated, what downstream processes can be affected? Given a list of genes that are commonly mutated in a cancer phenotype, are the corresponding proteins close neighbors in the global interaction network? Are there multiple mechanisms leading to the activation of protein Y? Suppose, for example, that one wished to see the larger context of interactions involving ceramide, in particular the set of interactions that are immediately upstream of the 19 interactions involving ceramide. This is accomplished by adding the ‘upstream’ qualifier to the molecule search. In the resulting output, the upstream interaction nodes are colored brown to distinguish them from the 19 interactions that directly involve ceramide (Figure 4).

Connected molecule queries

It is also possible to search on multiple molecules of interest simultaneously. Suppose one wished to see whether two molecules, ceramide and CD22, might each be playing a role in a larger process. Searching the database for all interactions involving ceramide and CD22 results in two separate networks: the network of 19 interactions involving ceramide and a smaller network of two interactions involving CD22. Before concluding that these molecules are not co-players in a larger process, the context of the search can be expanded to include upstream and downstream interactions. This still shows disconnected subgraphs with all the ceramide interactions in one subgraph and all the CD22 interactions in another subgraph. The ‘connected molecules’ search allows users to see if there is any possible path connecting two or more molecules of interest. This search is able to construct a single graph that connects ceramide and CD22; in this case, the graph suggests that they do not have a common upstream or downstream event (Figure 5). It should be noted that the connected molecules search finds one, but not all, of the possible connecting paths.
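The article does not describe how the connected molecules search is implemented. As one plausible illustration of a search that returns a single connecting path rather than all of them, the sketch below runs a breadth-first search over an undirected graph in which molecules and interactions alternate. The function name, the input format, and the toy data are all assumptions.

```python
from collections import defaultdict, deque


def find_connecting_path(interactions, source, target):
    """Return one path of alternating molecule and interaction names, or None.

    `interactions` maps an interaction id to the set of molecules it involves.
    Edge direction is ignored in this sketch.
    """
    adjacency = defaultdict(set)
    for interaction_id, molecules in interactions.items():
        for m in molecules:
            adjacency[("molecule", m)].add(("interaction", interaction_id))
            adjacency[("interaction", interaction_id)].add(("molecule", m))

    start, goal = ("molecule", source), ("molecule", target)
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node[1])
                node = parent[node]
            return path[::-1]
        for neighbor in adjacency[node]:
            if neighbor not in parent:
                parent[neighbor] = node
                queue.append(neighbor)
    return None


toy = {
    "i1": {"A", "B"},
    "i2": {"B", "C"},
    "i3": {"X", "Y"},
}
path = find_connecting_path(toy, "A", "C")
no_path = find_connecting_path(toy, "A", "X")
```

Breadth-first search stops at the first route it reaches, so, like the search described above, it reports one connecting path even when several exist.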

Batch queries

A powerful use of the PID is to search larger numbers of molecules simultaneously, using the ‘batch query’. Input to the batch query consists of either one or two lists of molecules. A list might contain genes found to be significantly over-expressed in a set of samples, genes known to be mutated in a cancer phenotype, or genes located in regions of genomic amplification or deletion. In the graphic output, the molecule names will be colored differently, blue or red, depending on which list they were in. Since there are no restrictions on the composition of the lists, the same molecule may appear on both lists, in which case the molecule name is displayed in purple. Names of molecules that are not in either list appear black in the graphic output. The query has two versions: pathway-oriented and molecule-oriented. In the pathway-oriented version, the user asks to see a predefined pathway colored according to the molecule lists. In the molecule-oriented version, the user asks to see a network constructed from interactions involving the molecules on the lists. In either version, the user can ask that upstream and/or downstream interactions also be included.
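The coloring rule for the batch query is easy to state as a small function. This sketch is illustrative only: the function name is hypothetical, and the text does not say which list is shown in blue and which in red, so the assignment below (first list blue, second list red) is an assumption.

```python
def name_color(molecule, first_list, second_list=frozenset()):
    """Pick the display color for a molecule name in the batch query output."""
    in_first = molecule in first_list
    in_second = molecule in second_list
    if in_first and in_second:
        return "purple"
    if in_first:
        return "blue"   # assumption: the first list maps to blue
    if in_second:
        return "red"    # assumption: the second list maps to red
    return "black"


# Toy lists with placeholder gene names.
over_expressed = {"G1", "G2"}
mutated = {"G2", "G3"}

colors = {m: name_color(m, over_expressed, mutated) for m in ["G1", "G2", "G3", "G4"]}
```

With a single list, the second argument can be omitted, and every molecule is then either blue or black.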

As an example of using a single molecule list, consider the list of 355 cancer genes in the Sanger Institute’s Cancer Gene Census [1]. The current list of genes in the census can be found at http://www.sanger.ac.uk/genetics/CGP/Census. Applying this list to the ‘Signalling pathways activated by Hepatocyte Growth Factor Receptor (c-Met)’ pathway, we find that 8 genes (HRAS, SHP2, CBL, SEK1, Beta Catenin, MET, PTEN, and PI3K catalytic alpha polypeptide) appear in the pathway, affecting a considerable number of the interactions (Figure 6). If instead of specifying a single pathway, we ask for the network created by all the molecules on the list, we find that 56 of the molecules are represented in the curated data; the result is several disconnected subgraphs, the largest of which has several hundred interconnected interactions. As a final example, consider the lists of genes found to be mutated in breast and colon cancer [2]. Twenty-four of the genes on these lists are involved in interactions in the BioCarta data. Again, the interactions form one very large connected graph and several smaller disjoint graphs. The largest graph (Figure 7) contains interactions that are particular to the genes on each list as well as a small number of interactions that involve genes on both lists.

The application of PID to complex biological problems

High-throughput sequencing and expression measurements provide a framework for understanding the organization of genes and their products, and for recording observations about specific phenotypes. Thus, the linear map of the genome allows us to record the fact that loss of heterozygosity in a region of 17q21 is found in 30% of primary breast tumors [3], and expression analysis can identify clusters of genes that separate BRCA1-positive from BRCA2-positive breast cancer [4]. The organization of molecules in signalling networks offers a different framework in which to record observations, a framework based not on linear contiguity or principal component analysis, but on basic cause/effect and producer/consumer relations. This framework has the significant advantage that it can interrelate observations from multiple kinds of molecular abnormalities, based on their potential biological effects. Using the example in Figure 1, the network can relate independently observed anomalies, such as mutation of ATR and decreased expression of ATM, by showing that each of these anomalies targets the same process, the activation of CHEK1. This kind of reasoning is expected to play an increasingly important role in the discovery of novel therapeutic targets.

Conclusion

The Pathway Interaction Database is an innovative resource which aims to provide new insights into biological problems by exploiting biomolecular interaction data. The database contains fully curated, network-level representations of interactions and pathways. A carefully designed web search interface supports simple browsing across predefined pathways, construction of larger networks around molecules and predefined pathways, as well as analysis and visualization of lists of targeted molecules in the context of predefined and novel networks.

Author: Carl Schaefer

Affiliation: National Cancer Institute Center for Bioinformatics, National Institutes of Health, Rockville, MD, 20852 USA

Original Research Papers

[1] Futreal, Andrew et al. A census of human cancer genes. Nat Rev Cancer 4: 177–183

[2] Sjöblom, Tobias et al. The consensus coding sequences of human breast and colorectal cancers. Science 314: 268–274

[3] DeMarchis, L et al. Candidate target genes for loss of heterozygosity on human chromosome 17q21. British Journal of Cancer 12: 2384–2389

[4] Hedenfalk, Ingrid et al. Gene-Expression Profiles in Hereditary Breast Cancer. The New England Journal of Medicine 344: 539–548
