Biological Hidden Markov Models (HMMs)

A profile is essentially a multiple sequence alignment. Multiple nucleotide or amino acid sequence alignments are usually performed for one or more of the following reasons:

  • To characterize protein families by identifying shared regions of homology in a multiple sequence alignment (generally after a sequence search has revealed homologies to several sequences).
  • To determine the consensus sequence of several aligned sequences, giving a single representation of the group.
  • To create a profile hidden Markov model (HMM), a statistical model representing the family of sequences (e.g. Pfam).
  • As an aid in predicting the secondary and tertiary structures of new sequences.
  • As a preliminary step in molecular evolution analyses that use phylogenetic methods to construct phylogenetic trees.


What are biological HMMs? A first-order, discrete HMM is a stochastic method for generating sequences of characters from some finite alphabet. It is characterized by a set of states S, an alphabet A, a probability transition matrix T = {Tji}, where Tji gives the probability of a transition out of state i into state j, and a probability emission matrix E = {EiX}, where EiX gives the probability of emitting symbol X while in state i. For DNA, the alphabet consists of the set {T, C, G, A}; for proteins, it consists of the one-letter codes for the 20 naturally occurring amino acids. One can also devise structural letters, standing for HELIX, BETA SHEET, etc., and include these in the alphabet. The particular alphabet chosen will depend on the system being modeled and the information being encoded.

The elements of the transition matrix Tji describe the probabilities of moving from any one state i to the next state j. Upon moving from one state to the next, the system emits symbols from the alphabet according to the emission probability distribution E; thus E2G describes the probability of emitting the symbol 'G' in state 2. A cartoon of a simple HMM will help clarify the picture:

Figure 1: A simple two-state HMM.

In Fig. 1 there are two states, (1) and (2). The transition probabilities between them define the transition matrix T:

T11 = 0.99   T12 = 0.10
T21 = 0.01   T22 = 0.90

The emission matrix E looks like this:

E1A = 0.4   E2A = 0.05
E1C = 0.1   E2C = 0.4
E1G = 0.1   E2G = 0.5
E1T = 0.4   E2T = 0.05

Note that the sum of all transition probabilities out of a particular state is 1: T11 + T21 = 1. Also, the HMM is first-order, meaning that the choice of the next state depends only on the current state. Higher-order HMMs are possible, but are not discussed in this report. Note that in this model, state (1) generates AT-rich sequence, whereas state (2) generates GC-rich sequence. The state sequence shown in the figure represents a path through the HMM. In general, this path is unknown; the only observable quantity is the emission of characters (the DNA sequence generated by the model). This is the reason behind the name "hidden" Markov model. The state sequence is hidden because the system chooses randomly which state it will go into next, and which symbol that new state will emit. The random choices are biased by the probability distributions T and E, in a manner similar to loaded dice: imagine you are playing a game with a pair of dice loaded to favor rolling pairs of 6's. You do not expect to roll two 6's 100% of the time, but you know that most of the time you will. The HMM operates the same way on the probability distributions that define its behavior.
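To make this concrete, below is a minimal Python sketch (the original report contains no code) that generates sequences from the two-state model above, using the T and E values from Fig. 1. The function name and the choice of starting state are illustrative assumptions.

    import random

    # Transition probabilities from Fig. 1: T[j][i] = P(next state j | current state i).
    T = {1: {1: 0.99, 2: 0.10},
         2: {1: 0.01, 2: 0.90}}

    # Emission probabilities from Fig. 1: E[i][x] = P(emit symbol x | state i).
    # State 1 is AT-rich; state 2 is GC-rich.
    E = {1: {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},
         2: {'A': 0.05, 'C': 0.4, 'G': 0.5, 'T': 0.05}}

    def generate(n, state=1):
        """Generate n symbols from the HMM; returns the hidden path and the sequence."""
        path, seq = [], []
        for _ in range(n):
            path.append(state)
            symbols = list(E[state])
            seq.append(random.choices(symbols, weights=[E[state][x] for x in symbols])[0])
            # Draw the next state from the transition probabilities out of `state`.
            state = random.choices([1, 2], weights=[T[1][state], T[2][state]])[0]
        return path, ''.join(seq)

    path, seq = generate(60)
    print(seq)   # the observable sequence
    print(path)  # the hidden state path (unknown in practice)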

We cannot know the exact state sequence that produced a particular observed sequence, but we can ask what the most likely state sequence is, given that sequence. To do this, we must align the sequence to the HMM. In general, there are three fundamental questions we can ask with regard to HMMs:

  • Given a sequence and an existing HMM, what is the probability that the HMM could generate the sequence? This is called the scoring problem (a minimal sketch of this computation follows the list).
  • Given a sequence modeled by a particular HMM, what is the most likely state sequence (path through the HMM) that could have given rise to the observed sequence? This is sometimes referred to as the alignment problem.
  • Assuming that the transition and emission parameters of our HMM are not known with certainty, how can we use a large amount of sequence data to revise our assumptions about these parameters and refine the model? This is called the training problem.

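As a sketch of the scoring problem, the forward algorithm below computes log P(s | M) for the two-state model coded above, summing over all possible state paths. The uniform initial state distribution is an assumption, since the original model does not specify one; for long sequences one would rescale or work entirely in log space to avoid underflow.

    import math

    def forward_log_prob(seq, T, E, start=None):
        """Scoring problem: log P(seq | M), summed over all state paths."""
        if start is None:  # assume a uniform initial distribution
            start = {i: 1.0 / len(E) for i in E}
        # f[i] = P(prefix emitted so far, current state = i)
        f = {i: start[i] * E[i][seq[0]] for i in E}
        for x in seq[1:]:
            f = {j: E[j][x] * sum(T[j][i] * f[i] for i in f) for j in E}
        return math.log(sum(f.values()))

    print(forward_log_prob("ATATTATTAA", T, E))  # AT-rich: well explained by state 1
    print(forward_log_prob("GCGCCGGCGC", T, E))  # GC-rich: well explained by state 2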
A More Useful Example: The example HMM shown above is useful for gaining a basic understanding of HMMs, but it would not be useful in practice, since in modeling a set of sequences we expect to encounter insertions and deletions. Thus, a model incorporating insert and delete states is required. We would also like an architecture better suited to sequence data: one that can account for sequences of different lengths, as well as for variation in base (or amino acid) composition as a function of sequence position. The figure below shows one type of HMM architecture that accomplishes this.

Figure 2: Biological profile-HMM architecture.

In Fig. 2, red lines indicate higher-probability transitions and represent the main line through the HMM. In this architecture, triplets of states consisting of one delete (d), one insert (i), and one match (m) state are grouped into nodes. Each node represents a position along the sequence. Note also that the begin and end states are represented explicitly. Each match state has a set of emission probabilities that determine which characters from the alphabet are most likely. The sequence of match states forms a model for a family of sequences or, alternatively, the columns of a multiple alignment. Delete states allow the HMM to pass through a node without emitting any symbols, and thus account for deleted regions in a sequence, or in a multiple alignment. Insert states are used when a small number of sequences have regions that are not found in the other sequences; this can occur, for example, when the model includes sequences of different lengths. Below is an example alignment made using the SAM software package, along with the HMM that produced it. The sequences aligned were:

	Sequence_A = ATGCAAAGTGT
	Sequence_B = AGGCGTGT
	

And the resulting multiple alignment was:

Sequence A:  A T G C a a A G T G T
Sequence B:  A G G C . . - G T G T

where a "." means that there were no characters in the original sequence at this position, a "-" indicates a deletion,  and lowercase letters are for characters in one sequence that have no corresponding matches in the other sequences. The corresponding HMM is pictured below, with red lines indicating the path through the HMM, and colored characters representing emissions in the various states

Figure 3: Graph representation of a simple HMM.

Note that sequence B passes through a delete state at the position marked "-", since it has no character corresponding to that match state. During the building of the model, nodes are added and deleted in an attempt to find the best match. In this case, the system chose to emit the letter 'A' for sequence A and a delete character '-' for sequence B, rather than continue in insertion mode with '3a' instead of '2a' in node 4. The behavior of the system depends on the transition frequencies, as well as on the penalties for creating gaps; these can be adjusted through the use of regularizers.
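The node structure described above can be made concrete with a small sketch. The representation below is illustrative only; it is not SAM's internal data structure, and the field names are my own.

    from dataclasses import dataclass

    @dataclass
    class Node:
        """One node of a profile HMM: match (m), insert (i), and delete (d) states."""
        match_emissions: dict   # P(symbol | match state) for this alignment column
        insert_emissions: dict  # P(symbol | insert state), often background frequencies
        transitions: dict       # e.g. {('m','m'): 0.95, ('m','i'): 0.03, ('m','d'): 0.02, ...}

    # A profile HMM is then essentially a begin state, a list of such nodes
    # (one per column of the multiple alignment), and an end state.
    background = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
    profile = [
        Node(match_emissions={'A': 0.90, 'C': 0.03, 'G': 0.04, 'T': 0.03},
             insert_emissions=background,
             transitions={('m', 'm'): 0.95, ('m', 'i'): 0.03, ('m', 'd'): 0.02}),
        # ... one Node per alignment column
    ]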

HMM parametrization: A profile HMM can model any sequence over a given alphabet; it therefore defines a probability distribution over the entire space of sequences. Parametrization attempts to make this distribution peak around the members of the family of interest (e.g. G-protein-coupled receptors, cytokines, etc.). The parameters of a profile HMM are the values of the probabilities and the length of the model. How do we obtain these parameters at the outset? There are a number of different approaches, the maximum likelihood method being the most straightforward. In this method, the maximum likelihood estimate for, say, the emission probabilities depends on the observed frequency of a particular residue at a particular position. If fja is the frequency of residue a in position j of the alignment, then the maximum likelihood estimates for the emission probabilities are:

e_{M_j}(a) = \frac{f_{ja}}{\sum_{a'} f_{ja'}} \qquad (1)

These can be calculated from an initial alignment (often called a seed alignment) of the proteins in the family being modeled. A problem arises when the frequency of a given residue in a particular column is zero. This can easily happen in a protein alignment, since there are twenty amino acids and not all of them will be represented in a given column. A way around this is to add small probabilities to the frequency counts for a given column and renormalize. This is the method of pseudocounts. The maximum likelihood estimate for the emission probabilities with pseudocounts is:

e_{M_j}(a) = \frac{f_{ja} + A\, q_a}{\sum_{a'} f_{ja'} + A} \qquad (2)

where qa is the pseudocount for character a and A is a weighting factor (usually around 20 for protein alignments). Note that when very little data is available, eMj(a) is approximately equal to qa, because the real frequencies are all small compared to A. As more data becomes available, the influence of the pseudocounts diminishes. The addition of pseudocounts is a method of regularization: it incorporates prior information about the family of sequences we want to model without our having seen all the data in the form of an alignment. This makes intuitive sense and is usually easy to implement; however, to obtain reasonable estimates of the parameters, one needs on the order of 50 or more sequences in the alignment (the more the better). Another way of including prior information is through the use of what are known as Dirichlet mixtures. These are mixtures of Dirichlet distributions (distributions over the multinomial frequency parameters), which lead to emission estimates of the form:

e_{M_j}(a) = \sum_{k} P(k \mid f_j)\, \frac{f_{ja} + \alpha_{ka}}{\sum_{a'} \left( f_{ja'} + \alpha_{ka'} \right)} \qquad (3)

where

P(k \mid f_j) = \frac{p_k\, P(f_j \mid k)}{\sum_{l} p_l\, P(f_j \mid l)} \qquad (4)

In these two equations, the sum runs over the k components of the Dirichlet mixture, fj is the vector of frequencies of the different characters at position j, and the αk are the parameters of the individual Dirichlet components. Equation 4 is an application of Bayes' rule, where pk is the prior probability of mixture component k and P(fj | k) is the probability of the observed frequencies under Dirichlet component k. You might ask, though, on what are these Dirichlet distributions based? What is the biological significance of using such a complex form for the prior information? Usually, the distributions are based on analyses of very large data sets coming from multiple sequence alignments of many different protein families; in the case of the SAM software, alignments from the BLOCKS and HSSP databases are used. The use of Dirichlet mixtures has been shown to reduce the number of false positives and false negatives when testing sequences against the model.
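Returning to the simpler pseudocount scheme of Eq. 2, the following sketch shows how the regularized emission probabilities might be computed for one alignment column; the interface and the uniform background distribution are illustrative assumptions.

    def emission_probs(counts, q, A=20):
        """Emission estimates with pseudocounts (Eq. 2).

        counts -- observed residue counts f_ja for one alignment column
        q      -- background (pseudocount) distribution q_a over the alphabet
        A      -- pseudocount weight (around 20 is typical for protein alignments)
        """
        total = sum(counts.values())
        return {a: (counts.get(a, 0) + A * q[a]) / (total + A) for a in q}

    q = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
    print(emission_probs({'A': 2, 'T': 1}, q))      # few counts: estimates stay near q
    print(emission_probs({'A': 200, 'T': 100}, q))  # many counts: data dominate the prior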

Scoring of Sequences: Once a model has been built, one might want to know how well a certain sequence fits the model. This amounts to asking what the probability is of a particular sequence being generated by the model, written P(s | M) (read: the probability of s given M). This process of identifying sequences is known as scoring. There are many functions available for scoring, none of them representing the single best method. Two methods commonly used with HMMs are the Viterbi and Baum-Welch algorithms. The Viterbi method calculates the probability of the best single path through the HMM, whereas the Baum-Welch algorithm uses a sum over all paths. Viterbi is usually not recommended for database discrimination, but is widely used in alignment.
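For comparison with the forward (all-paths) computation sketched earlier, here is a minimal Viterbi sketch for the same two-state model. A traceback, omitted for brevity, would recover the best path itself, which is what an alignment requires.

    import math

    def viterbi_log_prob(seq, T, E, start=None):
        """Alignment problem: log-probability of the single best state path."""
        if start is None:
            start = {i: 1.0 / len(E) for i in E}
        v = {i: math.log(start[i] * E[i][seq[0]]) for i in E}
        for x in seq[1:]:
            v = {j: math.log(E[j][x]) + max(math.log(T[j][i]) + v[i] for i in v)
                 for j in E}
        return max(v.values())

    # The best single path accounts for at most the all-paths total, so
    # viterbi_log_prob(s, T, E) <= forward_log_prob(s, T, E) for any s.
    print(viterbi_log_prob("ATATTATTAA", T, E))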

Scores can be represented in different ways; the negative logarithm of P(s | M), called the negative log-likelihood (NLL) cost, is the representation consistent with the Bayesian statistical framework on which HMMs are built. The NLL cost, however, is strongly dependent on sequence length, so raw NLL costs cannot be compared fairly across sequences of different lengths. Thus, scores are often reported as a log-odds score, which is the NLL cost minus the NLL cost for a null model. With this convention, more negative scores correspond to better matches of the sequence to the model. The null model attempts to match all of the sequences in the universe of possible sequences.
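In symbols, the scoring quantities just described are:

\mathrm{NLL}(s \mid M) = -\log P(s \mid M), \qquad \text{log-odds}(s) = \mathrm{NLL}(s \mid M) - \mathrm{NLL}(s \mid \mathrm{null})

so a sequence that fits the model much better than it fits the null model receives a large negative log-odds score.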

Results for Voltage-Gated Potassium Channels: Below is an example, which I carried out in 1998, using the family of voltage-gated ion channel proteins (VGICs). These are among the most ancient proteins, and are thought to have diverged from a common ancestor over 2.5 billion years ago. Voltage-gated channels help control transmembrane ion gradients and can be coupled to the synthesis of ATP; they also play a role in many signaling pathways. I have focused on voltage-gated potassium channels in this section because they are among the simplest channels.

To build the initial model, 87 potassium channel sequences were downloaded from the Entrez database and placed in a flat file. I used the SAM software to build the first model, using Dirichlet priors and including all of the sequences in the training set. An initial multiple alignment of all 87 sequences contained many gaps and insertions and was generally not satisfactory. One reason for this is that, because of highly variable loop regions, the sequences vary greatly in length. In addition, potassium channels comprise a number of subfamilies, which vary in the structure of their transmembrane (TM) regions. I therefore focused the alignment effort on the inward-rectifier potassium channels.

By aligning only these sequences with the model built from all 87 sequences, I was able to obtain a very good alignment and a reasonable phylogenetic tree. This alignment agreed well with a multiple alignment of this family of proteins available in the Pfam database. As can be seen in the tree, the members of the subfamilies from different organisms group together (e.g. IRK9 from mouse, rat, and human all group together, as expected). This simple example shows that a more general model, trained on channels spanning multiple subfamilies, can be used effectively to create a multiple alignment for a subset of those families. Also included here is the complete alignment of the whole set of sequences, made by retraining the model with some of the SAM options, such as model surgery and noise, turned off. It should be noted that building a model to find remote homologies is slightly different from building a model for multiple alignment. A model that is fairly general, and does not over-represent very similar classes of proteins, is best for scoring a database for remote homologies. One way to achieve this is to use internal weighting to de-emphasize sequences that fit the model too well. Another is to select many related proteins for the training, for example by selecting protein neighbors from the Entrez database.

References:

Baldi, P., and Brunak, S. Bioinformatics: The Machine Learning Approach. MIT Press, Cambridge, MA, 1998. ISBN 0-262-02442-X.

Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge University Press, 1998. ISBN 0-521-62041-4.

Karplus, K., Barrett, C., and Hughey, R. Hidden Markov Models for Detecting Remote Protein Homologies. Preprint, to appear in Bioinformatics, 1999.