Sequence-Structure Relationships through Phylogenetic Analysis

	
	William J. Bruno, Ph.D.
	Theoretical Biology and Biophysics, Mail Stop K-710
	Los Alamos National Laboratory
	Los Alamos, NM 87545
	(505) 665-3802, FAX: (505) 665-3493
	billb@lanl.gov
	www.t10.lanl.gov/billb


In the world of sequence analysis, two trends have been clear:
the number of sequences in the databases keeps going up, and
the use of multiple sequence alignments--tacitly implying 
the use of evolutionary analysis--is of increasing importance.
This combination of trends has led to some real progress
in 3D protein structure prediction, exemplified by a largely correct
prediction of a new fold in CASP3. This was achieved by the Baker
group for target 56, using a method heavily dependent on finding
evolutionary patterns in multiple alignments that correlate with
protein structure.

This is a great achievement, but significant improvements are still
needed, especially for dealing with larger proteins.  Even for target
56, a small protein of 114 residues, the correct structure was ranked
5th in the Baker group's list of 5 candidates.  For larger proteins,
the number of high-scoring candidates tends to explode, and the size
of the search space from which the candidates are drawn creates a
heavy computational burden as well.

Including evolutionary covariation analysis, whenever a good multiple
alignment can be constructed, offers a way to improve on the Baker
method (or other methods) and extend it to larger proteins.  To the
extent that the Baker method includes evolutionary information, it
does so only very locally in the sequence.  If coevolution of two
residues far apart on the sequence indicates that they are close
together in 3D, this provides a powerful additional constraint.  Such
constraints can help make the problem tractable for large proteins by
cutting down the size of the conformational search space.  They can
also be used to improve the rankings of candidates for proteins of all
sizes.

Many methods exist and have been used to detect correlated mutations
in multiple alignments.  However, protein researchers tend to regard
evolution as ``an annoyance'': it is known to undermine the statistical
foundation of most such methods, yet it is usually swept under the rug.
The problem is that counting the mutations that lead to residue
replacement requires an evolutionary analysis, and the statistics of
the covariation problem are too sensitive to allow this to be ignored.

We have created a program called Rind2, based on the same principles
and model as the Rind program.  Rind analyzes the amino acid usage in
a protein alignment one column at a time, using a likelihood model on
an evolutionary tree.  Rind2 does essentially the same thing, except
that it considers two columns at a time.  Using this program to
predict contacts also requires a way of computing statistical
significance, which we have implemented with Dr. Aaron Halpern of the
University of New Mexico.
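
As a rough illustration of what a per-column likelihood on a tree
involves, the sketch below applies Felsenstein's pruning algorithm to
a single alignment column.  It is not Rind's code or model: the
equal-rates 20-state substitution model, the nested-tuple tree
encoding, and the names transition_prob, partial_likelihoods and
column_likelihood are all assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
K = len(AMINO_ACIDS)


def transition_prob(same, t):
    """Equal-rates K-state model (an assumption, not Rind's model).

    t is the branch length in expected substitutions per site.
    """
    decay = math.exp(-K * t / (K - 1))
    if same:
        return 1.0 / K + (K - 1.0) / K * decay
    return 1.0 / K - decay / K


def partial_likelihoods(node, column):
    """Return L[x] = P(residues observed below node | state x at node).

    A leaf is a sequence name (str); an internal node is a tuple of
    (child, branch_length) pairs.  column maps leaf names to residues.
    """
    if isinstance(node, str):
        observed = column[node]
        return [1.0 if aa == observed else 0.0 for aa in AMINO_ACIDS]
    result = [1.0] * K
    for child, t in node:
        child_l = partial_likelihoods(child, column)
        p_same = transition_prob(True, t)
        p_diff = transition_prob(False, t)
        total = sum(child_l)
        for x in range(K):
            # Sum over child states y of P(y | x, t) * L_child[y].
            result[x] *= p_same * child_l[x] + p_diff * (total - child_l[x])
    return result


def column_likelihood(tree, column):
    """Likelihood of one column, assuming a uniform root distribution."""
    return sum(partial_likelihoods(tree, column)) / K


# Hypothetical three-sequence tree: (seq1, seq2) joined, then seq3.
inner = (("seq1", 0.1), ("seq2", 0.1))
tree = ((inner, 0.2), ("seq3", 0.3))

conserved = column_likelihood(tree, {"seq1": "A", "seq2": "A", "seq3": "A"})
variable = column_likelihood(tree, {"seq1": "A", "seq2": "W", "seq3": "C"})
print(f"conserved column: {conserved:.3e}   variable column: {variable:.3e}")
```

With short branches, a fully conserved column receives a higher
likelihood than a column with three different residues, which is the
kind of signal a column-by-column analysis can exploit.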

We submitted predictions from Rind2 into the CASP3 experiment last
year for target 62.  Unfortunately, the coordinates for target 62 were
not ready in time, and it was dropped from the experiment.  Thus, in
the absence of true prediction results, we can only discuss some very
promising ``post-diction'' results.  Our best results are for the SH3
domain, where we predicted 11 contact pairs, 9 of which have van der
Waals surfaces that come within 1.1 Angstroms.  The other 2 pairs have
highly variable distances depending on which structure one uses, and
one of them forms a van der Waals contact in one NMR structure.  We
have also had success with much longer alignments, including the
tryptophan synthase proteins A and B in one long alignment of 633
residues.  In this case, no contact pairs can be predicted with 95
percent confidence.  However, of the 10 highest scoring pairs, 5 are
in direct contact.  With careful use of statistics, we believe these
pairs can be very useful.  We also have plans to improve the
sensitivity of the method using additional sources of information, which
could help us determine ahead of time which 5 of the 10 are the real
contacts.

Our work on this problem builds on our experience with the original
Rind, in addition to our work on constructing phylogenetic trees.  We
have found that the widely used Neighbor Joining (NJ) program is not
suitable for building the trees we need for our analyses when long
branches are present.  This prompted us to develop Weighted Neighbor
Joining, called Weighbor for short.  Weighbor correctly deals with the
large variances associated with long evolutionary distances, and
creates trees with much higher likelihoods, less bias, and better
resolving power than NJ.  Running the Rind programs with Weighbor
gives them better convergence properties and better results.  Weighbor
has also attracted the interest of evolutionary biologists, helping
us to recruit David Pollock, who is now a LANL Director's Fellow
postdoc in our lab.  Dr. Pollock has published papers showing the
importance of bona fide evolutionary modeling in using covariation to
predict protein structure, and recently published a likelihood ratio
test for this problem in JMB.
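
To illustrate why long evolutionary distances carry large variances,
the sketch below uses the simple Poisson-correction distance
d = -ln(1 - p), where p is the fraction of differing sites among n
compared sites, together with its delta-method variance
p / ((1 - p) n).  This is a textbook approximation chosen for
illustration, not the variance formula used by Weighbor; the function
name and the example values are assumptions.

```python
import math


def poisson_distance_and_variance(p, n):
    """Distance d = -ln(1 - p) and its approximate sampling variance.

    p: observed fraction of differing sites (0 <= p < 1)
    n: number of aligned sites compared
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    distance = -math.log(1.0 - p)
    # Delta method: Var(p) = p(1 - p)/n and dd/dp = 1/(1 - p).
    variance = p / ((1.0 - p) * n)
    return distance, variance


# Hypothetical alignment of 200 sites at increasing divergence.
for p in (0.1, 0.5, 0.9):
    d, var = poisson_distance_and_variance(p, n=200)
    print(f"p={p:.1f}  d={d:.3f}  sd={math.sqrt(var):.3f}")
```

The standard deviation grows much faster than the distance itself as
p approaches 1, so a tree-building method that treats all distances
as equally reliable gives long branches too much influence.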

Essential to the task of detecting covariation is having a good
multiple alignment.  Creating the multiple alignment is perhaps the
most critical step, because existing automated methods often work so
poorly that substantial manual intervention is needed.
Dr. Ian Holmes, who got his Ph.D. with Richard Durbin at the Sanger
Center, is working in our lab as a Fulbright Scholar.  He is applying
his expertise in Hidden Markov Models to the multiple sequence
alignment problem.  This may ultimately lead to a fully automated
method, but in the meantime it should at least offer a rigorous method
for identifying portions of a multiple alignment that are unreliable
and should be improved or discarded.

Another key person on the project will be Charlie Strauss of
CST-division.  Dr. Strauss will return shortly from his sabbatical with
the Baker group in Seattle, where he learned to use their ab initio
prediction program and has been investigating the effect of adding
tertiary contacts, including those predicted by Rind2.  His
preliminary results are very encouraging; they indicate that even if
some of the predicted contacts are incorrect, the correct contacts do
help select the correct structure from the other candidates.  David
Baker has also expressed interest in our contact predictions, and we
anticipate formation of a collaboration.

An interesting question is whether the contact predictions alone,
combined with basic notions like hard sphere repulsion, can sometimes be
sufficient to fold a protein.  Dr. Roger Sayle, formerly of Glaxo and
best known as the author of the RasMol program, now works for a small
company called Metaphorics and is interested in collaborating on this
question.  Specifically, he believes that recent results in distance
geometry suggest efficient ways of converting predicted contacts into
3D structures.  This approach has the advantage that one can start
with the most basic, fundamental constraints, and gradually add
more information as needed.  This might be important for determining
structures of proteins that are unusual compared to most of the PDB,
such as membrane proteins.
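
One standard ingredient of distance geometry is triangle-inequality
bound smoothing: every predicted contact and every chain bond gives an
upper bound on a pairwise distance, and shortest paths through those
bounds tighten the upper bounds on all other pairs.  The sketch below
shows only that step and is not Dr. Sayle's method.  The 3.8 Angstrom
spacing between consecutive alpha carbons is a standard value; the
8 Angstrom contact bound, the function name, and the toy contact are
assumptions.

```python
import math


def smooth_upper_bounds(n_residues, contacts, bond=3.8, contact_dist=8.0):
    """Tighten upper distance bounds with the triangle inequality.

    contacts: iterable of (i, j) residue index pairs predicted to touch.
    Returns an n x n matrix of upper bounds in Angstroms.
    """
    upper = [[math.inf] * n_residues for _ in range(n_residues)]
    for i in range(n_residues):
        upper[i][i] = 0.0
    # Consecutive residues along the chain are a fixed distance apart.
    for i in range(n_residues - 1):
        upper[i][i + 1] = upper[i + 1][i] = bond
    # Each predicted contact caps the distance between its two residues.
    for i, j in contacts:
        bound = min(upper[i][j], contact_dist)
        upper[i][j] = upper[j][i] = bound
    # Floyd-Warshall: d(i, j) <= d(i, k) + d(k, j) for every k.
    for k in range(n_residues):
        for i in range(n_residues):
            for j in range(n_residues):
                via_k = upper[i][k] + upper[k][j]
                if via_k < upper[i][j]:
                    upper[i][j] = via_k
    return upper


# Hypothetical 30-residue chain with one predicted end-to-end contact.
bounds = smooth_upper_bounds(30, contacts=[(0, 29)])
print(f"upper bound for residues 1 and 28: {bounds[1][28]:.1f} Angstroms")
```

A single long-range contact sharply reduces the upper bounds for many
other residue pairs, which is how a handful of predicted contacts can
shrink the conformational search space.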

To summarize, the main objective of this project is to predict
contacts useful for structure prediction, where useful may be defined
by the CASP experiment.  The essential difficulties, namely dealing
with evolutionary non-independence of the sequences and the statistics
associated with the problem, have already been overcome.  Making the
method more automated and reliable will require work on multiple
sequence alignment and implementation of better prior statistics.
As we find more examples of true structure-induced covariation, we can
look for examples that fit the established pattern, thereby increasing
the sensitivity.  We also intend to improve the speed of the method.
Avenues we are exploring for this include porting to parallel
machines, and developing a new method to incorporate importance
sampling into our Monte Carlo statistics program.
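
The appeal of importance sampling for significance testing is that
very small tail probabilities can be estimated without drawing
millions of samples from the null distribution.  The sketch below
shows the generic idea on a toy problem, the upper tail of a standard
normal, by sampling from a proposal shifted into the tail and
reweighting each sample by the density ratio.  It does not reflect the
internals of our Monte Carlo statistics program; the proposal, the
function name, and the threshold are assumptions.

```python
import math
import random


def tail_prob_importance(threshold, n_samples, seed=0):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by importance sampling.

    Samples come from the proposal N(threshold, 1); each sample that
    lands in the tail is weighted by the density ratio target/proposal.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # phi(x) / phi(x - c) = exp(c^2 / 2 - c * x), c = threshold.
            total += math.exp(0.5 * threshold**2 - threshold * x)
    return total / n_samples


estimate = tail_prob_importance(4.0, 20000)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
print(f"importance sampling: {estimate:.3e}   exact: {exact:.3e}")
```

Plain Monte Carlo with the same number of samples would typically see
no events at all beyond the threshold, whereas about half of the
proposal's samples land in the tail and contribute to the estimate.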