Sequence-Structure Relationships through Phylogenetic Analysis

William J. Bruno, Ph.D.
Theoretical Biology and Biophysics, Mail Stop K-710
Los Alamos National Laboratory
Los Alamos, NM 87545
(505) 665-3802, FAX: (505) 665-3493
billb@lanl.gov
www.t10.lanl.gov/billb

In the world of sequence analysis, two trends have been clear: the number of sequences in the databases keeps going up, and the use of multiple sequence alignments--tacitly implying the use of evolutionary analysis--is of increasing importance. This combination of trends has led to real progress in 3D protein structure prediction, exemplified by a largely correct prediction of a new fold in CASP3. This was achieved by the Baker group for target 56, using a method heavily dependent on finding evolutionary patterns in multiple alignments that correlate with protein structure.

This is a great achievement, but significant improvements are still needed, especially for dealing with larger proteins. Even for target 56, a small protein of 114 residues, the correct structure was ranked 5th in the Baker group's list of 5 candidates. For larger proteins, the number of high-scoring candidates tends to explode, and the size of the search space from which the candidates are drawn creates a heavy computational burden as well.

Including evolutionary covariation analysis, whenever a good multiple alignment can be constructed, offers a way to improve on the Baker method (or other methods) and extend it to larger proteins. To the extent that the Baker method includes evolutionary information, it does so only very locally in the sequence. If coevolution of two residues far apart in the sequence indicates that they are close together in 3D, this provides a powerful additional constraint. Such constraints can help make the problem tractable for large proteins by cutting down the size of the conformational search space. They can also be used to improve the rankings of candidates for proteins of all sizes.
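To make the idea of covariation between two alignment columns concrete, here is a minimal sketch of the simplest possible score, the mutual information between two columns. This is an illustration only, not the method proposed here: it treats the sequences as independent samples and ignores the evolutionary tree entirely. The function name and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_mutual_information(col_i, col_j):
    """Mutual information (in bits) between two alignment columns.

    Assumption: every sequence is treated as an independent observation,
    so shared ancestry (the phylogenetic tree) is ignored.
    """
    if len(col_i) != len(col_j):
        raise ValueError("columns must come from the same set of sequences")
    n = len(col_i)
    count_i = Counter(col_i)               # residue counts in column i
    count_j = Counter(col_j)               # residue counts in column j
    count_ij = Counter(zip(col_i, col_j))  # joint counts of residue pairs
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        # ratio of observed pair frequency to the product of the marginals
        ratio = n_ab * n / (count_i[a] * count_j[b])
        mi += (n_ab / n) * math.log2(ratio)
    return mi

# Hypothetical toy alignment: 6 sequences, 4 columns.
alignment = ["AKLE", "AKLE", "GRLD", "GRLD", "AKID", "GRIE"]
columns = list(zip(*alignment))

for i in range(len(columns)):
    for j in range(i + 1, len(columns)):
        score = column_mutual_information(columns[i], columns[j])
        print(f"columns {i},{j}: MI = {score:.3f} bits")
```

In this toy alignment the first two columns change together in every sequence and score 1 bit, while the first and third columns are statistically independent and score 0. A score of this kind cannot tell whether a pattern arose from many independent substitution events or from a single event inherited by many related sequences.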
Many methods exist and have been used to detect correlated mutations in multiple alignments. However, protein researchers tend to regard evolution as "an annoyance" that, while known to undermine the statistical foundation of most such methods, is swept under the rug. The problem is that counting the mutations that lead to residue replacement requires an evolutionary analysis, and the statistics of the covariation problem are too sensitive to allow this to be ignored.

We have created a program called Rind2, based on the same principles and model as the Rind program. Rind analyzes the amino acid usage in a protein alignment one column at a time, using a likelihood model on an evolutionary tree. Rind2 does essentially the same thing, except that it considers two columns at a time. Using this program to predict contacts also requires a way of computing statistical significance, which we have implemented with Dr. Aaron Halpern of the University of New Mexico.

We submitted predictions from Rind2 into the CASP3 experiment last year for target 62. Unfortunately, the coordinates for target 62 were not ready in time, and it was dropped from the experiment. Thus, in the absence of true prediction results, we can only discuss some very promising "post-diction" results. Our best results are for the SH3 domain, where we predicted 11 contact pairs, 9 of which have van der Waals surfaces that come within 1.1 Angstroms of each other. The other 2 pairs have highly variable distances depending on which structure one uses, and one of them forms a van der Waals contact in one NMR structure. We have also had success with much longer alignments, including the tryptophan synthase proteins A and B in one long alignment of 633 residues. In this case, no contact pairs can be predicted with 95 percent confidence. However, of the 10 highest-scoring pairs, 5 are in direct contact. With careful use of statistics, we believe these pairs can be very useful.
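The idea of scoring one alignment column with a likelihood model on an evolutionary tree, described above for Rind, can be sketched with the standard pruning recursion. Everything specific in this sketch is an assumption made for illustration and is not taken from Rind or Rind2: the equal-rates substitution model over 20 amino acids, the uniform root distribution, the nested-list tree encoding, and all function names.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
K = len(ALPHABET)

def transition_prob(a, b, t):
    """P(residue b after branch length t | residue a) under an equal-rates model.

    Assumption: all replacements are equally likely and t is measured in
    expected substitutions per site.
    """
    decay = math.exp(-K * t / (K - 1))
    if a == b:
        return 1.0 / K + (K - 1) / K * decay
    return (1.0 - decay) / K

def conditional_likelihoods(node, column):
    """Return {a: P(observed leaves below node | node has residue a)}.

    A leaf is a sequence name (str); an internal node is a list of
    (child, branch_length) pairs. `column` maps sequence name -> residue.
    """
    if isinstance(node, str):
        return {a: 1.0 if a == column[node] else 0.0 for a in ALPHABET}
    result = {a: 1.0 for a in ALPHABET}
    for child, t in node:
        below = conditional_likelihoods(child, column)
        for a in ALPHABET:
            # sum over the unknown residue at the child end of the branch
            result[a] *= sum(transition_prob(a, b, t) * below[b] for b in ALPHABET)
    return result

def column_likelihood(tree, column):
    """Likelihood of one alignment column, with a uniform root distribution."""
    root = conditional_likelihoods(tree, column)
    return sum(root[a] for a in ALPHABET) / K

# Hypothetical three-sequence tree: ((s1:0.1, s2:0.1):0.2, s3:0.3)
tree = [([("s1", 0.1), ("s2", 0.1)], 0.2), ("s3", 0.3)]
conserved = {"s1": "W", "s2": "W", "s3": "W"}
variable = {"s1": "W", "s2": "F", "s3": "K"}
print(column_likelihood(tree, conserved))
print(column_likelihood(tree, variable))
```

A two-column analysis of the kind described for Rind2 could, in principle, run the same recursion over pairs of residues instead of single residues. That extension is not shown, and how Rind2 actually parameterizes it is not specified here.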
We also have plans to improve the sensitivity of the method using additional sources of information, which could help us determine ahead of time which 5 of the 10 are the real contacts.

Our work on this problem builds on our experience with the original Rind, in addition to our work on constructing phylogenetic trees. We have found that the widely used Neighbor Joining (NJ) program is not suitable for building the trees we need for our analyses when long branches are present. This prompted us to develop Weighted Neighbor Joining, called Weighbor for short. Weighbor correctly deals with the large variances associated with long evolutionary distances, and creates trees with much higher likelihoods, less bias, and better resolving power than NJ. Running the Rind programs with Weighbor gives them better convergence properties and better results. Weighbor has also attracted the interest of evolutionary biologists, helping us to recruit David Pollock, who now holds a LANL Director's Fellow Postdoc in our lab. Dr. Pollock has published papers showing the importance of bona fide evolutionary modeling in using covariation to predict protein structure, and recently published a likelihood ratio test for this problem in JMB.

Essential to the task of detecting covariation is having a good multiple alignment. Creating the multiple alignment is perhaps the most critical step, because existing automated methods often work so poorly that substantial manual intervention is needed. Dr. Ian Holmes, who got his Ph.D. with Richard Durbin at the Sanger Center, is working in our lab as a Fulbright Scholar. He is applying his expertise in Hidden Markov Models to the multiple sequence alignment problem. This may ultimately lead to a fully automated method, but in the meantime it should at least offer a rigorous method for identifying portions of a multiple alignment that are unreliable and should be improved or discarded.

Another key person on the project will be Charlie Strauss of CST-division.
Dr. Strauss will return shortly from his sabbatical with the Baker group in Seattle, where he learned to use their ab initio prediction program and has been investigating the effect of adding tertiary contacts, including those predicted by Rind2. His preliminary results are very encouraging; they indicate that even if some of the predicted contacts are incorrect, the correct contacts do help select the correct structure from the other candidates. David Baker has also expressed interest in our contact predictions, and we anticipate formation of a collaboration.

An interesting question is whether the contact predictions alone, combined with basic notions like hard sphere repulsion, can sometimes be sufficient to fold a protein. Dr. Roger Sayle, formerly of Glaxo and best known as the author of the Rasmol program, now works for a small company called Metaphorics and is interested in collaborating on this question. Specifically, he believes that recent results in distance geometry suggest efficient ways of converting predicted contacts into 3D structures. This approach has the advantage that one can start with the most basic, fundamental constraints, and gradually add more information as needed. This might be important for determining structures of proteins that are unusual compared to most of the PDB, such as membrane proteins.

To summarize, the main objective of this project is to predict contacts useful for structure prediction, where useful may be defined by the CASP experiment. The essential difficulties, namely the evolutionary non-independence of the sequences and the statistics associated with the problem, have already been overcome. Making the method more automated and reliable will require work on multiple sequence alignment and implementation of better prior statistics. As we find more examples of true structure-induced covariation, we can look for examples that fit the established pattern, thereby increasing the sensitivity.
We also intend to improve the speed of the method. The avenues we are exploring include porting the code to parallel machines and developing a new method to incorporate importance sampling into our Monte Carlo statistics program.
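As a generic illustration of why importance sampling can speed up Monte Carlo significance estimates, the sketch below estimates a small tail probability of a standard normal variable in two ways. It is a textbook example under stated assumptions, not the statistics program mentioned above: the Gaussian target, the shifted Gaussian proposal, the threshold, and the function names are all chosen for illustration.

```python
import math
import random

def tail_prob_naive(threshold, n_samples, rng):
    """Plain Monte Carlo: fraction of N(0,1) draws that exceed the threshold."""
    hits = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n_samples

def tail_prob_importance(threshold, n_samples, rng):
    """Importance sampling: draw from N(threshold, 1) and reweight each draw.

    The weight is the density ratio N(0,1) / N(threshold,1) evaluated at x,
    which simplifies to exp(-threshold * x + threshold**2 / 2).
    """
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples

def tail_prob_exact(threshold):
    """Exact upper-tail probability of N(0,1), for comparison."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
print("exact     :", tail_prob_exact(4.0))
print("naive     :", tail_prob_naive(4.0, n, rng))
print("importance:", tail_prob_importance(4.0, n, rng))
```

With a tail probability of roughly 3 in 100,000, a plain Monte Carlo run of 10,000 samples usually sees no exceedances at all. The reweighted estimator places about half of its samples in the region of interest and so reaches a usable estimate with the same number of draws.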