Phylogenetic Trees

Gary Olsen, along with a group at ANL and Hideo Matsuda of Kobe University, created a fast implementation of the maximum likelihood algorithm for constructing phylogenetic trees from an alignment of sequence data. This program, called fastDNAml, runs on a wide class of uniprocessors, on networks of workstations, and on several massively parallel systems (most notably, the DELTA). The application uses automatic load balancing and very large-grain parallelism, so the efficiency of the message-passing system is not significant. It currently does not use any of the switch-based transport mechanisms. This application consumed by far the largest number of hours on the SP1 during its first few months at Argonne.

The Ribosomal Database Project at the University of Illinois at Urbana-Champaign [Olsen91] distributes alignments of small subunit ribosomal RNA (rRNA) sequences from both prokaryotic microorganisms and eukaryotes. In order to better understand the organisms themselves and the evolution and function of the ribosome, we wished to infer a consistent, high-quality phylogenetic tree relating all of these sequences. Such a phylogenetic tree plays a fundamental role in supporting interpretation of molecular sequence data. Phylogenetic tree inference based on maximum likelihood is appealing from both biological and statistical perspectives. Felsenstein [Felsenstein90] has written a computer program that implements such a method. However, maximum likelihood is computationally demanding, so the program had been used only for inferring relationships among small numbers of sequences (up to about 20). Analyses indicated that execution time (for small trees) rises as the third power of the number of sequences.

Gary Olsen and Carl Woese of the Ribosomal Database Project have been creating an alignment of the rRNA from the small subunit of the ribosome. This alignment has become one of the fundamental tools for phylogenetic research. Working closely with Olsen and with Woese (co-leader of the Ribosomal Database Project), we are trying to construct a phylogenetic tree for all of the organisms included in the current alignments. Currently, this number is approaching 2000. Our research is focusing on a number of critical issues:

  1. Phylogenetic computations are sensitive to the rate of change in specific columns. Gary Olsen has developed a set of tools to reflect varying rates of change. We first need to verify that these tools do, in fact, improve the sensitivity of a standard maximum likelihood computation. The task of gathering data to evaluate this question has just been completed on the SP1 at Argonne. The computation required execution of over 450 runs, each of which consumed between 12 and 36 hours on a single node. We are now evaluating these data and plan to write up the results as soon as possible. Our intent is to establish the basic utility of the approach that Olsen has implemented in his fastDNAml tools.
  2. We are also establishing a basic algorithm for creating very large trees. It is not practical to simply make one huge run and produce a reliable tree. Rather, the tree must be computed in steps. First, we created a starting tree composed of 473 organisms. We then developed a tool for sequentially inserting new sequences into the tree. This produces an ``initial tree'' that must then be ``optimized'' by performing thousands of local maximum likelihood computations (which can produce local rearrangements within the tree).
  3. Finally, the tree produced by maximum likelihood is subjected to critical analysis by other experts in the field, specific questionable areas are isolated, and a detailed analysis is done of these locations. This analysis is done using a variety of techniques and can be used to establish the limitations and advantages of the maximum likelihood tool that is in development.
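
The stepwise procedure in item 2 (insert sequences one at a time, then improve the tree by local rearrangements) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the fastDNAml implementation: trees are nested Python tuples, the score is a Fitch parsimony count used as a cheap stand-in for the maximum likelihood score, and all function names and the toy sequences are hypothetical.

```python
def fitch(tree, seqs):
    """Return (per-column state sets, substitution count) for a nested-tuple tree."""
    if isinstance(tree, str):                      # a leaf is a sequence name
        return [{c} for c in seqs[tree]], 0
    left, right = tree
    lsets, lcost = fitch(left, seqs)
    rsets, rcost = fitch(right, seqs)
    sets, cost = [], lcost + rcost
    for a, b in zip(lsets, rsets):
        common = a & b
        if common:
            sets.append(common)
        else:                                      # disagreement costs one change
            sets.append(a | b)
            cost += 1
    return sets, cost

def score(tree, seqs):
    """Stand-in objective (lower is better); NOT a likelihood."""
    return fitch(tree, seqs)[1]

def insertions(tree, leaf):
    """Yield every tree obtained by attaching leaf to one edge of tree."""
    yield (tree, leaf)
    if not isinstance(tree, str):
        left, right = tree
        for t in insertions(left, leaf):
            yield (t, right)
        for t in insertions(right, leaf):
            yield (left, t)

def neighbors(tree):
    """Yield trees that differ from tree by one local subtree swap."""
    if isinstance(tree, str):
        return
    left, right = tree
    if not isinstance(left, str):
        a, b = left
        yield ((a, right), b)
        yield ((right, b), a)
    if not isinstance(right, str):
        c, d = right
        yield ((left, d), c)
        yield ((c, left), d)
    for t in neighbors(left):
        yield (t, right)
    for t in neighbors(right):
        yield (left, t)

def optimize(tree, seqs):
    """Accept improving local rearrangements until none is left."""
    best, best_score = tree, score(tree, seqs)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(best):
            s = score(cand, seqs)
            if s < best_score:
                best, best_score, improved = cand, s, True
                break
    return best

def build(seqs, order):
    """Insert sequences in the given order, re-optimizing after each insertion."""
    tree = (order[0], order[1])
    for name in order[2:]:
        tree = min(insertions(tree, name), key=lambda t: score(t, seqs))
        tree = optimize(tree, seqs)
    return tree

# Toy data (hypothetical): A and B are similar, C and D are similar.
seqs = {"A": "AAAA", "B": "AAAC", "C": "GGGG", "D": "GGGT"}
tree = build(seqs, ["A", "C", "B", "D"])
print(tree, score(tree, seqs))
```

In a real run each call to the scoring function would be a full maximum likelihood evaluation, which is why the number of candidate placements and rearrangements dominates the cost.
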
We developed both sequential and parallel versions of the fastDNAml package; the sequential version is now being distributed through the RDP server, and we have helped a limited set of institutions install the parallel version (which is based on the p4 package of routines for writing portable parallel programs [p4manual]). The phylogenetic tree that we are developing is of major scientific interest. It is the culmination of decades of work to develop the technology, gather the rRNA sequences, and carefully align them for analysis. If it can be established that the maximum likelihood approach is actually more accurate, then bringing the required computational resources to bear will produce a clear instance in which massive computational resources do advance a fundamental area of science.
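
The large-grain, self-balancing style of parallelism described for fastDNAml can be pictured as a pool of workers that each take the next whole job (for example, scoring one candidate tree) as soon as they become idle. The sketch below is a minimal illustration of that scheduling pattern only. It uses Python's standard thread pool rather than p4, the `evaluate` and `run_pool` names and the task list are hypothetical, and Python threads do not provide true CPU parallelism.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def evaluate(task):
    """Stand-in for one large-grain job, e.g. scoring one candidate tree."""
    name, work = task
    total = sum(i * i for i in range(work))   # cost varies from task to task
    return name, total

def run_pool(tasks, workers=4):
    """Queue all tasks; each idle worker takes the next one until none remain."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(evaluate, t) for t in tasks]
        for fut in as_completed(futures):     # collect in completion order
            name, value = fut.result()
            results[name] = value
    return results

# Hypothetical workload: eight jobs of unequal size.
tasks = [(f"tree{i}", 1000 * (i + 1)) for i in range(8)]
results = run_pool(tasks)
best = min(results, key=results.get)
print(best, results[best])
```

Because each job is large compared with the cost of handing it out, the time spent on communication is a small fraction of the total, which is consistent with the remark that message-passing efficiency matters little for this application.
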