Gary Olsen, along with a group at ANL and Hideo Matsuda of Kobe
University, created a fast implementation of the maximum
likelihood algorithm for constructing phylogenetic trees from an
alignment of sequence data. This program, called fastDNAml, runs
on a wide class of uniprocessors, on networks of workstations, and on
several of the massively parallel systems (most notably, the DELTA).
The application uses automatic load balancing and
very large-grain parallelism, so the efficiency of the message-passing
system is not significant. It currently does not use any of the switch-based
transport mechanisms. This application consumed by far the largest number
of hours on the SP1 during its first few months at Argonne.
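As a minimal sketch of why large-grain tasks with automatic load balancing make communication cost unimportant, consider the following. It is not the fastDNAml code: a Python thread pool stands in for the message-passing layer, and `evaluate_candidate`, its scoring rule, and `best_candidate` are hypothetical names introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_candidate(candidate_id):
    # Hypothetical stand-in for one large-grain task, such as scoring one
    # candidate tree. A real task would run for a long time, so the cost of
    # handing it to a worker is negligible by comparison.
    score = -abs(candidate_id - 13)
    return candidate_id, score


def best_candidate(candidate_ids, workers=4):
    # Idle workers pull the next task from the executor's shared queue,
    # so the load balances itself without a fixed task-to-worker assignment.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate_candidate, candidate_ids))
    # Keep the (id, score) pair with the highest score.
    return max(results, key=lambda pair: pair[1])
```

Because each task is independent and returns only a small result, the only communication is one request and one reply per task.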
%
The Ribosomal Database Project at the University of Illinois at
Urbana-Champaign [#Olsen91##1#] distributes alignments of small
subunit ribosomal RNA (rRNA) sequences from both prokaryotic
microorganisms and eukaryotes. In order to better understand the
organisms themselves and the evolution and function of the ribosome,
we wished to infer a consistent, high-quality phylogenetic tree
relating all of these sequences. Such a phylogenetic tree plays a
fundamental role in supporting interpretation of molecular sequence
data.
Phylogenetic tree inference based on maximum likelihood is appealing from both
biological and statistical perspectives. Felsenstein [#Felsenstein90##1#] has
written a
computer program that implements such a method. However, maximum likelihood
is computationally demanding, so the program had been used only for inferring
relationships among small numbers of sequences (up to about 20). Analyses
indicated that execution time (for small trees) rises as the third power of
the number of sequences.
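To make the cubic growth concrete, the following sketch computes the cost of an analysis relative to a 20-sequence baseline. The function name `relative_cost` and the choice of baseline are our own assumptions, and the cubic behavior was observed only for small trees, so the figure for large inputs is a rough extrapolation rather than a measurement.

```python
def relative_cost(n_sequences, baseline=20, exponent=3):
    # Cost relative to a run on `baseline` sequences, assuming execution
    # time grows as n_sequences ** exponent (observed for small trees only).
    return (n_sequences / baseline) ** exponent


# Doubling the number of sequences multiplies the cost by 2**3.
print(relative_cost(40))
# Extrapolating to 2000 sequences, purely as an illustration.
print(relative_cost(2000))
```

Under this assumption, a hundredfold increase in the number of sequences corresponds to a millionfold increase in execution time, which is why a direct run on the full alignment is not an option.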
Gary Olsen and Carl Woese of the Ribosomal Database Project at the
University of Illinois at Urbana-Champaign have been creating an alignment of
the rRNA from the small subunit of the ribosome. This alignment has
become one of the fundamental tools for phylogenetic research.
Working closely with Olsen and Woese (co-leader of the
Ribosomal Database Project), we are trying to construct a phylogenetic tree
for all of the organisms included in the current alignments,
a number that is currently approaching 2000. Our research is focusing
on a number of critical issues:
- Phylogenetic computations are sensitive to the rate of
change in specific columns. Gary Olsen has developed a set
of tools to reflect varying rates of change. We first need to
verify that these tools do, in fact, improve the
sensitivity of a standard maximum likelihood computation.
The task of gathering data to evaluate this question has
just been completed on the SP1 at Argonne. The computation
required execution of over 450 runs, each of which consumed
between 12 and 36 hours on a single node. We are now
evaluating these data and plan to write up the results as
soon as possible. Our intent is to
establish the basic utility of the approach that Olsen has
implemented in his fastDNAml tools.
- We are also establishing a basic algorithm for creating
very large trees. It is not practical to produce a reliable
tree in a single huge run. Rather, the
tree must be computed in steps. First, we created an
initial tree composed of 473 organisms. We then developed
a tool for sequentially inserting new sequences into the
tree. This produces an ``initial tree'' that must then be
``optimized'' by performing thousands of local maximum
likelihood computations (which can produce local
rearrangements within the tree).
- Finally, the tree produced by maximum likelihood is
subjected to critical analysis by other experts in the
field, specific questionable areas are isolated, and a
detailed analysis is done of these locations. This
analysis is done using a variety of techniques and can be
used to establish the limitations and advantages of the
maximum likelihood tool that is in development.
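The sequential insertion step described in the second item above can be sketched as follows. This is an illustration under stated assumptions, not the tool itself: the tree is an unrooted tree stored as a plain edge list, every existing branch is tried as an attachment point (the text does not say which placements the tool considers), and `score` is a hypothetical callable standing in for a maximum likelihood evaluation. The later optimization by local rearrangements is not shown.

```python
import itertools


def insert_leaf(edges, edge, leaf, new_node):
    # Attach `leaf` in the middle of `edge`: the edge (u, v) is replaced by
    # (u, new_node), (new_node, v), and (new_node, leaf).
    u, v = edge
    rest = [e for e in edges if e != edge]
    return rest + [(u, new_node), (new_node, v), (new_node, leaf)]


def stepwise_insert(edges, new_leaves, score):
    # Add leaves one at a time, keeping the best-scoring placement each time.
    # `score` (edge list -> number, higher is better) is a hypothetical
    # placeholder for a likelihood computation.
    counter = itertools.count()
    for leaf in new_leaves:
        # Internal node names "n0", "n1", ... are assumed not to clash
        # with names already present in the tree.
        new_node = f"n{next(counter)}"
        candidates = [insert_leaf(edges, e, leaf, new_node) for e in edges]
        edges = max(candidates, key=score)
    return edges


# Start from a three-leaf star and add two leaves with a constant score,
# so the first candidate placement is kept at every step.
start = [("c", "A"), ("c", "B"), ("c", "C")]
tree = stepwise_insert(start, ["D", "E"], score=lambda es: 0)
print(len(tree))  # an unrooted binary tree with 5 leaves has 2*5 - 3 branches
```

Each insertion adds two branches, so the number of candidate placements grows with the tree. This is one reason the resulting tree is treated only as a starting point for further optimization.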
We developed both sequential and parallel versions of the
fastDNAml package; the sequential version is now being distributed
through the RDP server, and we have helped a limited set of
institutions install the parallel version (which is based on the <#229#> p4<#229#>
package of routines for writing portable parallel programs [#p4manual##1#]).
The phylogenetic tree that we are developing is of major scientific
interest. It is the culmination of decades of work to develop the
technology, gather the rRNA sequences, and carefully align them for
analysis. If it can be established that the maximum likelihood
approach is actually more accurate, then bringing the required
computational resources to bear will produce a clear instance in which
massive computational resources advance a fundamental area of
science.
%