
In Pursuit of an Accurate Phylogenetic Framework for Biology

Ross Overbeek
Mathematics and Computer Science Division
Argonne National Laboratory

The central role played by evolution in analysis of biological systems is well understood. The growing significance of molecular phylogeny is, perhaps, less well understood.

The seminal work of Carl Woese in developing the technology to support molecular phylogeny resulted in the Ribosomal Database Project (RDP), an effort that has produced (among other things) large and growing alignments of rRNA sequences. By aligning the corresponding sequences from a diverse set of organisms, the curators of the RDP have created a tool for exploring phylogenetic relationships with far more accuracy than had previously been achieved. This tool has already revolutionized our understanding of microbial phylogeny, which will play a central role in our assessments of biodiversity, our unraveling of the historical development of life forms, and our attempts to reach a deeper understanding of biological systems through analysis of newly available genetic sequence data.

Our project initially involved use of the Intel Delta, through Argonne's participation in the Concurrent Supercomputing Consortium; more recently, we have been using the IBM SP to generate a phylogenetic tree based on the RDP data. We performed the work in close collaboration with members of the RDP; Gary Olsen of the University of Illinois at Urbana directed the effort.

Using an implementation of a maximum-likelihood method, we generated a tree containing over 3500 "leaves" (each of which represents a distinct organism). Maximum likelihood was already recognized as a relatively accurate method for computing such trees, but its computational cost had led other researchers to restrict its use to small trees. Indeed, it appears that we were the first to employ it in constructing trees with more than 20 or so leaves.
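To give a rough sense of why the cost confined earlier work to small trees, it helps to count the candidate trees. The number of distinct unrooted binary tree topologies on n labeled leaves is (2n-5)!!, a standard combinatorial result. The sketch below computes that count; the function name is illustrative only, and the code is not part of the maximum-likelihood implementation used in this project.

```python
def count_unrooted_topologies(n_leaves: int) -> int:
    """Return the number of unrooted binary trees on n labeled leaves, (2n-5)!!."""
    if n_leaves < 3:
        raise ValueError("an unrooted binary tree needs at least 3 leaves")
    count = 1  # exactly one topology exists for 3 leaves
    # The k-th leaf can be attached to any of the 2k-5 branches
    # of a tree that already holds k-1 leaves.
    for k in range(4, n_leaves + 1):
        count *= 2 * k - 5
    return count


for n in (5, 10, 20, 50):
    print(n, count_unrooted_topologies(n))
```

At 20 leaves the count already exceeds 10^20, so exhaustive evaluation is out of reach and any practical method has to search heuristically, with each candidate tree requiring its own likelihood computation.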

Our effort required the execution of thousands of jobs, many of which ran for several days on distinct processors. We had to develop techniques for computing trees of about 50-80 organisms and then merging the output of all of these runs into a single coherent picture. The resulting tree is widely distributed by the RDP and has become a basic reference for researchers in phylogeny, taxonomy, ecological diversity, medicine, and other areas within biology. It will form the framework for analysis of the evolution of protein families, as well as for the more general task of comprehending the implications of the wealth of data now being produced by numerous DNA sequencing efforts.
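The text does not describe how the organisms were divided among jobs or how the small trees were merged. Purely as an illustration of the bookkeeping involved, the sketch below assumes one simple scheme: every job receives a shared set of reference organisms (a hypothetical "backbone") plus a slice of the remaining organisms, so that each small tree has leaves in common with the others. The function name, the backbone idea, and the sizes are all assumptions, not the project's actual technique.

```python
def make_subsets(organisms, backbone, max_size=80):
    """Split organisms into job-sized subsets that all contain the backbone.

    Hypothetical scheme for illustration; not the method used by the RDP work.
    """
    backbone_set = set(backbone)
    rest = [o for o in organisms if o not in backbone_set]
    chunk = max_size - len(backbone)  # non-backbone organisms per job
    if chunk <= 0:
        raise ValueError("backbone must be smaller than max_size")
    return [list(backbone) + rest[i:i + chunk]
            for i in range(0, len(rest), chunk)]


# Made-up identifiers standing in for 3500 aligned sequences.
organisms = [f"org{i:04d}" for i in range(3500)]
backbone = organisms[:20]  # assumed shared reference organisms
subsets = make_subsets(organisms, backbone, max_size=80)
print(len(subsets), "jobs of up to", max(len(s) for s in subsets), "organisms")
```

Each subset would then be handed to an independent tree-building job. The shared leaves give a later merging step common reference points, which is one plausible reason to overlap the subsets.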

This effort is ongoing in the sense that more relevant data arrive each month. Indeed, our first tree (produced about two years ago) contained fewer than 500 organisms. We now have over 4000 distinct aligned sequences, and it appears that 50,000-100,000 new sequences will be produced during the next 3-4 years.

The computation required to integrate the new data and refine the growing tree increases each year. Given how widely such a tree is used, it is reasonable to emphasize the need to continue this effort and to acknowledge how much it has depended on the massive computational resources that we have employed to date.

Last updated on January 20, 2000