
Reconstruction of DNA of Extinct Species

Long-Term Vision

The genomes of contemporary species encode a great deal of information about extinct species. For example, the genomes of a pair of primate species clearly give information about their common ancestor. Thus, looking at a large set of current species gives a great deal of information about extinct species, in some cases perhaps enough to reconstruct past species (shades of Jurassic Park).

Independent of this extreme possibility, constructing the genome of extinct species would revolutionize paleontology, in much the same way that computation is in the process of revolutionizing molecular biology.

State of the Art

The main tool used for reconstructing phylogenetic trees is the "maximum likelihood" algorithm. This algorithm constructs the "optimal" phylogenetic tree connecting n contemporary species, and simultaneously produces the gene sequence at each node of this tree. The sequences at the internal nodes are the genes of the extinct ancestors of present species. They may also be the ancestors of other species, now extinct.
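To make the likelihood idea concrete, here is a minimal sketch of how the likelihood of one fixed tree can be scored. It is not the program used in the test described here. It assumes the Jukes-Cantor substitution model and Felsenstein's pruning recursion, both standard choices that this text does not name. The tuple encoding of the tree, the branch lengths, the toy sequences, and all function names are hypothetical. A full maximum likelihood search would also vary the tree shape and branch lengths, which this sketch does not attempt.

```python
import math

BASES = "ACGT"

def jc69_prob(same, t):
    """Jukes-Cantor probability of ending in the same (or one given different)
    base after a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if same else 0.25 - 0.25 * e

def partial_likelihoods(node, site):
    """Probability of the leaf data below `node`, for each possible base at `node`.

    A leaf is a string (its sequence). An internal node is a tuple
    (left_child, left_branch_length, right_child, right_branch_length).
    """
    if isinstance(node, str):
        return [1.0 if node[site] == b else 0.0 for b in BASES]
    left, t_left, right, t_right = node
    pl = partial_likelihoods(left, site)
    pr = partial_likelihoods(right, site)
    out = []
    for x in BASES:
        sum_left = sum(jc69_prob(x == y, t_left) * pl[j] for j, y in enumerate(BASES))
        sum_right = sum(jc69_prob(x == y, t_right) * pr[j] for j, y in enumerate(BASES))
        out.append(sum_left * sum_right)
    return out

def log_likelihood(tree, n_sites):
    """Log-likelihood of the tree, treating sites as independent
    and using a uniform base distribution at the root."""
    total = 0.0
    for site in range(n_sites):
        total += math.log(sum(0.25 * p for p in partial_likelihoods(tree, site)))
    return total

def most_likely_root_base(tree, site):
    """Base at the root with the largest partial likelihood for one site."""
    parts = partial_likelihoods(tree, site)
    return BASES[max(range(4), key=lambda i: parts[i])]

# Toy example: three contemporary sequences of four bases each.
tree = (("ACGT", 0.1, "ACGA", 0.1), 0.05, "ACTT", 0.2)
print(log_likelihood(tree, 4))
print("".join(most_likely_root_base(tree, s) for s in range(4)))
```

The second print gives a simple per-site guess at the root sequence, which illustrates in miniature how sequences at internal nodes fall out of the same computation that scores the tree.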

It is estimated that by the year 2010 we will have reconstructed the complete genomes of not only humans but perhaps a million contemporary species. This database can then be used to construct a huge phylogenetic tree and hence the genomes of vast numbers of extinct species. But this possibility depends on the availability of large amounts of computer power.

The running time of the maximum likelihood algorithm goes as [formula missing in source], where n is the number of contemporary species and m is the size of the genome in question. In a recent test, a phylogenetic tree for a gene sequence of about 2000 base pairs, for 2000 species, was computed on a 5 gigaFLOPS SP2 in about two hours. Thus, with a peta-ops architecture, in a few hours one could compute either (i) the complete phylogenetic tree for [number missing in source] species, based on a gene sequence of about 1000 base pairs, or (ii) the phylogenetic tree for perhaps 1000 species, based on their entire genome. The second estimate is valid for complex animals, like mammals, with genomes on the order of 10^9 base pairs.
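The machine figures quoted above can be turned into a rough operation budget. The sketch below uses only the quoted numbers (5 gigaFLOPS, about two hours) plus three labeled assumptions: that the SP2 ran at its quoted rate throughout, that "peta-ops" means 10^15 operations per second comparable to FLOPS, and that "a few hours" means three. It does not use the running-time formula, so it says nothing about how the budget divides between n and m.

```python
GIGA = 1e9
PETA = 1e15

SP2_RATE = 5 * GIGA        # quoted: 5 gigaFLOPS SP2
TEST_SECONDS = 2 * 3600    # quoted: about two hours
ASSUMED_PETA_HOURS = 3     # assumption: "a few hours" taken as 3

# Work done in the quoted test, assuming the quoted rate was sustained.
test_ops = SP2_RATE * TEST_SECONDS

# Raw speed ratio between a peta-ops machine and the SP2.
speedup = PETA / SP2_RATE

# Time the quoted test would take at 10^15 operations per second.
test_seconds_at_peta = test_ops / PETA

# Operations available in the assumed few-hour peta-ops run,
# and how many times larger that is than the quoted test.
peta_budget_ops = PETA * ASSUMED_PETA_HOURS * 3600
budget_ratio = peta_budget_ops / test_ops

print(f"test workload:       {test_ops:.2e} operations")
print(f"peta-ops speedup:    {speedup:.0e}x")
print(f"test at peta-ops:    {test_seconds_at_peta:.3f} s")
print(f"few-hour budget:     {peta_budget_ops:.2e} operations")
print(f"budget / test ratio: {budget_ratio:.0e}")
```

Under these assumptions the peta-ops run has several hundred thousand times the work of the quoted test to spend on more species, longer sequences, or both.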

Technical Barriers

Computing the optimal phylogenetic tree based on the entire genome of [number missing in source] species will remain intractable even with peta-ops architectures.

On the other hand, storage requirements for this class of algorithm are not large. One needs to store not only the initial data but also the genome at each intermediate node of the phylogenetic tree. However, a tree in which every intermediate node has at least two children has fewer intermediate nodes than leaves, so the total amount of storage needed is proportional to the amount of input data. In either of the above cases, this is at most a few terabytes:

[number missing in source] species * 1000 base pairs -> a few gigabytes

1000 species * 10^9 base pairs -> a few terabytes
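The two storage estimates above can be checked with a short calculation. This is a sketch under stated assumptions, not the authors' own arithmetic: it assumes a rooted binary tree (n leaves and n - 1 intermediate nodes), one byte per base pair, a species count of 10^6 for the first case (borrowed from the "perhaps a million contemporary species" mentioned earlier, since the number is missing from the line itself), and a genome of 10^9 base pairs for the second case. The function name and constants are hypothetical.

```python
def tree_storage_bytes(n_species, base_pairs, bytes_per_base=1):
    """Bytes needed to hold one sequence at every node of a rooted binary tree.

    Assumes n_species leaves, n_species - 1 intermediate nodes, and a
    sequence of `base_pairs` bases at each node.
    """
    internal_nodes = n_species - 1
    total_nodes = n_species + internal_nodes
    return total_nodes * base_pairs * bytes_per_base

ASSUMED_MANY_SPECIES = 10**6   # assumption: "perhaps a million" species
ASSUMED_GENOME_BP = 10**9      # assumption: mammal-sized genome

# Case 1: many species, short gene sequence.
case_short_gene = tree_storage_bytes(ASSUMED_MANY_SPECIES, 1000)

# Case 2: 1000 species, entire genomes.
case_whole_genome = tree_storage_bytes(1000, ASSUMED_GENOME_BP)

print(f"short gene:   {case_short_gene / 1e9:.1f} GB")
print(f"whole genome: {case_whole_genome / 1e12:.1f} TB")
```

Because the intermediate nodes at most double the node count, both totals stay within a factor of two of the raw input size, which is the point the paragraph above makes. A denser encoding (for example two bits per base) would shrink both figures by a constant factor.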


Last updated on June 26, 1998