Plant Physiol. 2003 August; 132(4): 1790–1800. doi: 10.1104/pp.103.022509. PMCID: PMC526274
Copyright © 2003, The American Society for Plant Biologists

The Role of Phylogenetics in Comparative Genetics

Douglas E. Soltis* and Pamela S. Soltis

Department of Botany and the Genetics Institute, University of Florida,
Gainesville, Florida 32611; and Florida Museum of Natural History and the
Genetics Institute, University of Florida, Gainesville, Florida 32611

Received February 27, 2003; Revised March 30, 2003; Accepted May 12, 2003.

WHY PHYLOGENY MATTERS

Many biologists agree that a phylogenetic tree of relationships should be
the central underpinning of research in many areas of biology. Comparisons of
plant species or gene sequences in a phylogenetic context can provide the most
meaningful insights into biology. This important realization is now apparent
to researchers in diverse fields, including ecology, molecular biology, and
physiology (see recent papers in Plant Physiology, e.g.
Hall et al., 2002a; Doyle et
al., 2003). Examples of the importance of a phylogenetic framework to diverse
areas of plant research abound (for review, see
Soltis and Soltis, 2000;
Daly et al., 2001). One
obvious example is the value of placing model organisms in the appropriate
phylogenetic context to obtain a better understanding of both patterns and
processes of evolution. The fact that tomato (Lycopersicon
esculentum) and other species of this small genus actually are embedded
within a well-marked subclade of Solanum (and, hence, are more
appropriately referred to as species of Solanum; tomato has been
renamed Solanum lycopersicon; e.g.
Spooner et al., 1993;
Olmstead et al., 1999) is a
powerful statement that is important to geneticists, molecular biologists, and
plant breeders in that it points to a few close relatives of S.
lycopersicon (out of a genus of several hundred species) as focal points
for comparative genetic/genomic research and for crop improvement. Snapdragon
(Antirrhinum majus) was historically part of a broadly defined
Scrophulariaceae, a family that is now known to be grossly polyphyletic (i.e.
not a single clade). Phylogenetic studies indicate that Scrophulariaceae
should be broken up into several families
(Olmstead et al., 2001), and
snapdragon and its closest relatives are part of a clade recognized as the
family Plantaginaceae. A phylogenetic framework has revealed the patterns of evolution of many
morphological and chemical characters, including complex pathways such as
nitrogen-fixing symbioses, mustard oil production, and chemical defense
mechanisms (for review, see Soltis and
Soltis, 2000; Daly et al.,
2001). However, the importance of phylogeny reconstruction applies
not only to the organisms that house genes but also to the evolutionary
history of the genes themselves. For example, are the genes under
investigation the members of a single well-defined clade, all members of which
appear to descend from a recent common ancestor as a direct result of
speciation (orthologous genes), or do the sequences represent one or more
ancient duplications (paralogous genes; see also
Doyle and Gaut, 2000)? Gene
families are, of course, the norm in studies of nuclear genes, but
investigators are often bewildered by the diversity of genes encountered in a
survey of a family of genes from a diverse array of plants. Phylogenetic
methodology offers several solutions by permitting inferences of putative
orthology among a set of sequences. Examples of the phylogenetic analysis of gene families abound (e.g. genes
encoding: heat shock proteins, Waters and
Vierling, 1999; phytochrome,
Kolukisaoglu et al., 1995;
Mathews and Sharrock, 1997;
and actin, McDowell et al.,
1996). A noteworthy recent example involves MADS box genes, which
encode transcription factors that control diverse developmental processes in
plants. Some of the best known examples of MADS box genes include the A, B,
and C class floral genes that control the identity of floral organs (for
review, see Ma and dePamphilis,
2000). Phylogenetic analyses indicate that a minimum of seven
different MADS box gene lineages were already present in the common ancestor
of extant seed plants approximately 300 million years ago (mya;
Becker et al., 2000). Thus, a
diverse tool kit of MADS box genes was available before the origin of the
angiosperms. A phylogenetic perspective also provides the basis for comparative genomics
(e.g. Soltis and Soltis, 2000;
Walbot, 2000;
Daly et al., 2001;
Kellogg, 2001;
Hall et al., 2002a;
Mitchell-Olds and Clauss,
2002; Pryer et al.,
2002; Doyle and Luckow,
2003). However, obtaining the appropriate phylogenetic perspective
may be difficult: What phylogenetic hypotheses are already available for the
group of interest? Are phylogenetic studies underway on a particular group,
and is it possible to obtain unpublished trees? Is the phylogenetic
underpinning for a lineage of interest sound enough for use in comparative
genetic/genomic analyses? Not all phylogenetic trees are of equal quality, and
the most fruitful phylogenomic comparisons will be those based on the
strongest phylogenetic inferences. We cannot address all of the crucial issues relating to the importance of
phylogeny in a comprehensive fashion and, therefore, will focus on a few main
topics. We provide: (a) phylogenetic summaries and references for major clades
of land plants, with an emphasis on angiosperm model systems; (b) a
“primer” of phylogenetic methods, including evaluation of
parsimony, distance, maximum likelihood (ML), and Bayesian methods, the
importance of measures of internal support in phylogenetic inference, and
methods of analysis of large data sets; and (c) use of molecular data to
estimate divergence times of genes or organisms. A major goal is to foster
increased interaction and communication between phylogeneticists and
physiologists/molecular geneticists by providing contacts and references for
those requiring a phylogenetic backbone for analyses.

SELECTION OF TAXA AND PHYLOGENETIC TREES IN COMPARATIVE STUDIES. A SUMMARY OF LAND PLANT PHYLOGENY

One question that systematists are frequently asked is: Where would I find
the most recent phylogenetic tree for group (fill in the blank)? We provide a
brief summary of relevant trees below, with a focus on land plants. In
addition, selected trees for angiosperms can be found at
http://www.mobot.org/MOBOT/research/APweb/,
http://www.flmnh.ufl.edu/deeptime/, and
http://plantsystematics.org/.
Researchers can also consult Tree of Life
(http://tolweb.org/tree/phylogeny.html)
and TreeBASE
(http://www.treebase.org/treebase).
Phylogenetic questions can also be posed directly to experts working on
various groups of plants; a partial list of phylogenetic consultants is
provided in Table I (for a
larger list, see also
http://www.flmnh.ufl.edu/deeptime/).

Table I. Partial list of phylogenetic experts for various clades of land plants.

Land Plants. Origin and Relationships

Understanding patterns of gene and genome evolution across land plants
requires an understanding of the phylogeny of land plants, or embryophytes.
Molecular data indicate that the sister group (i.e. the closest relative; two
sister groups share a common ancestor not shared with any other group) of land
plants is Charales (stoneworts) from the charophycean lineage of green algae
(Karol et al., 2001;
Fig. 1; see also
http://www.flmnh.ufl.edu/deeptime/).

Figure 1. Summary of phylogenetic relationships among major lineages of embryophytes
(land plants). Charales are the sister group of the embryophytes. Within the
embryophytes, liverworts, hornworts, and mosses are the basalmost lineages;
however, their precise ...

Plants colonized the land approximately 450 mya. Within the land plants,
the three lineages long known as the “bryophytes” (liverworts,
hornworts, and mosses) do not form a single clade in most analyses but instead
form a grade that subtends the tracheophytes
(Fig. 1). Furthermore, the
precise branching order of the three “bryophyte” lineages remains
ambiguous, with different topologies suggested by various data sets. A
branching order of liverworts, hornworts, and mosses has emerged as one
favored arrangement (e.g. Karol et al.,
2001); other data suggest that hornworts, followed by a clade of
mosses + liverworts, are the basal branches of the embryophytes
(Renzaglia et al., 2000).

Tracheophytes

Vascular plants (tracheophytes) constitute a large and well-defined clade
of land plants comprising the lycophytes (e.g. Lycopodium,
Selaginella, and Isoetes) as sister to two well-marked
clades: monilophytes and seed plants
(Pryer et al., 2001;
Fig. 1).

Monilophytes (or Moniliforms)

Both molecular and morphological analyses of tracheophytes have recognized
a clade of Equisetum, Marattiaceae, Psilotaceae, Ophioglossaceae, and
leptosporangiate ferns (Kenrick and Crane,
1997; Pryer et al.,
2001). Kenrick and Crane
(1997) first suggested the
presence of this clade (based on one morphological character) and designated
these plants Moniliformopses or “moniliforms”; they are now
referred to more commonly as monilophytes
(Judd et al., 2002). This
monilophyte clade unites ancient lineages not previously considered closely
related and is sister to a clade of all remaining tracheophytes, the seed
plants (Fig. 1).

Seed Plants

Despite repeated efforts, it has been difficult to resolve phylogenetic
relationships among extant seed plants, that is, angiosperms and the four
lineages of living gymnosperms: cycads, Ginkgo biloba, conifers, and
Gnetales (for review, see Donoghue and
Doyle, 2000; Soltis et al.,
2002). Analyses of morphological data generally concur in
suggesting that angiosperms and Gnetales are sister groups (the
“anthophyte” hypothesis), with extant gymnosperms paraphyletic
(that is, not forming a clade but rather a grade;
Donoghue and Doyle, 2000). However, the sister group relationship of Gnetales and angiosperms has not
been supported by most molecular analyses. Analyses of combined data sets of
multiple genes representing all three plant genomes (plastid, mitochondrion,
and nucleus) have found strong support for a clade of extant gymnosperms
(Fig. 1; e.g.
Bowe et al., 2000;
Chaw et al., 2000;
Pryer et al., 2001;
Soltis et al., 2002). However,
some extinct gymnosperms (e.g. Caytoniales and Bennettitales) may be more
closely related to angiosperms than to any lineage of living gymnosperm
(Donoghue and Doyle, 2000).
Cycads and Ginkgo biloba are sisters to the remaining living
gymnosperms. The relationship between cycads and Ginkgo biloba is
unclear; in some analyses, cycads and Ginkgo biloba are successive
sisters to a clade of conifers and Gnetales, whereas in others, Ginkgo
biloba and cycads form a clade that is sister to other extant
gymnosperms. Some molecular analyses support a surprising placement of
Gnetales within conifers as sister to Pinaceae
(Bowe et al., 2000; the
“gne-pine” hypothesis of Chaw
et al., 2000). The placement of Gnetales within conifers is an excellent example of a
molecular phylogenetic result that must be viewed with caution, for several
reasons. First, the placement of Gnetales within conifers is supported largely
by mitochondrial genes; genes from other genomes do not place Gnetales within
conifers. Furthermore, there is conflict between first and second versus third
codon positions of cpDNA genes, with different positions supporting different
placements of Gnetales. In addition, because most analyses of seed plants have
involved small numbers of taxa, the gne-pine hypothesis may be an artifact of
inadequate taxon sampling in some analyses. Our current interpretation of
relationships among extant seed plants, showing Gnetales as sister to all
conifers, is depicted in Figure
1. Analysis of extant gymnosperms exemplifies the complexities
inherent in phylogenetic analysis of ancient lineages that have undergone
significant extinction.

Angiosperms

The impact of molecular phylogenetic analyses on the angiosperms (flowering
plants) has been particularly profound (e.g.
Qiu et al., 1999;
Graham and Olmstead, 2000;
Soltis et al., 2000;
Bremer et al., 2002; see
below). Because of the wealth of molecular phylogenetic data, angiosperms
became the first major group of organisms to be reclassified based largely on
molecular data (Angiosperm Phylogeny Group
[APG], 1998); data have accumulated so rapidly that this
classification was recently revised (APG
II, 2003). Readers will find that some family circumscriptions and
ordinal groups have changed considerably from traditional classifications
(e.g. Cronquist, 1981).
Comprehensive trees depicting family level relationships for nearly all of the
300+ angiosperm families (e.g. Soltis et
al., 2000; Zanis et al.,
2002) and the APG II classification are available at
http://www.flmnh.ufl.edu/deeptime/.
Although recent classifications (e.g.
Cronquist, 1981) may still
provide some useful family descriptions, these classifications do not depict
current concepts of phylogeny. For interpretations of data in a phylogenetic
context and for consistency, authors are urged to follow the APG II
(2003) classification.

Think “Eudicots.” Abandon “Dicots”

The angiosperms, a clade of 260,000+ species
(Takhtajan, 1997), first
appeared in the fossil record, conservatively, approximately 130 mya
(Hughes, 1994). Standard
classifications divided the angiosperms into two large groups, typically
recognized at the Linnean rank of class: Magnoliopsida (dicots) and Liliopsida
(monocots). Thus, standard comparative studies of physiological pathways and
genetic/genomic data have spanned this “monocot-dicot split.”
However, even preliminary morphology-based studies of angiosperms suggested
that this “monocot-dicot split” did not accurately portray
relationships. Molecular phylogenetic analyses clearly indicate that the
traditional “dicots” are paraphyletic, with the monocots (a clade
of ±65,000 species) emerging from among the basal branches of
angiosperms (Fig. 2). Following
this basal grade of monocots and traditional “primitive dicots”
(e.g. Amborellaceae, Nymphaeaceae, Austrobaileyales, and magnoliid clade) is a
well-supported clade, the eudicots (Fig.
2). The eudicot clade contains 75% of all angiosperm species,
united by the shared feature of triaperturate pollen (pollen with three
grooves). The term “monocots” is still useful in that it
designates a clade. In contrast, the term “dicots” should be
abandoned because it does not correspond to a clade. This change in concept
and terminology has already been accepted by many entry level biology and
botany textbooks. Comparisons of genes or characters should be based on sister
groups, if possible or, minimally, on other monophyletic groups. For example,
because the sister group of the monocots remains uncertain, monocots could be
compared with members of eudicots or magnoliids. Most of the published
molecular comparisons of monocots and dicots have used eudicots as
placeholders (e.g. Arabidopsis, Brassica spp., and
Antirrhinum spp.) for the dicots. Thus, many such comparisons are
still valid, even if the terminology used (“dicot”) was
incorrect.

Figure 2. Summary tree of angiosperm relationships based on Soltis et al.
(2000, 2003), with basal
angiosperm relationships modified following Zanis et al.
(2002). Numbers are jackknife
values. Values for basal angiosperms are from Zanis et al.
(2002); the ...
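
The statement that “monocots” name a clade whereas “dicots” do not can be checked mechanically once a tree is written down. The following Python sketch is offered only as an illustration: the nested-tuple tree, the names TOY_TREE, leaf_set, all_clades, and is_monophyletic, and in particular the branching order of the toy backbone are assumptions made for this example and are not taken from Figure 2. A group is treated as monophyletic when it matches exactly the set of terminals below some node.

```python
# Toy backbone for illustration only. The branching order is an assumption
# made for this example; it is NOT a reproduction of Figure 2.
TOY_TREE = (
    "Amborellaceae",
    ("Nymphaeaceae",
     ("Austrobaileyales",
      ("magnoliids", ("monocots", "eudicots")))),
)


def leaf_set(tree):
    """Return the set of terminal names found below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    result = frozenset()
    for child in tree:
        result |= leaf_set(child)
    return result


def all_clades(tree):
    """Yield the terminal set of every node; each such set is a clade."""
    yield leaf_set(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from all_clades(child)


def is_monophyletic(tree, group):
    """True if `group` is exactly the set of terminals below some node."""
    return frozenset(group) in set(all_clades(tree))


# "Dicots" in the traditional sense: every angiosperm that is not a monocot.
traditional_dicots = leaf_set(TOY_TREE) - {"monocots"}

print("monocots form a clade:", is_monophyletic(TOY_TREE, {"monocots"}))
print("traditional dicots form a clade:",
      is_monophyletic(TOY_TREE, traditional_dicots))
```

On any tree in which the monocots are nested among the other lineages, the “dicot” test fails for the same reason: no single node has all non-monocot angiosperms, and only those, as its descendants.
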
NO SUBCLASSES

Perhaps the best known classification of angiosperms is that of Cronquist
(1981), who recognized six
subclasses of dicots, Magnoliidae, Hamamelidae, Rosidae, Dilleniidae,
Caryophyllidae, and Asteridae, and five subclasses of monocots, although these
were followed less frequently. Molecular phylogenies indicate that these
subclasses, like the classes Magnoliopsida and Liliopsida, should also be
abandoned. The Magnoliidae are paraphyletic, and both the Hamamelidae and
Dilleniidae are grossly polyphyletic, with constituent members appearing
throughout much of the angiosperm tree. Thus, “Magnoliidae,”
“Hamamelidae,” and “Dilleniidae” do not refer to
monophyletic groups, and these names are no longer valid. Cronquist's concepts
of Rosidae, Asteridae, and Caryophyllidae must be expanded and revised to
correspond to monophyletic groups; these clades are the rosids, asterids, and
Caryophyllales sensu APG II
(2003). Although Caryophyllales are recognized at the ordinal level
(see APG II, 2003), both rosids
and asterids are supraordinal groups that are not assigned a Linnean rank in
the APG II classification. It is important to note that deep-level angiosperm phylogeny is not yet
resolved. Relationships among the major clades of eudicots (e.g. rosids,
asterids, Caryophyllales, Saxifragales, Santalales, and a few smaller clades)
are unresolved (Fig. 2),
presenting a limitation for many areas of comparative biology, including
comparative genomics.

MODEL GROUPS. OPPORTUNITIES FOR COMPARATIVE GENETICS AND GENOMICS IN THE ANGIOSPERMS

The phylogenetic trees available for many families of angiosperms
facilitate interpretation of the evolution of diverse characters (molecular,
physiological, and genetic). These trees also aid in the appropriate choice of
representative taxa for comparative studies (see also
Daly et al., 2001;
Hall et al., 2002a); it is
often useful to choose representative taxa from across the breadth of a clade
and not simply one or two taxa from only a small part of the diversity of that
clade. Because trees depicting organismal phylogenies have accumulated so rapidly,
it is often difficult for the nonexpert to know how to obtain a tree for a
group of interest. Unfortunately, there is no single source that serves as a
compendium of all intrafamilial phylogenetic trees. Judd et al.
(2002) provide trees and
relevant references for many families of tracheophytes. However, because it is
an entry level textbook, many families are not covered. Therefore, we provide
a short list of experts (Table
I) who can assist with phylogenetic questions for major groups of
embryophytes. A larger list is available on the Deep Time website.

Monocots

Molecular analyses have clarified many (but far from all) relationships
within monocots (Chase et al.,
2000; Soltis et al.,
2000), and further analyses are underway (M. Chase and J. Davis,
personal communication). The sister group of the monocots remains unclear, but
the most comprehensive analyses suggest Ceratophyllaceae
(Zanis et al., 2002;
Fig. 2).

Poaceae

The Poaceae, or grass family, are an ideal focal point for comparative
genetic/genomic research (Kellogg,
2001). The Grass Phylogeny Working Group
(2001) has provided the most
comprehensive and best supported tree for the grass family. Complete
sequencing of the rice (Oryza sativa) genome and of entire cpDNA
genomes for some genera, as well as extensive genetic/genomic data for crops
including wheat (Triticum aestivum), sorghum (Sorghum
bicolor), and maize (Zea mays), make the family of
particular interest; a firm phylogenetic framework is available not only for
the family (Kellogg, 2001) but
also for individual genera, such as Hordeum (Petersen and Seberg,
2003).

Antirrhinum Spp. (Snapdragon and Relatives)

Snapdragon (Plantaginaceae; Lamiales) is one of the best model systems
for the study of floral developmental genetics and offers numerous
opportunities for comparative genetic and genomic research. Although
Antirrhinum spp. have long been placed in the family
Scrophulariaceae, molecular phylogenetic studies indicate that the
traditionally recognized Scrophulariaceae are not a single clade but actually
represent a number of distinct clades: Scrophulariaceae in the strict sense;
Plantaginaceae, which includes Antirrhinum, Plantago, and
Veronica; Orobanchaceae, which contains all of the parasitic taxa
formerly placed in either Orobanchaceae or Scrophulariaceae; the new family
Calceolariaceae; an expanded Stilbaceae; and an expanded Phrymaceae
(Olmstead et al., 2001).

Solanaceae

Solanaceae contain a number of model organisms, including tomato and potato
(Solanum tuberosum), tobacco (Nicotiana tabacum), peppers
(Capsicum annuum), and petunia (Petunia hybrida). The family
has also served as a model for studies of reproductive incompatibility and
organization of the nuclear genome. A molecular phylogenetic framework and a
provisional reclassification are now available for the family
(Olmstead et al., 1999).
Molecular studies have also confirmed that Convolvulaceae represent the sister
group of Solanaceae (Soltis et al.,
2000). As noted, tomato (formerly Lycopersicon) is
clearly embedded within the large genus Solanum, which also includes
potatoes. Thus, potato and tomato share very similar linkage maps (e.g.
Tanksley et al., 1988;
Doganlar et al., 2002) because
they share a recent common ancestor.

Legumes (Fabaceae)

The closest relative of the Fabaceae has long been considered a mystery.
Phylogenetic analyses have recently shown the closest relatives of Fabaceae to
be Surianaceae and Polygalaceae (Soltis et
al., 2000). Considerable progress has been made in recent years in
clarifying relationships across the family as a whole and also within
subclades within the family (Doyle and
Luckow, 2003). Recent analyses have also identified the closest
relatives of several important crop genera, including Medicago,
Glycine, and Pisum (e.g.
Kajita et al., 2001;
Hu et al., 2002; for review,
see Doyle and Luckow,
2003).

Brassicaceae

Brassicaceae offer important opportunities in comparative genomics by
extending out from the complete genome sequence of Arabidopsis (e.g.
Hall et al., 2002a;
Mitchell-Olds and Clauss,
2002). Initial molecular phylogenetic analyses indicated the
presence of a broadly defined Brassicaceae (Brassicaceae sensu lato) that also
includes Capparaceae. More recently, Hall et al.
(2002b) found evidence for
three well-supported clades within Brassicaceae sensu lato (Capparaceae
subfamily Capparoideae, Capparaceae subfamily Cleomoideae, and Brassicaceae
sensu stricto), with the latter two clades as sister groups. Rather than a
single broadly defined family Brassicaceae, it may be more appropriate to
recognize three families: Capparaceae, Cleomaceae, and Brassicaceae
(Hall et al., 2002b). The
model plants Brassica spp. and Arabidopsis are in Brassicaceae. It may
be informative to include members of Capparaceae (e.g. Capparis spp.)
and Cleomaceae (Cleome spp.) in comparative genetic and genomic
analyses. Recent phylogenetic studies of Arabidopsis and relatives (Koch et al.,
1999,
2001,
2003;
Koch, 2003;
O'Kane and Al-Shehbaz, 2003)
have provided an initial tree for Brassicaceae sensu stricto and identified an
Arabidopsis clade that contains the closest relatives of Arabidopsis. However,
a more comprehensive analysis of the family is required and is well underway
(M. Beilstein, E. Kellogg, and I. Al-Shehbaz, personal communication).

Brassicales

Brassicaceae are part of a well-supported Brassicales (i.e.
“glucosinolate clade”; e.g.
Rodman et al., 1998;
Soltis et al., 2000), a clade
of 15 families that were not considered closely related in recent
classifications (e.g. Cronquist,
1981). The order offers the opportunity to investigate the
evolution of a host of features considered characteristic of Brassicaceae.
Some aspects of genomic and genic diversification will be better understood by
extending out from Brassicaceae to relatives in Brassicales.

PHYLOGENY RECONSTRUCTION. A PRIMER

Alignment (“Garbage in; Garbage out”)

Alignment of nucleotide and amino acid sequences is a major consideration,
particularly in studies of genes from divergent taxa (e.g. rice and
Arabidopsis). It seems obvious to state that the phylogenetic analysis of
sequences begins with the appropriate alignment of the data themselves, yet
alignment remains one of the most difficult and poorly understood facets of
molecular data analysis. Detailed coverage of the topic is beyond the scope of
this Update, but excellent overviews are provided by Doyle and Gaut
(2000) and Simmons and
Ochoterena (2000). We will
simply restate, as Doyle and Gaut
(2000) stress, that
researchers should not accept alignments produced with the default settings of
any computer algorithm without a critical evaluation by eye. Furthermore,
there may be multiple “good” alignments, and all of these should
be subjected to phylogenetic analysis.

Life after Neighbor Joining (NJ)

Inferences of orthology require phylogenetic analysis. Although expression
patterns and knowledge of function may provide clues to orthology
relationships, orthology, by definition, requires historical analysis to
disentangle the products of gene duplication and speciation (for useful review
of orthology and paralogy, see Doyle and
Gaut, 2000; Jensen,
2001; Koonin,
2001). Thus, molecular biologists and geneticists suddenly need to
become phylogeneticists. Although molecular phylogeny reconstruction is a
relatively young discipline, it nonetheless has a rich and sometimes
contentious background, encompassing diverse philosophies and methodologies
that are not necessarily apparent to users of most available computer
packages. Several approaches can be used in phylogeny reconstruction of
molecular sequences: maximum parsimony (MP), maximum likelihood (ML),
distance-based methods such as NJ, and Bayesian inference (BI), a new method
of phylogenetic inference (Huelsenbeck et
al., 2002). All of these methods have strengths and weaknesses
(e.g. Swofford et al., 1996;
Lewis, 1998;
Doyle and Gaut, 2000;
Huelsenbeck et al., 2002;
Nei and Kumar, 2000), some of
which are summarized in Table
II.

Table II. Comparison of methods of phylogeny reconstruction

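
The distinction between orthologs and paralogs drawn above depends on which kind of event, speciation or duplication, sits at the most recent common ancestor of two genes in the gene tree. The Python sketch below illustrates that rule under strong simplifying assumptions: the gene tree, the gene names, and the event labels ("S" for speciation, "D" for duplication) are hypothetical and are supplied by hand, whereas in practice they would have to be inferred by reconciling a gene tree with an organismal tree. The function name classify_pairs is likewise an assumption of this example.

```python
from itertools import product

# Hypothetical gene tree. Internal nodes are (event, left, right); the event
# label "D" (duplication) or "S" (speciation) is assumed to be known already.
TOY_GENE_TREE = (
    "D",
    ("S", "rice_copyA", "arabidopsis_copyA"),
    ("S", "rice_copyB", "arabidopsis_copyB"),
)


def classify_pairs(node):
    """Return (genes, relations) for the subtree rooted at `node`.

    `relations` maps each unordered pair of genes to "orthologous" when the
    pair's most recent common ancestor is a speciation node, and to
    "paralogous" when it is a duplication node.
    """
    if isinstance(node, str):
        return [node], {}
    event, left, right = node
    left_genes, left_relations = classify_pairs(left)
    right_genes, right_relations = classify_pairs(right)
    relations = {**left_relations, **right_relations}
    label = "orthologous" if event == "S" else "paralogous"
    # Genes on opposite sides of this node have this node as their ancestor.
    for a, b in product(left_genes, right_genes):
        relations[frozenset((a, b))] = label
    return left_genes + right_genes, relations


_, relations = classify_pairs(TOY_GENE_TREE)
for pair, label in sorted(relations.items(), key=lambda item: sorted(item[0])):
    print(" / ".join(sorted(pair)), "->", label)
```

In this toy case the two copy-A genes are orthologs of one another, whereas any comparison between a copy-A gene and a copy-B gene is a comparison of paralogs, even when the two genes come from different species.
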
Although there is a desire among many investigators for rapid phylogeny
reconstruction and “instant tree,” it may be prudent to explore
several methods (e.g. Swofford et al.,
1996; Doyle and Gaut,
2000; Nei and Kumar,
2000). There remains a tendency to place more trust in
phylogenetic results supported by multiple approaches
(Doyle and Gaut, 2000).
Regardless of method of phylogenetic inference, however, some measure of
internal support (e.g. bootstrap, jackknife, and posterior probabilities; see
below) is essential.

Many non-systematists employ NJ to the exclusion of other methods (Nei and
Kumar, 2000). The distance measures used in NJ and other distance methods are
typically based on models of nucleotide substitution. The NJ algorithm is fast
and readily available in software packages such as MEGA
(http://www.megasoftware.net/)
and PAUP*. However, it also has important weaknesses. For example, NJ provides
only a single tree, precluding comparison with other topologies. In reality,
many optimal trees may be found in MP and ML analyses, depending on the data
set, and these methods allow all optimal or near-optimal trees to be compared.
Furthermore, different trees can be obtained with NJ depending on the entry
order of the taxa (Farris et al.,
1996; see Table
II). One solution is to run multiple NJ analyses with different
random entry orders of the taxa, accompanied by bootstrap or jackknife
analysis (see below). Finally, because sequence differences are summarized as
distance values, it is impossible to identify the specific character changes
that support a branch. Although proponents of NJ, Nei and Kumar
(2000) nonetheless argue for a
pluralistic approach. Other methods of phylogenetic inference should be
explored in addition to NJ. MP is preferred by many phylogeneticists because of its theoretical basis
and the diagnosable units it produces. The advantages of parsimony over NJ are
several (Table II), an
important one being that parsimony seeks to recover all shortest trees.
Depending on the data set, a parsimony search may yield one (or a few) to
hundreds or thousands of equally short trees. These shortest trees can be
summarized in a strict consensus tree, which depicts only the nodes present in
all equally short trees. In addition, MP analysis provides diagnoses (i.e.
specific sets of characters) for each clade and branch lengths in terms of the
number of steps (or changes) on each branch of a tree. Statistical methods of phylogeny reconstruction, incorporating models of
nucleotide (or amino acid) substitution, are preferred by many molecular
phylogeneticists (see Lewis,
1998). Both ML and BI rely on such models to reconstruct both
topology and branch lengths and, thus, are computationally intensive. ML
analysis finds the likelihood of the data, given a tree and a model of
molecular evolution. Like ML, BI has had a long tenure in statistics. However,
it has only recently been introduced into phylogenetics (see Huelsenbeck et
al., 2001,
2002). Although BI uses the
same models of evolution as some other methods of phylogenetic analyses (e.g.
ML and NJ), it represents a powerful tool and perhaps the wave of the future
in phylogenetic inference. BI is based on a quantity referred to as the
posterior probability of a tree, a value that can be interpreted as the
probability that a tree is correct, given the data. BI uses a likelihood
function to compute the posterior probability. Although BI allows the
researcher to specify a prior belief in relationships
(Table II; Huelsenbeck et al.,
2001,
2002), this option has not
been explored extensively to date, and Bayesian analyses typically assign
equal prior probability values to all possible trees. Whereas ML is not
feasible for large data sets (more than perhaps 50 taxa), BI (as implemented
in MrBayes; see Huelsenbeck et al.,
2001) incorporates a faster search strategy (using Markov chains)
and can be used on data sets of several hundred taxa to find trees, branch
lengths, and support (but see Suzuki et al., 2002).

Certainly a frustrating aspect of phylogenetic analysis to those outside of
the field is the number of inference methods available. NJ is widely used, in
part, because of its speed and ready availability in computer packages such as
MEGA. It also is part of alignment packages such as MegAlign
(http://www.dnastar.com/cgi-bin/php.cgi?r10.php).
However, parsimony can be readily implemented using PAUP*
(Swofford, 1998; NJ and ML are
also part of the PAUP package). PAUP* is often not employed by molecular
biologists, however, because the user friendly version with pull-down menus is
made for Macintosh, not Windows, operating systems.

Internal Support for Clades

Some measure of internal support for clades should be provided on all
phylogenetic trees. Resampling approaches, such as the bootstrap and the
jackknife, are easily computed using PAUP* for parsimony, NJ, and ML analyses,
and parsimony jackknifing is performed by Jac (Farris et al., 1996). The
pros and cons of the jackknife versus bootstrap have been discussed (e.g.
Farris et al., 1996;
Soltis and Soltis, 2003). A
reasonable number of replications should be employed, but
“reasonable” varies with the size of the data set, the
specifications of the analysis, and the patience of the investigator. It has
been argued (Farris et al.,
1996) that resampling methods should maximize the number of
replicates at the expense of detailed searches in each replicate. Thus, with
“fast” methods that conduct little or no branch swapping per
replicate, 1,000 or more replicates are quickly obtained. A smaller number of
replicates (e.g. 100) may be suitable for bootstrap and jackknife analyses
that include detailed searches per replicate. Interpretations of bootstrap and jackknife values vary (for review, see
Soltis and Soltis, 2003),
although few view these values in a strict statistical sense. Bootstrap values
are conservative, but biased, measures of phylogenetic accuracy
(Hillis and Bull, 1993), with
values of 70% or greater corresponding to “true” clades in
experimental phylogenies (Hillis and Bull,
1993). Thus, some consider values of 70% or more as indicators of
strong support, whereas others reserve “strong support” for values
of 90 or 95% and above. Although different phylogenetic methods may yield
different optimal topologies, the differences generally involve poorly
supported clades. Those clades that are strongly supported generally appear in
topologies regardless of the method of phylogenetic inference. Additional
measures of support include the decay index or Bremer support
(Bremer, 1994) for parsimony
analyses and the posterior probabilities generated in BI.

Measures of internal support indicate those relationships in which we
should, and should not, have confidence. A recently identified clade of
MADS-box genes appears as the sister group to the well-known B class floral
genes that specify the identity of petals and stamens in Arabidopsis and
snapdragon. Becker et al.
(2002) termed this new clade
Bsister and determined that these genes are present in diverse seed
plants. Although the monophyly of the Bsister clade received 92%
bootstrap support, the placement of the Bsister clade as sister to
the clade of B class genes received only 77% bootstrap support. With this
level of support, it is reasonable to question whether the Bsister clade is really the sister group of the clade of B class genes. Increased
sampling of Bsister genes from additional taxa and more rigorous
analyses are needed to establish with certainty the placement of the
Bsister clade within the MADS-box genes of plants.
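
The resampling logic behind bootstrap support can be illustrated with a minimal Python sketch. It is not how PAUP* or Jac work internally: to stay short, it replaces parsimony, NJ, or ML with simple UPGMA clustering on uncorrected p-distances, and the function names (`p_distance`, `upgma_clades`, `bootstrap_support`) and the toy alignment are hypothetical, invented only for illustration. Each replicate resamples alignment columns with replacement, rebuilds the tree, and records whether the clade of interest is recovered.

```python
import random
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma_clades(seqs):
    """Cluster taxa by average pairwise distance (UPGMA, used here only as a
    simple stand-in for a real tree search) and return every clade of the
    resulting tree as a frozenset of taxon names."""
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}

    def cluster_distance(c1, c2):
        # UPGMA distance: mean of all between-cluster pairwise distances.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters = [frozenset([name]) for name in sorted(seqs)]
    clades = set()
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merged = c1 | c2
        clusters = [c for c in clusters if c not in (c1, c2)] + [merged]
        clades.add(merged)
    return clades


def bootstrap_support(seqs, clade, replicates=1000, seed=0):
    """Percentage of bootstrap replicates in which `clade` is recovered."""
    rng = random.Random(seed)
    n_sites = len(next(iter(seqs.values())))
    target = frozenset(clade)
    hits = 0
    for _ in range(replicates):
        # Resample alignment columns with replacement (the bootstrap step).
        cols = [rng.randrange(n_sites) for _ in range(n_sites)]
        resampled = {name: "".join(s[i] for i in cols)
                     for name, s in seqs.items()}
        if target in upgma_clades(resampled):
            hits += 1
    return 100.0 * hits / replicates


# Hypothetical toy alignment; real data sets are far longer.
toy_alignment = {
    "taxonA": "ACGTACGTACGTACGTACGT",
    "taxonB": "ACGTACGTACGAACGTACGT",
    "taxonC": "ACGAACTTACGTTCGTACCT",
    "taxonD": "ACGAACTTACGTTCGAACCT",
}

if __name__ == "__main__":
    support = bootstrap_support(toy_alignment, {"taxonA", "taxonB"})
    print(f"Bootstrap support for (taxonA, taxonB): {support:.1f}%")
```

A jackknife variant would differ only in the resampling step, deleting a fraction of the columns rather than drawing them with replacement. The `replicates` argument reflects the trade-off discussed above: a fast tree-building step such as this one makes 1,000 replicates cheap, whereas a detailed search per replicate may justify fewer.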
MOLECULAR CLOCKS: RATES AND DATES OF GENE DIVERSIFICATION

Many efforts to date evolutionary divergences using a molecular clock have
yielded age estimates that are grossly inconsistent with the fossil record,
regardless of method of tree construction. For example, molecular-based
estimates of divergence times in plants reveal a vast range of dates. Using
molecular data, the age of the angiosperms has been variously estimated as 350
to 420 mya, more than 319 mya, 200 mya, and 140 to 190 mya (for review, see
Sanderson and Doyle, 2001).
However, the oldest unequivocal angiosperm fossils are 125 to 135 mya (for
review, see Soltis et al.,
2002).

Many sources of error and bias can affect molecular-based estimates of
divergence times (see Sanderson and Doyle,
2001; Soltis et al.,
2002). Obviously, an incorrect topology will yield erroneous
estimates, with the magnitude of the problem depending on the extent of the
topological error (Sanderson and Doyle,
2001). Inaccurate calibration will bias the resulting estimates.
Also problematic are heterogeneous rates of evolution among lineages (see
Sanderson and Doyle, 2001;
Soltis et al., 2002).
Inadequate taxon sampling can compound the problem. Estimates of divergence
times can also vary among genes or other data partitions (e.g. among codon
positions). Another potential source of error is the method used to estimate
divergence dates. Sanderson and Doyle
(2001) used molecular data to
examine angiosperm divergences and found that the age of crown group
angiosperms ranges from 68 to 281 mya, depending on data, tree, and
assumptions, with most estimates falling between 140 and 190 mya.

Given that rate heterogeneity among lineages is common in most
molecular-based trees, can we reliably use molecular data to estimate
divergence times? Simple clock-based approaches to estimating divergence times
are not likely to yield meaningful estimates. However, several approaches have
been proposed when the assumption of rate constancy is violated: linearized
trees (Takezaki et al., 1995),
nonparametric rate smoothing (Sanderson, 1997, 1998), penalized likelihood
(Sanderson, 2002), Bayesian
approaches (e.g. Huelsenbeck et al.,
2002; Thorne and Kishino,
2002), and “PATH”
(Britton et al., 2002; for
review of methods and instructions for implementing nonparametric rate
smoothing, see
http://www.flmnh.ufl.edu/deeptime/).
Although methods to accommodate deviations from a steady molecular clock are
still under development, it is nonetheless possible to estimate dates of
divergence, given: (a) a reliable calibration point or points, (b) adequate
sampling of taxa and characters, and (c) a method that is robust to rate
heterogeneity. Confidence intervals for the estimated dates and consistency
with the fossil record provide means for assessing the reliability of age
estimates. Despite attempts to accommodate deviations from constant
evolutionary rates, however, confidence intervals are typically large, and
divergence times should be interpreted carefully.
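
To make the caution about simple clocks concrete, the following Python sketch shows the arithmetic of a strict molecular clock and how rate heterogeneity distorts it. It implements none of the rate-smoothing, penalized likelihood, or Bayesian methods cited above. The function names and all numerical values are hypothetical, chosen only so that the arithmetic is easy to follow.

```python
def calibrate_rate(distance, age_mya):
    """Per-lineage substitution rate (substitutions/site/million years)
    from a pair of taxa whose divergence age is known, e.g. from a fossil.
    The factor of 2 reflects that both lineages accumulate change."""
    return distance / (2.0 * age_mya)


def estimate_age(distance, rate):
    """Strict-clock divergence time (mya) for a pair of taxa separated by
    `distance`, assuming both lineages evolve at the same constant `rate`."""
    return distance / (2.0 * rate)


# Hypothetical calibration: 0.10 substitutions/site between two taxa
# whose split is dated at 100 mya.
rate = calibrate_rate(distance=0.10, age_mya=100.0)

# Clock-like case: a second pair that truly diverged 130 mya.
true_age = 130.0
clocklike_distance = 2.0 * rate * true_age
clock_estimate = estimate_age(clocklike_distance, rate)

# Rate heterogeneity: same true age, but one lineage evolves twice as fast.
hetero_distance = rate * true_age + (2.0 * rate) * true_age
hetero_estimate = estimate_age(hetero_distance, rate)

if __name__ == "__main__":
    print(f"Calibrated rate: {rate:.6f} substitutions/site/my")
    print(f"Estimate under a true clock: {clock_estimate:.1f} mya")
    print(f"Estimate with one fast lineage: {hetero_estimate:.1f} mya")
```

In this invented example the strict clock recovers the true age only when both lineages share the calibrated rate; doubling the rate in one lineage inflates the estimate by half. The methods listed above are attempts to avoid this kind of bias, and the sketch also shows why an inaccurate calibration propagates directly into every downstream date.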
SUMMARY AND FUTURE PROSPECTS

An exciting recent development is the merging of phylogenetics and
genomics. Phylogenetic hypotheses have become the framework for the choice of
organisms in genomic analyses, and more and more molecular biologists are
using phylogenetic trees to guide their sampling of taxa for comparative
research. This trend will continue. Systematics is moving rapidly; therefore,
molecular biologists are encouraged to contact systematics
“experts” for help in obtaining the best supported trees for a
given clade of interest. We stress the importance of a rigorous phylogenetic
analysis of data. It is ironic, for example, that researchers may spend years
gathering gene sequence data, but then want an immediate phylogenetic
“answer” within seconds or minutes. A thorough phylogenetic
analysis that evaluates alternative alignments and exon versus intron
boundaries, uses different phylogenetic methods, and obtains estimates of
internal support may take several weeks or more, and this should not be
considered an unreasonable investment of time. Our review of issues relating to phylogeny
reconstruction also illustrates the need for more “quick courses”
in phylogeny reconstruction for molecular biologists interested in
constructing gene trees.
Acknowledgments

We thank Jeff Doyle, Bernie Hauser, Alice Harmon, and two anonymous
reviewers for helpful comments on earlier drafts of this paper.
References

- Angiosperm Phylogeny Group (1998) An ordinal
classification for the families of flowering plants. Ann Missouri Bot
Gard 85: 531–553.
- Angiosperm Phylogeny Group II (2003) An updated
classification of the angiosperms. Bot J Linn Soc 141: 399–436.
- Becker A, Kaufmann K, Freialdenhoven A, Vincent C, Li MA,
Saedler H, Theissen G (2002) A novel MADS-box gene
subfamily with a sister-group relationship to class B floral homeotic genes.
Mol Genet Genomics 266: 942–950 [PubMed].
- Becker A, Winter K-U, Meyer B, Saedler H, Theissen G (2000) MADS-box gene diversity in seed plants 300 million years
ago. Mol Biol Evol 17: 1425–1434 [PubMed].
- Bowe LM, Coat G, dePamphilis CW (2000)
Phylogeny of seed plants based on all three genomic compartments: extant
gymnosperms are monophyletic and Gnetales' closest relatives are conifers.
Proc Natl Acad Sci USA 97: 4092–4097 [PubMed].
- Bremer K (1994) Branch support and tree
stability. Cladistics 10: 295–304.
- Bremer B, Bremer K, Heidari N, Erixon P, Olmstead RG,
Källersjö M, Anderberg A, Barkhordarian E (2002) Phylogenetics of asterids based on 3 coding and 3
non-coding chloroplast DNA markers and the utility of non-coding DNA at higher
taxonomic levels. Mol Phylogenet Evol 24: 274–301 [PubMed].
- Britton T, Oxelman B, Vinnersten A, Bremer K (2002) Phylogenetic dating with confidence intervals using mean
path lengths. Mol Phylogenet Evol 24: 58–65 [PubMed].
- Chase MW, Soltis DE, Soltis PS, Rudall PJ, Fay MF, Hahn WJ,
Sullivan S, Joseph J, Molvray M, Kores PJ et al. (2000) Higher-level systematics of the monocotyledons: an
assessment of current knowledge and a new classification. In KL
Wilson, DA Morrison, eds, Monocots: Systematics and Evolution.
CSIRO Publishing, Victoria, Australia, pp
3–16.
- Chaw SM, Parkinson CL, Cheng Y, Vincent TM, Palmer JD (2000) Seed plant phylogeny inferred from all three plant
genomes: monophyly of extant gymnosperms and origin of Gnetales from conifers.
Proc Natl Acad Sci USA 97: 4086–4091 [PubMed].
- Cronquist A (1981) An Integrated System
of Classification of Flowering Plants. Columbia University Press, New
York.
- Daly DC, Cameron KM, Stevenson DW (2001) Plant
systematics in the age of genomics. Plant Physiol 127: 1328–1333 [PubMed].
- Doganlar S, Frary A, Daunay MC, Lester RN, Tanksley SD (2002) A comparative genetic linkage map of eggplant (Solanum
melongena) and its implications for genome evolution in the Solanaceae.
Genetics 161: 1697–1711 [PubMed].
- Donoghue MJ, Doyle JA (2000) Seed plant
phylogeny: demise of the anthophyte hypothesis? Curr Biol 10: R106–R109 [PubMed].
- Doyle JJ, Gaut B (2000) Evolution of genes and
taxa: a primer. Plant Mol Biol 42: 1–23 [PubMed].
- Doyle JJ, Luckow MS (2003) The rest of the
iceberg: legume diversity and evolution in a phylogenetic context.
Plant Physiol (in press).
- Farris JS, Albert VA, Källersjö M, Lipscomb D, Kluge
AG (1996) Parsimony jackknifing outperforms neighbor-joining.
Cladistics 12: 99–124.
- Graham SW, Olmstead RG (2000) Utility of 17
chloroplast genes for inferring the phylogeny of the basal angiosperms.
Am J Bot 87: 1712–1730 [PubMed].
- Grass Phylogeny Working Group (2001) Phylogeny
and subfamilial classification of the grasses (Poaceae). Ann Missouri
Bot Gard 88: 373–457.
- Hall AE, Fiebig A, Preuss D (2002a) Beyond the
Arabidopsis genome: opportunities for comparative genomics.
Plant Physiol 129: 1439–1447 [PubMed].
- Hall JC, Sytsma KJ, Iltis HH (2002b) Phylogeny
of Capparaceae and Brassicaceae based on chloroplast sequence data. Am
J Bot 89: 1826–1842.
- Hillis DM, Bull JJ (1993) An empirical test of
bootstrapping as a method for assessing confidence in phylogenetic analysis.
Syst Biol 42: 182–192.
- Huelsenbeck JP, Ronquist F, Nielsen R, Bollback JP (2001) Bayesian inference of phylogeny and its impact on
evolutionary biology. Science 294: 2310–2314 [PubMed].
- Huelsenbeck JP, Larget B, Miller RE, Ronquist F (2002) Potential applications and pitfalls of Bayesian inference
of phylogeny. Syst Biol 51: 673–688 [PubMed].
- Hu J-M, Lavin M, Wojciechowski M, Sanderson MJ (2002) Phylogenetic analysis of nuclear ribosomal ITS/5.8S
sequences in the tribe Millettieae (Fabaceae):
Poecilanthe-Cyclolobium, the core Millettieae, and the
Callerya group. Syst Bot 27: 722–733.
- Hughes NF (1994) The Enigma of
Angiosperm Origins. Cambridge University Press, Cambridge,
UK.
- Jensen RA (2001) Orthologs and paralogs: we
need to get it right. Genome Biol 2: 1002.1–1002.3.
- Judd WS, Campbell CS, Kellogg EA, Stevens PF, Donoghue MJ (2002) Plant Systematics: A Phylogenetic Approach.
Sinauer Associates, Inc., Sunderland, MA.
- Kajita T, Ohashi H, Tateishi Y, Bailey CD, Doyle JJ (2001) rbcL and legume phylogeny with particular
reference to Phaseoleae, Millettieae, and allies. Syst Bot 26: 515–536.
- Karol KG, McCourt RM, Cimino MT, Delwiche CF (2001) The closest living relatives of land plants.
Science 294: 2351–2352 [PubMed].
- Kellogg EA (2001) Evolutionary history of the
grasses. Plant Physiol 125: 1198–1205 [PubMed].
- Kenrick P, Crane PR (1997) The Origin
and Early Evolution of Land Plants. Smithsonian Institution Press,
Washington, DC.
- Koch M (2003) Molecular Phylogenetics,
Evolution and Population Biology in the Brassicaceae. In AK Sharma, A
Sharma, eds, Plant Genome: Biodiversity and Evolution, Vol
1: Phanerogams. Science Publishers, Inc., Enfield, NH
(in press).
- Koch M, Bishop J, Mitchell-Olds T (1999)
Molecular systematics and evolution of Arabidopsis and
Arabis. Plant Biol 1: 529–537.
- Koch M, Haubold B, Mitchell-Olds T (2001)
Molecular systematics of the Brassicaceae: evidence from coding plastid
matK and nuclear Chs sequences. Am J Bot 88: 534–544 [PubMed].
- Koch M, Mummenhoff K, Al-Shehbaz IA (2003)
Molecular systematics, evolution, and population biology in the mustard family
(Brassicaceae): a review of a decade of studies. Ann Missouri Bot
Gard (in press).
- Kolukisaoglu HM, Marx MS, Weigmann C, Hanelt S,
Schneider-Portsch AW (1995) Divergence of the
phytochrome gene family predates angiosperm evolution and suggests that
Selaginella and Equisetum arose prior to Psilotum.
J Mol Evol 41: 329–337 [PubMed].
- Koonin EV (2001) An apology for orthologs: or
brave new memes. Genome Biol 2: 1005.1–1005.2.
- Lewis P (1998) Maximum likelihood as an
alternative to parsimony for inferring phylogeny using nucleotide sequence
data. In DE Soltis, PS Soltis, JJ Doyle, eds, Molecular
Systematics of Plants II: DNA Sequencing. Kluwer, Boston, pp
132–187.
- Ma H, dePamphilis C (2000) The ABCs of floral
evolution. Cell 101: 5–8 [PubMed].
- Mathews S, Sharrock RA (1997) Phytochrome gene
diversity. Plant Cell Environ 20: 666–671.
- McDowell JM, Huang S, McKinney EC, An Y-Q, Meagher RB (1996) Structure and evolution of the actin gene family in
Arabidopsis thaliana. Genetics 142: 587–602 [PubMed].
- Mitchell-Olds T, Clauss MJ (2002) Plant
evolutionary genomics. Curr Opin Plant Biol 5: 74–79 [PubMed].
- Nei M, Kumar S (2000) Molecular
Evolution and Phylogenetics. Oxford University Press,
Oxford.
- O'Kane SL, Al-Shehbaz IA (2003) Phylogenetic
position and generic limits of Arabidopsis (Brassicaceae) based on
sequences of nuclear ribosomal DNA. Ann Missouri Bot Gard (in
press).
- Olmstead RG, dePamphilis CW, Wolfe AD, Young ND, Elisens WJ, Reeves A (2001) Disintegration of the Scrophulariaceae.
Am J Bot 88: 348–361 [PubMed].
- Olmstead RG, Sweere JA, Spangler RE, Bohs L, Palmer J (1999) Phylogeny and provisional classification of the Solanaceae
based on chloroplast DNA. In M Nee, DE Symon, RN Lester, JP Jessop,
eds, Solanaceae IV. Royal Botanic Gardens, Kew, UK, pp
111–117.
- Petersen G, Seberg O (2003) Phylogenetic
analyses of the diploid species of Hordeum (Poaceae) and a revised
classification of the genus. Syst Bot 28: 293–306.
- Pryer KM, Schneider H, Smith AR, Cranfill R, Wolf P, Hunt JS,
Sipes SD (2001) Horsetails and ferns are a monophyletic group
and the closest living relatives to seed plants. Nature 409: 618–622 [PubMed].
- Pryer KM, Schneider H, Zimmer EA, Banks JA (2002) Deciding among green plants for whole genome studies.
Trends Plant Sci 7: 550–554 [PubMed].
- Qiu YL, Lee J, Bernasconi-Quadroni F, Soltis DE, Soltis PS,
Zanis M, Zimmer EA, Chen Z, Savolainen V, Chase MW (1999) The earliest angiosperms: evidence from mitochondrial,
plastid and nuclear genomes. Nature 402: 404–407 [PubMed].
- Renzaglia KS, Duff RJ, Nickrent DL, Garbary DJ (2000) Vegetative and reproductive innovations of early land
plants: implications for a unified phylogeny. Philos Trans R Soc Lond
B 355: 769–793 [PubMed].
- Rodman JE, Soltis PS, Sytsma KJ, Karol KG (1998) Parallel evolution of glucosinolate biosynthesis inferred
from congruent nuclear and plastid gene phylogenies. Am J Bot 85: 997–1006.
- Sanderson MJ (1997) A nonparametric approach to
estimating divergence times in the absence of rate constancy. Mol
Biol Evol 14: 1218–1231.
- Sanderson MJ (1998) Estimating rate and time in
molecular phylogenies: beyond the molecular clock. In DE Soltis, PS
Soltis, JJ Doyle, eds, Molecular Systematics of Plants II.
Kluwer, Boston, pp 242–264.
- Sanderson MJ (2002) Estimating absolute rates
of molecular evolution and divergence times: a penalized likelihood approach.
Mol Biol Evol 19: 101–109 [PubMed].
- Sanderson MJ, Doyle JA (2001) Sources of error
and confidence intervals in estimating the age of angiosperms from
rbcL and 18S rDNA data. Am J Bot 88: 1499–1516.
- Simmons MP, Ochoterena H (2000) Gaps as
characters in sequence-based phylogenetic analyses. Syst Biol 49: 369–381 [PubMed].
- Soltis DE, Soltis PS (2000) Contributions of
plant molecular systematics to studies of molecular evolution. Plant
Mol Biol 42: 45–75 [PubMed].
- Soltis DE, Soltis PS, Chase MW, Mort ME, Albach DC, Zanis M,
Savolainen V, Hahn WH, Hoot SB, Fay MF et al. (2000)
Angiosperm phylogeny inferred from a combined data set of 18S rDNA, rbcL and
atpB sequences. Bot J Linn Soc 133: 381–461.
- Soltis PS, Soltis DE (2003) The bootstrap in
phylogeny reconstruction. Am Stat (in press).
- Soltis PS, Soltis DE, Savolainen V, Crane PR, Barraclough T (2002) Rate heterogeneity among lineages of land plants:
integration of molecular and fossil data and evidence for molecular living
fossils. Proc Natl Acad Sci USA 99: 4430–4435 [PubMed].
- Spooner DM, Anderson GJ, Jansen RK (1993)
Chloroplast DNA evidence for the interrelationships of tomatoes, potatoes and
pepinos (Solanaceae). Am J Bot 80: 676–688.
- Suzuki Y, Glazko GV, Nei M (2002)
Overcredibility of molecular phylogenies obtained by Bayesian phylogenetics.
Proc Natl Acad Sci USA 99: 16138–16143 [PubMed].
- Swofford DL (1998) PAUP* 4.0:
Phylogenetic Analysis Using Parsimony (and Other Methods). Sinauer
Associates, Sunderland, MA.
- Swofford DL, Olsen GJ, Waddell PJ, Hillis DM (1996) Phylogenetic inference. In DM Hillis, C Moritz,
BK Mable, eds, Molecular Systematics. Sinauer Associates,
Sunderland, MA, pp 407–514.
- Takezaki N, Rzhetsky A, Nei M (1995)
Phylogenetic test of the molecular clock and linearized trees. Mol Biol
Evol 12: 823–833 [PubMed].
- Takhtajan A (1997) Diversity and
classification of flowering plants. Columbia University Press, New
York.
- Tanksley SD, Bernatsky R, Lapitan NL, Prince JP (1988) Conservation of gene repertoire but not gene order in
pepper and tomato. Proc Natl Acad Sci USA 85: 6419–6423.
- Thorne JL, Kishino H (2002) Divergence time and
evolutionary rate estimation with multilocus data. Syst Biol 51: 689–702 [PubMed].
- Walbot V (2000) A green chapter in the book of
life. Nature 408: 794–795 [PubMed].
- Waters ER, Vierling E (1999) The
diversification of plant cytosolic small heat shock proteins preceded the
divergence of mosses. Mol Biol Evol 16: 127–139 [PubMed].
- Zanis MJ, Soltis DE, Soltis PS, Mathews S, Donoghue MJ (2002) The root of the angiosperms revisited. Proc Natl
Acad Sci USA 99: 6848–6853 [PubMed].