A Fast and Reliable Computational Method for Estimating Population Genetic Parameters

doi:10.1534/genetics.108.087049

Journal List > Genetics > v.179(2); Jun 2008

Genetics. 2008 June; 179(2): 951–963.

doi: 10.1534/genetics.108.087049.

PMCID: PMC2429888

A Fast and Reliable Computational Method for Estimating Population Genetic Parameters

Daniel A. Vasco¹

Odum School of Ecology, University of Georgia, Athens, Georgia 30602

¹Address for correspondence: Animal Genomics Group, S162 ASRC, Division of Animal Sciences, University of Missouri, 920 E. Campus Dr., Columbia, MO 65211-5300. E-mail: vasco.daniel.a/at/gmail.com

Communicating editor: R. Nielsen

Received January 14, 2008; Accepted March 31, 2008.

Abstract

The estimation of ancestral and current effective population sizes in expanding populations is a fundamental problem in population genetics. Recently it has become possible to scan entire genomes of several individuals within a population. These genomic data sets can be used to estimate basic population parameters such as the effective population size and population growth rate. Full-data-likelihood methods potentially offer a powerful statistical framework for inferring population genetic parameters. However, for large data sets, computationally intensive methods based upon full-likelihood estimates may encounter difficulties. First, the computational method may be prohibitively slow or difficult to implement for large data. Second, estimation bias may markedly affect the accuracy and reliability of parameter estimates, as suggested from past work on coalescent methods. To address these problems, a fast and computationally efficient least-squares method for estimating population parameters from genomic data is presented here. Instead of modeling genomic data using a full likelihood, this new approach uses an analogous function, in which the full data are replaced with a vector of summary statistics. Furthermore, these least-squares estimators may show significantly less estimation bias for growth rate and genetic diversity than a corresponding maximum-likelihood estimator for the same coalescent process. The least-squares statistics also scale up to genome-sized data sets with many nucleotides and loci. These results demonstrate that least-squares statistics will likely prove useful for nonlinear parameter estimation when the underlying population genomic processes have complex evolutionary dynamics involving interactions between mutation, selection, demography, and recombination.

THE estimation of ancestral and current effective population sizes in expanding populations is central to understanding the genetics of natural populations (Crandall et al. 1999). It is now possible to scan entire genomes of several individuals within a population (Nielsen et al. 2005; Schaffner et al. 2005; McVean and Spencer 2006). In this article I present a fast and reliable statistical method for estimating population parameters such as effective population size and growth rate using genomic data. Although the method is applicable to more complex population and selection models, I focus in this article on illustrating the method in a model of exponential population growth.

The problem of determining the parameters of a demographic expansion is a fundamental problem in population genetics theory and coalescents (Avise 2000). Several article have appeared addressing this problem, ranging from methods based upon summary statistics (Rogers and Harpending 1993) to those using the full data in a sample such as maximum-likelihood (ML) estimators (Griffiths and Tavaré 1994; Kuhner et al. 1998). However, for large data sets and complex models, computationally intensive methods may exhibit difficulties.

First, the methods may be prohibitively slow or difficult to implement on large data, especially when integrating the Felsenstein equation (Stephens 1999; Hey and Nielsen 2007). This often involves analysis of the “mixing properties” of a complex Markov chain Monte Carlo (MCMC) algorithm—a technically difficult task (Sisson 2007). In previous work, Vasco et al. (2001) demonstrated the close relationship of summary-statistics-based phylogenetic estimation methods to earlier coalescent methods (Watterson 1975; Tajima 1983, 1989; Fu 1995) as well as those based on full-likelihood approaches. They argued that instead of utilizing the full likelihood, an analogous function, determined by least-squares (LS) fitting of the data, may be computed in which the full data are replaced by a vector of summary statistics. These are still standard methods that are widely used for coalescent data analysis across the subdisciplines of evolutionary genetics (Avise 2000; Emerson et al. 2001; Knowles 2004; Hickerson et al. 2006). However, important questions remain: When does the minimum of the LS function coincide with the maximum of the ML approach? How do the performance and reliability of the two methods compare over various samples sizes of nucleotide sequences? How does performance compare for fast vs. slow population expansions? As data sets become larger and more complex, these kinds of questions are likely to loom large in the near future.

A second and related problem involves diagnosing systematic patterns of estimation bias for likelihood modeling of nucleotide sequence data. Recent work utilizing MCMC methods has indicated that likelihoods for infinite-sites coalescent models may be quite complicated for changing population size (Stephens 1999; Stephens and Donnelly 2000; Wakeley et al. 2001; Hey and Nielsen 2007). Similar observations have been made in likelihood modeling of microsatellite data (Beaumont 1999). These studies have suggested that there may be rugged topographies underlying the likelihood surface, rendering accurate estimation (or identifiability) of population parameters difficult or impossible in some regions. Recently, Sisson (2007) discussed the general problem of Bayesian computation with regard to intractable likelihoods for complex genetical data.

To address both of these problems I take the following approach. First, I present the coalescent model with a focus on its close relationship to summary genetic data in a sample (Vasco et al. 2001; Hein et al. 2005; Marjoram and Tavaré 2006). Next, I review two standard coalescent point estimation methods and from these discuss more computationally intensive, simulation-based estimation methods (Hjorth 1994; Gourieroux and Monfort 1996; Givens and Hoeting 2005). Simulation-based estimation is a natural extension of the computationally efficient LS coalescent point estimators developed in past work (Fu 1994a,b; Li and Fu 1994; Deng and Fu 1996; Vasco et al. 2001). I then compare the computationally intensive statistics with estimates obtained using EVE (Vasco et al. 2001), which gives a LS estimate, and FLUCTUATE (Kuhner et al. 1998), which gives a maximum-likelihood estimate. Both of these programs give point estimates of the genetic diversity parameter θ = 2Nμ (where N is the effective population size, and μ is the mutation rate) and the population growth rate given a sample of nucleotide sequences. Finally, I demonstrate that it is possible to obtain a nearly unbiased estimate of the population parameters by recursive use of the LS estimator. The estimator is applied to the African human mtDNA sample collected by Vigilant et al. (1991).

GENOMIC DATA, COALESCENTS, AND STOCHASTIC SIMULATION

Population genomic data can be characterized by three properties (see, for example, Christianini and Hahn 2007): first, the observed data are determined by a single stochastic realization of the evolutionary process; second, they consist of a sample space of many dimensions (i.e., genomic scans generate a large number of sequenced sites; also, a large number of individuals may be sequenced—the sample space maybe quite large); and third, stochastic simulation must often be used to infer genomic parameters since analytic methods rarely exist to describe complex stochastic processes. Inferring population parameters from genomic data requires linking together all three of these properties into a single statistical theory. Coalescent theory offers a means to do this by generating random samples from a Wright–Fisher model: one can efficiently undertake the joint simulation of genealogies and mutations (Vasco et al. 2001). This allows utilizing coalescent theory as the cornerstone for the statistical analysis of molecular population samples (Fu and Li 1999). In this section I develop a statistical coalescent model with a focus on its close relationship to summary genetic data in a sample of DNA sequences.

Figure 1A shows how to go from the complete data, a sampled set of five DNA sequences, to a summary of the data: each large dot represents a mutation (neutral gene substitution) on one of the sequences shown in Figure 1C (shown as smaller dots). The pattern of polymorphism for the sample can be observed as the number of neutral nucleotide substitutions (Kimura 1983) along a lineage of the tree, d_i, where each i corresponds to a branch in the coalescent tree. This quantity is useful as summary measure of the complete data (the set of sequences).

Figure 1.—

(A) DNA sequences equation M1

, with associated labeled mutations 1, 2, 3, 4, 5, 6. This schematic shows the structure of the summary statistics if the sequences evolve according to the coalescent genealogical structures shown in B–D. Comparing summary data (more ...)

Next, consider the coalescent tree of the sample itself, without recombination and migration, so that the sampled sequences are connected by a single genealogy as shown in Figure 1B. In the Wright–Fisher model each generation is discrete and formed randomly by sampling N parents with replacement from the current generation. In general, for any coalescent tree, it is possible to label the time intervals, t_n, for 2, …, n sequences and these are referred to here as the nth coalescent times. These intervals are random variables representing the time duration (the number of generations) required for coalescence from n sequences to n − 1 sequences. Thus, in Figure 1B each of the five sequences can be traced back in time first to n − 1 ancestral sequences, next to n − 2 sequences, and so on until a single ancestral sequence remains (the most recent common ancestor, MRCA).

We have now covered the first two criteria for developing a statistical theory of genomic data: it has been demonstrated how the complete data (a sampled set of sequences) are connected to a coalescent tree. Before the role of stochastic simulation and statistical inference can be introduced, I first develop some bookkeeping rules that are useful for large coalescent-based phylogenies.

Each branch length of the tree can be represented as a simple sum of the coalescent times between ancestors. It is not difficult to show that, since the jump process of a coalescent tree is fixed (Fu 1994a,b), the branch lengths of any coalescent tree can be computed as

(1)

where s_ik represents a set of index variables for each branch that bookkeeps the number of times the coalescent time contributes to the length of the ith branch. Thus, for branch i, one can define n − 1 index variables (k = 2, · · ·, n) such that s_ik = 1 if the branch has a segment of length t_k between the kth and the (k − 1)th coalescence and s_ik = 0 otherwise. Thus, the branch lengths over the entire topology of a tree for a sample of n genes can be quantitatively characterized in terms of the set of 2(n − 1)² variables and corresponding coalescent times. For example, the coalescent tree shown in Figure 1B has 2(5 − 1)² = 32 index variables. Detailed examples of how to use this bookkeeping device appear in Fu (1994a,b) as well as in Deng and Fu (1996). Here I use it to define a (2n − 2) × (n − 1) matrix J for the jump process of the coalescent, where each element is determined by the s_ik variables (see Figure 1). If the vector of coalescent times is defined as T = ( equation M3

)^T, where the superscript T signifies transposition, then any vector of branch lengths equation M4

, satisfying Equation 1, can be written in matrix form

This equation is an expression of the fact that the jump process of the coalescent, J, is invariant with respect to model assumptions regarding the distribution of coalescent times or branch lengths of a tree once its topology is fixed.

Consider next the effect of exponential population expansion on the coalescent times and branch lengths of the tree. Thus, N(t) = N₀e^gt, where g is the growth rate (g > 0), t is the time since the initial generation, and N₀ is the initial effective population size. In using a coalescent approach, it is useful to look backward in time at N(−t) and from this vantage it appears that the population exponentially declines in size as the population experiences coalescence toward its most recent common ancestor. In a coalescing population, currently at size N₀, one unit of time corresponds to 2N₀ generations. I now describe how to simulate coalescent trees under this model. The kth coalescent time, t_k, in a genealogy corresponds to a time interval in which exactly k ancestors exist in the sample. Each t_k is conditioned upon a time τ_k when there are more than k ancestors in the sample. The intervals between coalescent events can be efficiently estimated using the distribution

with τ_n = 0 (Griffiths and Tavaré 1994). The dependence of the time t_k upon τ_k has an intuitive basis: it occurs because each period t_k in a genealogy starts only when n sequences have coalesced to k sequences. For the cases of constant population size and exponential growth, the distribution for the t_k can be determined exactly (see Hudson 1990; Slatkin and Hudson 1991). The value t_k (k = 2, …, n) is an outcome of the coalescent with exponential growth g,

(2)

where

and U is randomly drawn from the uniform distribution on the open interval (0, 1). This last result determines the final property required to describe genomic data (stated earlier in this section): by generating random numbers stochastic simulation may be used to study the complex stochastic processes by which coalescent trees are formed. The result can be combined with a model of DNA sequence evolution to simulate the evolution of sampled sequences on the tree itself as explained below. Before this is done, however, I show how to develop a statistical coalescent model that may be used to estimate population parameters.

Assume now that the coalescent for a sample of five sequences in an exponentially growing population has produced a coalescent tree with parameters (θ, g). Although this genealogy is the product of a stochastic process, the branch lengths, l_i, and number of mutations on a branch, d_i, for a single genealogical history can be observed exactly for a simulation or accurately reconstructed for a set of sequences. Throughout this article the d_i's are the components of a data vector equation M9 , where the convention upon numbering and ordering is shown in Figure 1. I now show how to approximate the average branch lengths and number of mutations on each lineage of such a tree given its topology. For the moment I assume that the tree topology is known exactly. To compute the expected branch lengths, one simulates the coalescent time distribution using Equation 2 many times and averages the results. Each average coalescent time, equation M10 , is computed using

(3)

The quantity G represents the number of genealogies that are simulated to obtain the average kth coalescent time. Equation 3 can be used to study the statistical properties of genealogies under several different kinds of models involving population and selective change. Substitution of Equation 3 into Equation 1 yields

(4)

Utilizing a UPGMA phylogenetic reconstruction algorithm (see below) gives the expected number of mutations between pairs of coalescing sequences. The key point is that it is possible to fix the tree topology and rapidly compute an approximation of the number of mutations on each branch of the tree.

STATISTICAL MODEL

In this article I assume the infinite-sites model of mutation and that all mutations are selectively neutral. Each offspring differs from its parent by a Poisson-distributed number of mutations. Hence for individual lineages of length l_i, the number of mutations on that lineage, d_i, follows a Poisson distribution with parameter θl_i conditional on l_i, where θ = 2N₀μ. All moments of d_i (including the mean and variance) are determined by the moments of t_i. The expected number of mutations can be found using Equation 4 and is equal to

(5)

Thus one may compute the expected number of neutral mutational changes d_i (see Figure 1A) along an expected branch length l_i (recall Figure 1D) for a sample of DNA sequences. So for the entire coalescent tree, the vector of mutations

is conditioned on the vector of branch lengths equation M14

by the Poisson distribution with expectation

(6)

and variance

(7)

where

is the diagonal matrix for which nonzero entries correspond to l_i with respect to a given d_i.

Given a fixed genealogy with J and E(T), Equations 4–7 can be used to define the following nonlinear mutation model,

(8)

where

is the vector of expected branch lengths, and the model error term is equation M20

, with the ith component equation M21

Expectations for the coalescent model can be efficiently computed over the entire genealogy:

(9)

Variances

for coalescent genealogies satisfying the mutation model can likewise be computed, but are not utilized in the present article. For each branch of a coalescent genealogy the data d_i can be computed using distance, parsimony, or any other method that gives an accurate estimate of the number of mutations on a branch (see below). These data are stored in the vector

. All other quantities such as equation M24

and E(JT) represent theoretical expectations that can be efficiently computed using Monte Carlo integration of sampled genealogies based upon Equations 3 and 4.

STATISTICS AND COMPUTATION-BASED INFERENCE

In this section I discuss how to compute point estimates of targets g and θ using a LS and a ML method, how to use computational statistics and simulation to quantify the bias associated with a targeted estimate, and how to correct the bias of the targeted estimate.

Least-squares estimation: Assume the exponential growth coalescent model with parameters θ, g, N₀, where the effective population size N₀ is fixed at the time of sampling. Using (8) define the scalar function,

(10)

Then there exists a nonlinear LS estimator for (θ, g) upon solution of the optimization problem

with respect to θ and g. Application to multiple loci is straightforward: compute a function V_i(θ, g) using the coalescent history for each ith locus, sum over all i, and then minimize the sum with respect to (θ, g). This can be rapidly computed for large numbers of loci, for example. A computer program written in the C language, called Estimators for Variable Environments (EVE), was developed to perform these computations (Vasco et al. 2001). At the point g = 0, the LS minimum point determining θ for the EVE algorithm is expected to be the same as that computed by Fu's UPBLUE recursive estimator for θ when the effective population size is constant (Fu 1994a). Thus, this statistic can be considered an extension of Fu's method to the case of variable population size.

Maximum-likelihood estimation: For the problem of estimation of θ and g from sequence data Kuhner et al. (1998) proposed the maximum-likelihood computation program FLUCTUATE. This program computes θ using a transformation in which the coalescent structure of a genealogy becomes identical to the constant population expectation of the program COALESCE (Kuhner et al. 1995). The conversion between EVE's θ and FLUCTUATE's θ is made by multiplication of the FLUCTUATE θ by number of sites in a sequence. Since EVE assumes that coalescent times are scaled by 2N₀ generations, I rescaled FLUCTUATE growth rate estimates appropriately. In addition, growth rate is scaled in per mutation units of 1/μ. Note that to make this conversion and comparison with the unscaled EVE estimate of growth rate, essentially a priori information is required with regard to the constant population size expectation of θ₀ = 2N₀μ. When computing genealogies, FLUCTUATE takes a step that is the construction of a single genealogy; a “chain” for a set of genealogies is formed to compute parameter estimates. I used those suggested in Kuhner et al. (1998) for a single locus: 10 short chains of 1000 steps followed by 2 long chains of 15,000 steps. For an initial guess of the θ, Watterson's (1975) estimate was used. For an initial guess of g I used 10⁻⁵.

Computation-based estimation: Statistical problems such as severe bias create difficulties for obtaining accurate estimates of parameters for a point estimate. Simulation-based estimation is a powerful way to diagnose some of these difficulties if one has access to a fast estimator. In this section I describe three computationally intensive estimators based upon jackknifing, bootstrapping, and simulation principles (Efron and Tibshirani 1993; Shao and Tu 1995; Gourieroux and Monfort 1996) that are used investigate statistical properties of the point estimators EVE and FLUCTUATE.

Jackknife estimation of parameters: Quenouille (1956) proposed utilizing the jackknife to estimate the bias of an estimator by deleting one datum each time from the original data set and then recalculating the estimator on the basis of the rest of the data. The jackknife estimator of a parameter p (where p = g or θ in this article), utilizes the subsample estimates of an estimator equation M27 to quantify the bias-reduction process,

(11)

where

is the estimate computed applying a method like ML and LS to the whole sample and equation M30

is the jth subsample, respectively, with the ith sequence removed, this being denoted as the subscript −i. For subsampling of the set of sequences one can take the complete set equation M31

deleting a single sequence equation M32

at a time. Each equation M33

is then a statistic computed using 2n − 1 observations equation M34

with datum d_r removed (which is uniquely determined when equation M35

is removed from the sample, so that r is equal to the subscript of the external branch corresponding to the removed sequence drawn from

). To compute the estimator stated by Equation 11 one iteratively solves the LS problem

(12)

i = 1, …, n times. As before, −i represents an iteration index that is determined by the consecutive removal of a sequence from the total set but now it can be used to uniquely determine equation M37

and V_j_,−i. Since the removal of a sequence corresponds to the removal of an external branch, to find equation M38

it suffices to compute the s_ik matrix with 0's corresponding to coalescent time entries in the full s_ik matrix for the removed branch. Having determined equation M39

and

, one then performs the matrix multiplication given by Equation 12 to obtain V_j_,−i. Minimizing V_j_,−i with respect to (θ, g) gives a jackknifed LS estimate for the sample. The subscript j bookkeeps the computation of the jackknifed LS statistic for each equation M41

. Summing the estimates with respect to j over all n subsamples then gives equation M42

in Equation 11.

Parametric bootstrap-known tree: In a parametric bootstrap, samples (replicate sets of nucleotide sequences) are simulated from a specified value of g and θ. The estimator then is fed the known tree topology (J), the known branch lengths equation M43 , and the known number of mutations on each branch for a given MC replicate. From these data, which correspond to perfect knowledge of the phylogenetic tree of the coalescent, a LS estimate is computed for each replicate. The bootstrap expectation equation M44 , where equation M45 is the MC replicate. I computed the MC estimated standard deviation equation M46 as equation M47 .

Parametric bootstrap-unknown tree: This case corresponds to that stated above except that for each MC replicate the tree topology is estimated using UPGMA (see, for example, Swofford and Olsen 1990). For each replicate, UPGMA is used to compute an approximation to the branch lengths of the tree, giving an estimate of the observed number of mutations between pairs of coalescing sequences. Statistics for the MC replicates are computed as stated in the last section, with the notational change equation M48 and equation M49 for the mean and the MC replicate, respectively.

Simulation of sampled sequences: When implementing simulation studies I assumed a Jukes–Cantor model of DNA nucleotide sequence evolution (Li 1997). This assumes that all four nucleotides are equally frequent and all substitution types are equally likely in a sample. I assumed a sequence length of 1000 sites and a per locus mutation rate μ in simulations when the target parameters were known. When simulating the Vigilant et al. (1991) data I assumed a sequence length of 610 sites. Under the infinite-sites model, the number of mutations separating a pair of sequences is simply the number of nucleotide differences between the pair of sequences (Fu 1994a). In this case the number of substitutions between sequences i and j corresponds to the number of mutations separating two coalescing sequences d_ij (Tajima 1983). Once the distance matrix for a sample of sequences is computed the UPGMA method can be applied to compute the tree. When applied to an empirical set of nucleotide sequences one computes the genetic distance and applies UPGMA to the sample and then inputs this into EVE (see Vasco et al. 2001 for an example using HIV sequences).

RESULTS

In this section two different ways of assessing statistical truth with respect to LS and ML estimation are investigated. The first involves the case where a target parameter p = p_T is known from simulation. For this case I investigate how accurately the known value of p can be recovered using LS or ML statistics. The robustness of the LS statistic is then investigated under recombination. The second involves the case where only an estimate equation M50 is known for p_T. For this case all that one can hope for is that equation M51 is sufficiently close to p_T since p_T itself is unknown. Nature gives researchers a single stochastic realization when fitting their stochastic simulation models to genomic data. So an important statistical question is how to robustly characterize the statistical variation associated with that realization.

To investigate the second case I used an African human mtDNA sample drawn from Vigilant et al. (1991) to compute an LS point estimate (the mtDNA control region with a sequence length of 610 bp and a sample size of n = 97). I then use simulation-based estimation to compute Monte Carlo replicates. This allows using the parametric bootstrap means to quantify bias. The bootstrap replicates also allow estimation of the variance associated with the biased mean. Finally I use the LS statistic to assess bias and correct for it. As noted by Efron and Tibshirani (1993), bias correction is a worthwhile pursuit but a dangerous one. Here I show it is possible to compute a nearly unbiased estimate of equation M52 . For the purpose of computational approaches to population genetics data analysis this result clearly demonstrates that the LS statistic has a peak at or near an estimated quantity (Cabrera and Fernholz 1999). Elimination of troublesome sources of bias allows computing reliable distributions of statistical errors for the sample that can be utilized in subsequent statistical studies such as the development of power tests and model prediction and selection (Hjorth 1994).

Performance of the LS statistic: Tables 1 and 2 show the performance of the LS method in estimating g and θ for the case θ = 40. In these tables the “size” of the table, in terms of increasing g, corresponds to the range of consistency for estimation of g. Thus, higher sample sizes generally gave better estimates of g. Table 2 shows that simultaneous estimation of θ is nearly unbiased over a wide range of g—the reason being that θ appears in Equation 8 linearly. It is also shown that a significant reduction in MC standard deviation of the estimators can be achieved by increasing sample size. In all cases, the MC standard deviation of the estimators decreases with increased numbers of sequences. There is a slight bias in estimation near g = 0.001 because of the singularity at g = 0 in Equation 2. As noted previously, however, one may use the UPBLUE method to estimate θ by Fu (1994a) at g = 0 (or at least its assumption of constant population size to compute the expected branch lengths in Equation 10).

TABLE 1

EVE g estimates

TABLE 2

EVE θ-estimates

Figure 2 graphically shows the effect of sample size on the performance of LS estimation. For higher substitution rate (θ > 40), a significant range of growth rates can be consistently estimated. Better knowledge of the gene tree always allows expansion of the range of consistent estimation. Figure 2A shows that for a range of g from 0.001 to 6, consistent estimation of g was possible for the unknown tree. Lack of knowledge of the actual gene tree significantly affects the ability to consistently estimate g > 5 for θ = 40. Figure 2B shows that the range of consistent estimation for g in the case θ = 40 is in the range of g from 0.001 to 80 for the known gene tree. Since many rapid population expansions, such as for humans and viruses, may lie in this range it is important to correct the bias for any point estimates obtained that lie in this range. Below, I show how to address this problem using the jackknife bias-correction statistic. I then compare this statistic to those obtained using the more standard parametric bootstrap method.

Figure 2.—

Performance of least-squares statistics. Lines are connected between computed mean estimates using 10,000 Monte Carlo replicates. Values used for simulating each replicate were θ = 40, μ = 0.001, N₀ = 20,000. Sequence (more ...)

Comparison of LS and ML statistics: In this section, I compare the LS and ML statistics, utilizing the LS surface as a tool to diagnose performance. A ML surface should exhibit a maximum with respect to the minimum on the LS surface. Therefore, it is possible to investigate how the statistics behave by visually scanning the distribution of MC replicates around the LS minimum in (θ, g) parameter space. I show that there are two scales in parameter space that determine the properties of the estimators. The first is determined by a “local” region surrounding a LS minimum point that determines EVE estimates. This is determined by the LS criterion, i.e., the minimization of Equation 10. In this region, one would expect the distribution of MC replicates to cluster around the minimum, with density highest over the target parameters. In fact, a second “global” region over (θ, g) parameter space plays a role with respect to the LS surface. This happens because of the appearance of a large plateau in parameter space. In this section I show how this plateau affects performance of the ML vs. the LS estimator.

Figure 3C shows the distribution of Monte Carlo replicates for the target g = 5 and θ = 60 at a sample size n = 10. Figure 3, D–F, shows the distribution of Monte Carlo replicates for the target g = 15 and θ = 60 at three different sample sizes n = 10, 30, and 100. For both g = 5 and g = 15, the sample size of n = 10 shows a cloud of LS and ML replicates close to the targets (see Figure 3, C and D). Also several outliers appear for both the LS and the ML estimates. However, the LS outliers appear to track the valley in the LS surface, while the ML estimators tend to be dispersed on a steeply rising ridge in the LS surface as shown in Figure 3A. The height of these ridges requires blowing up the region around the targets to see contours of the valley (see Figure 3B). FLUCTUATE outperforms EVE for the case shown in Figure 3C. It has a MC mean closer to the target than EVE, with smaller standard deviation. However, for the cases shown in Figure 3, D–F, the presence of outliers shifts the mean MC value for θ away from its target. This gets progressively worse as the sample size increases—see Figure 3, E (n = 30) and F (n = 100). In contrast to the severe estimation bias observed for θ, FLUCTUATE estimation of g appears to be nearly unbiased. However, this conclusion must be accepted with the proviso that perfect knowledge of θ₀ = 2N₀μ is available. Table 3 gives the MC means and standard deviation for the comparison between LS and ML statistics.

Figure 3.—

Comparison of least-squares (EVE) and maximum-likelihood (FLUCTUATE) estimation of growth rate (g) and θ. All Monte Carlo replicates were computed using parametric bootstrapping (with the phylogenetic tree unknown). (A) The LS surfaces are shown (more ...)

TABLE 3

Comparison between LS and ML statistics

For larger sample sizes the distributions of the ML and LS estimates move further apart. Overall, these results imply that the ML estimator FLUCTUATE does not exhibit consistency at high growth rates and sample sizes when estimating θ. However, at small growth rate and small sample size FLUCTUATE outperforms EVE.

Previous MCMC studies for infinite-sites models (Wakeley et al. 2001) have shown that the likelihood surface for changing population size has a complex topography: the parameters become nonidentifiable in some regions and plateaus appear. Beaumont (1999) made similar observations with likelihood modeling of microsatellite data. The findings of this section lend support to this earlier work. Under rapid population expansion the majority of the mutations are in the terminal branches of the tree. In a population sample, this appears as a sequence with a common ancestral haplotype and with the remaining haplotypes appearing as singletons. Such a star-shaped coalescent tree gives rise to a difficult data set for an MCMC coalescent likelihood method. For these data the MCMC computation may generate the same star-shaped pattern for many choices of θ and growth rate. Indeed, Stephens (1999) and Stephens and Donnelly (2000) have previously predicted such computational difficulties for certain classes of MCMC methods. They noted that some MCMC algorithms, such as FLUCTUATE, use a fixed or “driving” value of θ. This gives rise to the possibility that such algorithms find modes in an essentially flat surface or ridge where none are present. Figure 3A shows a steeply rising LS contour up to a long flat plateau. The FLUCTUATE MC replicates project onto the contours of such a plateau (see Figure 3F). The large pink dot centered at the cloud of replicates suggests that it may be at the center of a mode “found” by FLUCTUATE where clearly none is present.

Effect of recombination on the LS statistic: Schierup and Hein (2000a) demonstrated that even small amounts of recombination can produce biases in tree reconstruction. These biases can affect phylogenetic inferences using reconstructed trees for nucleotide sequence data. For example, bias from recombination can cause trees to superficially resemble phylogenies for sequences from an exponentially growing population. They also found that recombination created a large overestimation of rate heterogeneity and that standard ML methods [the DNAml and DNAmlk programs of PHYLIP (Felsenstein 1995)] could not infer the presence of a molecular clock from sequence data (Schierup and Hein 2000b). Given that the LS method uses the shape of a single phylogenetic tree to estimate parameters, I tested the LS estimator on several thousand simulated samples of nucleotide sequences exhibiting varying amounts of recombination. I used the coalescent algorithm with recombination described in Hudson (1983), modified so as to include exponential population growth. Since the generation of the coalescence and recombination events are independent, I was able to use the approach described in this article to efficiently simulate the coalescent time distribution under recombination (Fearnhead and Donnelly 2001). The mutation model used was the Jukes–Cantor model (Li 1997) and was applied to simulate nucleotide data sets under recombination as described in the article by Schierup and Hein (2000a).

I examined the performance of estimation under two scenarios. First, if the growth rate is zero, how well will the LS estimator perform given varying rates of recombination? Second, in an exponentially growing population with high growth rate, how well does the LS estimator perform under varying rates of recombination?

The results for the first case are shown in Figure 4. Setting the growth rate to zero in the simulations allows generating a null distribution on the estimators to determine the range of bias created by recombination. For low to moderate rates of recombination (ρ = 0–15) the LS estimator performs reasonably well. One observes growth rates close to zero in each case. The range of estimation bias lies well within the range of that observed to be caused by intrinsic bias in the coalescent time distribution. For ρ > 15, however, the bias due to recombination gradually increases with the amount of recombination. Each point in Figure 4 was computed from the mean over 1000 replicated data sets.

Figure 4.—

Reliability of estimating (A) growth rate (g) and (B) θ under recombination. Each point represents an estimate of g = 0 under a given level of recombination rate (ρ). The level of bias under recombination is about the same as expected (more ...)

For the second case I simulated samples of nucleotide sequences under moderate growth rate (g = 15) and varying rates of recombination ρ = 0–50. The results are shown in Figure 5. As in the previous case, in the region (ρ = 0–15) the LS estimator appears to perform reasonably well. In this region the estimates of g and θ exhibit low bias, on the order of that observed in the absence of recombination. For ρ > 15, the bias caused by recombination starts to increase the estimation bias of the observed estimate of g, causing it to become more and more inflated. This bias appears to rise more rapidly in the presence of high growth than in the absence of growth (compare Figures 4 and 5). Each point in Figure 5 was computed from the mean over 1000 replicated data sets.

Figure 5.—

Reliability of estimating (A) growth rate (g = 15) and (B) θ (= 60) under a given level of recombination (ρ), with n = 100. The level of bias under recombination is about the same as expected as that produced by (more ...)

Inference on sample with parameters unknown: LS statistics were computed for the African subset of the sampled human mtDNA sequenced by Vigilant et al. (1991). This gave the LS point estimates equation M53 , which imply a significant recent population expansion (see Figure 6A). The distributions of the MC replicates for all three simulation-based estimators track the LS surface. This makes sense in light of the results shown earlier (see Tables 1 and 2, Figure 2) for the case when the target is known.

Figure 6.—

Comparison of simulation-based estimators of population growth rate (g) and genetic diversity (θ) for the Vigilant et al. (1991) African human mtDNA sample (n = 97). Each case is followed by the Monte Carlo (MC) mean estimate and standard (more ...)

Consider next the quantification of estimation bias. Parametric bootstrap estimates of the point estimate were computed with the tree assumed unknown. The results for this are shown in Figure 6B. The mean is shown as the large dot at (θ, g) = (197.82 ± 39.35, 39.70 ± 13.13), and the small blue dots correspond to MC replicates. Although extreme bias is present in this estimator the spread of the replicates closely tracks the LS surface determined by the point estimate. Next parametric bootstraps were computed of the point estimate with the tree assumed known. The results are shown in Figure 6C. The MC mean is shown as the large dot at (θ, g) = (259.52 ± 66.78, 69.88 ± 30.25). This estimator shows substantially less bias. Finally, the nonparametric jackknife estimator gives a nearly unbiased estimate of the target. The small dots in Figure 6D are the MC jackknife replicates and these cluster tightly around the target. Although the estimated population growth rate for the African sample was very high it was still possible to obtain a nearly unbiased estimate of the point estimate using the jackknife.

The jackknifing statistic allows taking advantage of the intuitive idea that most of the mutations with information with regard to computing accurate estimates of θ and g lie on the tips of the coalescent tree. Thus, jackknifing may be regarded as a computationally cheap way to take advantage of this property to obtain nearly unbiased estimates of the parameters. It is essentially a recalibration method that allows reweighting the estimator using the residuals of the nonlinear mutation model. The jackknifing is so efficient it effectively weights each external branch as if it were adding genetic information from many independent loci (see Figure 2 for the cases n = 30, l = 10). However, in this case the only information in the sample is from a single locus—the human mtDNA.

DISCUSSION

In this article I show that it is possible to infer sources of bias with regard to coalescent estimation of a sample of sequences in an expanding population. Once the bias has been quantified, it is then possible to rapidly compute a bias-corrected estimator. In this section, I address two issues associated with these problems: LS estimation vs. ML estimation and the relationship between the results of this article and ongoing work in MCMC estimation methods.

Least-squares vs. maximum-likelihood estimation: In many situations a ML estimator is preferable over a LS estimator because the former is more flexible and can incorporate statistical tests more readily than the latter. However, estimation of demographic and selective parameters using the coalescent is an inherently nonlinear problem. That is, the surface to be minimized or maximized is likely to have multiple optima and a complex topography. Estimating likelihoods when the tree is not known or approximated is extremely computationally intensive in determining this surface. Despite this, using a parametric bootstrap it is possible to compare the ML and LS estimators. One generates Monte Carlo replicate data sets and superimposes the resulting estimates on top of each other. The LS estimates should cluster around the minimum and the ML should cluster around a maximum. The minima and the maxima should be approximately centered over each other. Thus the two clusters should lie roughly on top of each other with centers over the target. This is true for the LS estimator (EVE) but not always for the ML estimator analyzed in this article (FLUCTUATE). As sample size increases the correlation between optima located on the LS surface and optima located on the ML surface becomes lost. As sample size increases one would expect that an efficient estimator should become more accurate in predicting the target. My results show that the LS estimator does have a minimum over the target and becomes more accurate in predicting the target as sample size increases. I also show that the LS point estimator can be efficiently bias corrected if a nonparametric resampling estimation method is used.

Measurement error, summary statistics, and MCMC: Recently Sisson (2007) pointed out some of the technical difficulties associated with developing MCMC algorithms for complex genetic data. He reviews computational problems associated with application of standard Metropolis–Hastings samplers to population genetics problems involving intractable likelihoods. As noted above, Stephens (1999) earlier pointed to similar problems, specifically centered around integration of the Felsenstein equation (Hey and Nielsen 2007). An important part of Sisson's critique lies in evaluating how to construct proposals for MCMC algorithms on the basis of summary statistics that allow efficient “mixing” of the algorithm. He points to the analysis of Tanaka et al. (2006), who constructed a stochastic model at the genetic level that allows direct application of the Metropolis–Hastings approach of Marjoram et al. (2003) to a summary statistic. Such an approach is related to the methods developed in the present article. However, the method would have been substantially slower and I would never have been quite sure when to stop the algorithm. Instead I implemented a cheaper LS model that allowed obtaining a fast answer. The frequentist framework developed here can be expressed within the framework of Bayesian sensitivity analysis (Oakley and O'Hagan 2004). Whatever framework one chooses, however, the present article offers up a cheap and fast method of obtaining parameter estimates for complex genetic data sets that can be compared to computationally expensive MCMC methods. Future work should lie in bringing together these two approaches into a more complete and useful statistical theory of genomic data.

Acknowledgments

I thank the associate editor, two anonymous reviewers, Romulus Breban, Gordon Harkins, and Paul Schliekelman for their work on helping me to improve the overall presentation and quality of the manuscript. This work was partly supported by grants from the National Institutes of Health (R01 HD34350-01A1) to Keith A. Crandall and (R29 GM50428) to Y.-X. Fu.

References

Avise, J. C., 2000 Phylogeography: The History and Formation of Species. Harvard University Press, Cambridge, MA.
Beaumont, M. A., 1999 Detecting population expansion and decline using microsatellites. Genetics 153: 2013–2029. [PubMed].
Cabrera, J., and L. T. Fernholz, 1999 Target estimation for bias and mean square error reduction. Ann. Stat. 27:(3): 1080–1104.
Crandall, K., D. Posada and D. A. Vasco, 1999 Effective population sizes: missing measures and missing concepts. Anim. Conserv. 2: 317–319.
Christianini, N., and M. W. Hahn, 2007 Introduction to Computational Genomics. Cambridge University Press, Cambridge, UK.
Deng, W.-H., and Y.-X. Fu, 1996 The effect of variable mutation rates across sites on phylogenetic estimation of effective population size. Genetics 144: 1271–1281. [PubMed].
Efron, B., and R. J. Tibshirani, 1993 An Introduction to the Bootstrap (Monographs on Statistics and Applied Probability 57). Chapman & Hall, London.
Emerson, B. C., E. Paradis and C. Thebaud, 2001 Revealing the demographic histories of species using DNA sequences. Trends Ecol. Evol. 16:(12): 707.
Fearnhead, P., and D. P. Donnelly, 2001 Estimating recombination rates from population genetic data. Genetics 159: 1231–1241. [PubMed].
Felsenstein, J., 1995 PHYLIP (Phylogeny Inference Package), Version 3.572. University of Washington, Seattle.
Fu, Y.-X., 1994a A phylogenetic estimator of effective population size or mutation rate. Genetics 136: 685–692. [PubMed].
Fu, Y.-X., 1994b Estimating effective population size or mutation rate using the frequencies of mutations of various classes in a sample of DNA sequences. Genetics 138: 1375–1386. [PubMed].
Fu, Y.-X., 1995 Statistical properties of segregating sites. Theor. Popul. Biol. 48: 172–197. [PubMed].
Fu, Y.-X., and W.-H. Li, 1999 Coalescing into the 21st century: an overview and prospects of coalescent theory. Theor. Popul. Biol. 56: 1–10. [PubMed].
Givens, G. H., and J. A. Hoeting, 2005 Computational Statistics. Wiley-Interscience, Hoboken, NJ.
Gourieroux, C., and A. Monfort, 1996 Simulation-Based Econometric Methods. Oxford University Press, Oxford.
Griffiths, R. C., and S. Tavaré, 1994 Sampling theory for neutral alleles in a varying environment. Philos. Trans. R. Soc. Lond. B 344: 403–410. [PubMed].
Hein, J., M. H. Schierup and C. Wiuf, 2005 Gene Genealogies, Variation and Evolution. A Primer in Coalescent Theory. Oxford University Press. Oxford.
Hey, J., and R. Nielsen, 2007 Integration with the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc. Natl. Acad. Sci. USA 104: 2785–2790. [PubMed].
Hickerson, M. J., G. Dolman and C. Moritz, 2006 Comparative phylogeographic summary statistics for testing simultaneous vicariance. Mol. Ecol. 15: 209. [PubMed].
Hjorth, J. S. U., 1994 Computer Intensive Statistical Methods: Validation Model Selection and Bootstrap. Chapman & Hall, London.
Hudson, R. R., 1983 Properties of a neutral allele model with intragenic recombination. Theor. Popul. Biol. 23: 183–201. [PubMed].
Hudson, R. R., 1990 Gene genealogies and the coalescent process, pp. 1–44 in Oxford Surveys in Evolutionary Biology, edited by P. H. Harvey and L. Partridge. Oxford University Press, London/New York/Oxford.
Kimura, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK.
Knowles, L. L., 2004 The burgeoning field of statistical phylogeography. J. Evol. Biol. 17: 1–10. [PubMed].
Kuhner, M. K., J. Yamoto and J. Felsenstein, 1995 Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics 140: 1421–1430. [PubMed].
Kuhner, M. K., J. Yamoto and J. Felsenstein, 1998 Maximum likelihood estimation of population growth rates based on the coalescent. Genetics 149: 429. [PubMed].
Li, W.-H., 1997 Molecular Evolution. Sinauer Associates, Sunderland, MA.
Li, W.-H., and Y.-X. Fu, 1994 Estimation of population parameters and detection of natural selection from DNA sequences, pp. 112–125 in Non-Neutral Evolution: Theories and Molecular Data, edited by B. Golding. Chapman & Hall, London.
Marjoram, P., and S. Tavaré, 2006 Modern computational approaches for analyzing molecular genetic variation data. Nat. Rev. Genet. 7: 759–770. [PubMed].
McVean, G., and C. C. A. Spencer, 2006 Scanning the human genome for signals of selection. Curr. Opin. Genet. Dev. 16: 624–629. [PubMed].
Nielsen, R., S. Williamson, Y. Kim, M. J. Hubisz, A. G. Clark et al., 2005 Genomic scans for selective sweeps using SNP data. Genome Res. 15: 1566–1575. [PubMed].
Oakley, J. E., and A. O'Hagan, 2004 Probabilistic sensitivity analysis of complex models: a Bayesian approach. J. R. Stat. Soc. B 66:(3): 751–769.
Quenouille, M. H., 1956 Notes on bias in estimation. Biometrika 43: 353–360.
Rogers, A. R., and H. Harpending, 1993 Population growth makes waves in the distribution of pairwise genetic differences. Mol. Biol. Evol. 9: 552–569.
Schaffner, S., C. Foo, S. Gabriel, D. Reich, M. J. Daly et al., 2005 Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15: 1576–1583. [PubMed].
Schierup, M., and J. Hein, 2000a Consequences of recombination on traditional phylogenetic analysis. Genetics 156: 879–891. [PubMed].
Schierup, M., and J. Hein, 2000b Recombination and the molecular clock. Mol. Biol. Evol. 17: 1578–1579. [PubMed].
Shao, J., and D. Tu, 1995 The Jackknife and the Bootstrap. Springer-Verlag, New York.
Sisson, S. A., 2007 Genetics and stochastic simulation do mix! Am. Stat. 61:(2): 112–119.
Slatkin, M., and R. R. Hudson, 1991 Pairwise comparisons of mitochondrial DNA sequences in stable and exponentially growing populations. Genetics 129: 555–562. [PubMed].
Stephens, M., 1999 Problems with computational methods in population genetics. Bulletin of the 52nd Session of the International Statistical Institute, Helsinki.
Stephens, M., and P. Donnelly, 2000 Inference in molecular population genetics. J. R. Stat. Soc. B 62: 605–635.
Swofford, D. L., and G. J. Olsen, 1990 Phylogenetic reconstruction, pp. 411–501 in Molecular Systematics, edited by D. M. Hillis and C. Moritz. Sinauer Associates, Sunderland, MA.
Tajima, F., 1983 Evolutionary relationship of DNA sequences in finite populations. Genetics 105: 437–460. [PubMed].
Tajima, F., 1989 The effect of change in population size on DNA polymorphism. Genetics 123: 597–601. [PubMed].
Tanaka, M. M., A. R. Francis, F. Luciani and S. A. Sisson, 2006 Estimating tuberculosis transmission parameters from genotype data using approximate Bayesian computation. Genetics 173: 1511–1520. [PubMed].
Wakeley, J., R. Nielsen, S. N. Liu-Cordero and K. Ardlie, 2001 The discovery of single-nucleotide polymorphisms—inferences about human demographic history. Am. J. Hum. Genet. 60: 1332–1347.
Watterson, G. A., 1975 On the number of segregating sites in genetic models without recombination. Theor. Popul. Biol. 7: 256–276. [PubMed].
Vasco, D. A., K. A. Crandall and Y.-X. Fu, 2001 Molecular population genetics: coalescent methods based on summary statistics, pp. 173–216 in Computational and Evolutionary Analysis of HIV Molecular Sequences, edited by A. G. Rodrigo and G. H. Learn, Jr. Kluwer Academic Publishers, Dordrecht, The Netherlands.
Vigilant, L. M., M. Stoneking, H. Harpending, K. Hawkes and A. C. Wilson, 1991 African populations and the evolution of human mtDNA. Science 253: 1503–1507. [PubMed].