To estimate ancestral intron losses and gains, Nguyen and coauthors use an exponential-time procedure, which is practical only for a few species. In reality, the estimation can be done in linear time [2], as described briefly below. We are modeling intron presence and absence in homologous sites across organisms related by a known phylogeny. Presence and absence are encoded by 1 and 0, respectively. Introns evolve independently, by a Markov model for a binary character. On branch e, an intron is lost with probability pe(1 → 0) and an intron is gained with probability pe(0 → 1) at every site. Assuming a continuous-time Markov process,
The posterior probability for state change x → y on an edge uv is computed as
The expected numbers of gains or losses are obtained by summing the probabilities quv(0 → 1) and quv(1 → 0) over all intron sites, respectively. Nguyen and coauthors consider instead all 2N state labeling of N internal nodes to compute the expected numbers of gains and losses. A Java package implements the more efficient procedure, and is publicly available at http://www.iro.umontreal.ca/~csuros/introns/.
Nguyen et al. [1] reiterate well-known concerns of identifiability. Their Proposition 1 echoes the Pulley Principle for ambiguous root placement [5]. Proposition 2 asserts that there are two possible parameter sets pe(x → y) for every branch, which can be combined to get exponentially many choices that give the same likelihood function. The continuous-time process of Equation 1 implies pe(0 → 1) + pe(1 → 0) < 1. Such constraint leads to unique parametrization (except for the root position), and is more natural than the one proposed by Nguyen and coauthors, which is based on the variance of intron gains and losses.
Nguyen and coauthors discuss an important study by Qiu, Schisler, and Stoltzfus [7]. Qiu and coauthors constructed multiple alignments of ten gene families. The families had 68 sequences and 49 intron sites on average. Using a Bayesian framework, Qiu and coauthors estimated two intron evolution parameters per family, assuming constant rates across sites and branches. The model's adequacy and some of the conclusions can certainly be debated, especially in view of the assumption of constant rates. Nguyen and coauthors, however, speculate that the data were insufficient for valid inference, since there are 268 possible intron presence–absence patterns for the average gene family, but only 49 intron sites. The argument is not sound: the number of patterns has little to do with inference (consider the case of a protein alignment with 20k possible patterns for k sequences). It is the number of parameters that matters.