Benchmarking Tools for the Alignment
of Functional Noncoding DNA

Pollard, Bergman, Stoye, Celniker & Eisen (2004) BMC Bioinformatics 5(1):6
Home
Figures
View the figures from the paper
Data
Download primary data
Supplement
New tools and additional analysis
Methods
Enhanced with links
People
Find people involved in this project
Methods from the paper and the supplemental analysis. [Please note that this page is currently being developed and will soon incorporate much more functionality]
Section headings:
Modelling input sequences for the simulation of Drosophila noncoding DNA.
Divergence estimates in flies, worms and mammals.
Simulating noncoding sequence divergence.
Tools for aligning noncoding DNA.
Alignment performance measures.
References.
Modelling input sequences for the simulation of Drosophila noncoding DNA.
To generate biologically relevant input sequences for our simulation, we estimated properties of noncoding sequences in the genome sequences of the fruitfly, D. melanogaster. First we extracted all noncoding regions from the Release 3 D. melanogaster genomic sequences based on annotations in the Gadfly database [1-3]. This was accomplished by masking all DNA corresponding to coding exons, producing inter-coding-exon intervals. Subsequent to extracting noncoding regions, transposable elements were masked using annotations in Gadfly to create "pre-integration" noncoding sequences. In our analysis, we chose to treat all noncoding sequences (intergenic, intronic, untranslated region) together since many noncoding sequences cannot be unambiguously categorized because of alternative splicing or alternative promoter usage. Moreover, previous results revealed that similar evolutionary constraints act on intergenic and intronic sequences in Drosophila [4]. Summary statistics of noncoding sequence lengths were calculated using the R statistical package (Figure 1) [5].

The probabilistic dependence of adjacent bases in D. melanogaster noncoding sequences was assessed by Markov chain analysis in order to create an accurate model of random noncoding sequences [6]. TE-masked noncoding sequences were concatenated, and n-mers of size 1 to 10 were counted. Counts of reverse complementing n-mers were averaged, and used to estimate frequencies of each n-mer [7]. Based on these counts and frequencies, we determined the likelihood of Markov chains of orders 1 through 9 describing Drosophila noncoding sequences, and evaluated the likelihood of each Markov chain using the Bayesian information criterion [6, 8]. This analysis revealed that D. melanogaster noncoding sequences are best modeled by a 7th-order Markov chain (data not shown). We therefore created the ancestral input sequences for our evolution simulations using a 7th-order Markov chain. We note that because our evolutionary simulation models bases independently (see below), the higher order structure of these ancestral input sequences was not maintained in the more divergent derived output sequences. Nevertheless, sequences generated by a 0th-order Markov chain gave qualitatively and quantitatively similar simulation and alignment results, with correlation among performance measures for the 0th-order and 7th-order generated sequences exceeding an r2 of 0.97 (data not shown).

Divergence estimates in flies, worms and mammals.
Estimates of silent site divergence (Ks) between H. sapiens vs. M. musculus, C. elegans vs. C. briggsae, and D. melanogaster vs. D. pseudoobscura were obtained using the yn00 method in PAML (version 3.13) [9, 10]. The mean and median of Ks were calculated for 29 fly, 193 worm, and 153 mammalian coding sequence alignments taken from references [11], [12] and [13], respectively.

Simulating noncoding sequence divergence.
Noncoding sequence evolution was simulated using a modified version of the sequence simulation program ROSE [14]. In general, in the absence of large datasets of noncoding sequences from closely related Drosophila species, we have taken estimates of noncoding evolution from previous results reported in the literature. Beginning with ancestral sequences, evolution occurred on two descendent branches of equal length under the HKY model of point substitution [15], with a transition/transversion bias of 2 to reflect the nucleotide and transition biases observed in Drosophila noncoding sequences [4, 16, 17]. The substitution rate was set to 0.01 such that a branch length unit was on average 0.01 substitutions per site. Total branch lengths spanned a range of divergence times from 0.25 to 5.0 substitutions per site. Insertion/deletion evolution was based on the length distribution of polymorphic indels estimated in [18], and occurred at a 10-fold lower rate than point substitution, approximating relative rates estimated in [19, 20].

To model the evolution of constrained blocks in noncoding sequences a modification of the ROSE sequence simulation program was developed to map constraints on ancestral sequences onto derived sequences (available for download as ROSE version 1.3 from [21]). Constraints on noncoding sequences were modelled as short blocks of highly conserved sequences typical of cis-regulatory sequences, and follow a lognormal distribution with parameters estimated in [4]. On average, interspersed blocks of constrained sites accounted for 20% of the sites in ancestral sequences, a conservative estimate of constraint in Drosophila noncoding DNA [4]. Parameters used in our simulations are summarized in Table 1.

Estimation of evolutionary distance for simulated alignments was performed using the F84 model of sequence evolution in the DnaDist program of the PHYLIP package [22] with a transition:transversion ratio of 1.0 (note that a transition:transversion ratio of 1.0 in PHYLIP is equivalent to a transition/transversion bias of 2 in ROSE, see discussion in [10]). Summary statistics for the simulations were calculated using the R statistical package (Figure 2) [5].

Tools for aligning noncoding DNA.
The alignment tools tested in this study were chosen based on the criteria that they are (1) publicly available, (2) run in batch mode from the command line and are able to produce (3) strictly co-linear, (4) error-free, pairwise genomic alignments of sequences (5) up to 10 Kb in length. Tools like BBA [23] (5), Bl2seq [24] (3), DBA [25] (4), MUMmer [26] (3), Owen [27] (2) and SSEARCH [28] (3) were not evaluated since they do not satisfy one of these criteria. We now briefly describe the tools that we tested.

Avid [29] is a pairwise global alignment tool whose general strategy for aligning two sequences is to anchor and align iteratively. A set of maximal (but not necessarily unique) matches between the sequences is constructed using a suffix tree. Dynamic programming is used to order and orient the longest matches, which are then fixed. For each subsequence remaining between the fixed matches, the process is repeated until every base is aligned. When sequences are short and the matches make up less than half of the total sequence, the program defaults to the Needleman-Wunsch algorithm [30].

The Chaos/Lagan [31, 32] suite of tools consists of a pairwise local alignment tool, Chaos, and a global alignment tool, Lagan. Chaos starts by finding all words between the two sequences of a specified length and a specified maximum number of mismatches. These words are then chained together if they are close together in both sequences. These maximal chains are then scored and all chains that are above a specified threshold are returned. Lagan starts by running Chaos with conservative parameter settings and then finds the optimal path through the maximal chains using dynamic programming. Lagan then recursively calls Chaos with increasingly more permissive parameters on the regions between each maximal chain in the optimal path. When the recursion has created a dense map of maximal chains that have been ordered with dynamic programming, Lagan runs the Needleman-Wunsch algorithm on the whole length of both sequences but puts close bounds around the maximal chains to provide the final global alignment. Chaos was run on default parameters as well as using parameters suggested by the authors: word length = 7, number of degeneracies = 1, score cut-off = 20 and extension mode on.

BlastZ [33] is a pairwise local alignment tool that is based on the gapped BLAST algorithm that has been redesigned for the alignment of long genomic sequences. BlastZ first removes lineage-specific interspersed repeats from each sequence, then searches for short near-perfect matches between the two sequences. Each match is extended first using gap-free dynamic programming and if it scores above a specified threshold it will be extended using dynamic programming with gaps; extended matches that score above a specified threshold are then kept. Part of the unique implementation of BlastZ is that it can be forced to return alignments that are both unique within each sequence as well as collinear with respect to each other. To satisfy our strict collinear requirement, we ran BlastZ with both of these options. Blastz was also run using the authorŐs suggestion of lowering the score cut-off (k) to 2000 (BlastZ-A).

DiAlign (v. 2.1) [34] is a segment-to-segment alignment algorithm. Like the BLAST algorithms, DiAlign looks for short ungapped segments that have a similarity that deviates from what would be expected by random chance, keeping segments with a score above a certain threshold. These high scoring segments are then aligned into a collinear global alignment using a dynamic programming algorithm. DiAlign produces a global alignment but distinguishes high confidence columns of an alignment from low confidence columns. We used DiAlign as both a global (DiAlign-G) and a local (DiAlign-L) alignment tool.

ClustalW (v. 1.8) [35] was used on default settings. ClustalW is a progressive multiple alignment tool that reduces to the Needleman-Wunsch algorithm in the pair-wise case with default parameters of a match score of 1.9, mismatch penalty of 0, a gap open penalty of 10 and a gap extension penalty of 0.1.

The second implementation of the Needleman-Wunsch algorithm used in this study is the needle program in the EMBOSS suite of tools [36]. needle was used with default parameter settings of a match score of 5, a mismatch penalty of 4, a gap open penalty of 10 and a gap extension penalty of 0.5.

The final tool tested, WABA [37], is a three-tier alignment algorithm. The first tier partitions the first sequence into overlapping windows of 2 Kb and then defines a synteny map of high scoring 2 Kb windows of the first sequence onto the second sequence. The second tier then carefully aligns syntenic regions using a seven-state, pair Hidden Markov Model that includes separate query and database insertion/deletion states, high and low noncoding conservation states, as well as three coding states (one for each position in a codon). The final tier then attempts to assemble individual alignments together into a more global alignment.

Alignment performance measures.
The performance of alignment tools was assessed using six basic measures: overall coverage, overall sensitivity, constraint coverage, constraint sensitivity, constraint specificity and local constraint sensitivity. Overall coverage and overall sensitivity were measured for all four evolutionary regimes (A-D) while the constraint measures were only measured in the two regimes that included constrained blocks (C, D). Alignments produced by each alignment tool were parsed to generate the statistics, which were then used to calculate each performance measure.

Each site in an alignment produced by a tool (a site being a base in one strand of a column of an alignment) can have two simulated alignment states, two constraint states, three tool alignment states, and two conditional tool alignment states. The two simulated alignment states are "homolog" (h), ungapped sites in the simulated alignments, and "no homolog" (nh), gapped sites in the simulated alignments. Simulations without indel evolution have only homolog sites since there are no gaps in the simulated alignments. The two constraint states are "constrained" (c), sites in constraint blocks, and "unconstrained" (u), sites not in constrained blocks. The three tool alignment states are "aligned" (a), sites aligned in the tool alignment, "gapped" (g), sites gapped in the tool alignment, and "not aligned" (na), sites not included in a local tool alignment. The two conditional tool alignment states are "aligned correctly" (ac), sites aligned to the same site in both the tool and simulated alignments, and "aligned incorrectly" (ai), sites aligned to different sites in the tool and simulated alignments. There are fourteen possible combinations of these states (e.g. homolog constrained aligned correctly, h_c_ac), giving us fourteen statistics to calculate for each estimated alignment. Counts for each statistic were used to calculate the following measures:

Overall coverage is the fraction of ungapped sites in a simulated alignment that are included in a tool alignment. Overall Coverage = (h_c_ac + h_c_ai + h_c_g + h_u_ac + h_u_ai + h_u_g) / (h_c_ac + h_c_ai + h_c_g + h_c_na + h_u_ac + h_u_ai + h_u_g + h_u_na)

Overall sensitivity is the fraction of ungapped sites in a simulated alignment that are aligned to the correct base in a tool alignment. Overall Sensitivity = (h_c_ac + h_u_ac) / (h_c_ac + h_c_ai + h_c_g + h_c_na + h_u_ac + h_u_ai + h_u_g + h_u_na)

Constraint coverage is the fraction of ungapped constrained sites in a simulated alignment that are included in a tool alignment. Constraint Coverage = (h_c_ac + h_c_ai + h_c_g) / (h_c_ac + h_c_ai + h_c_g + h_c_na)

Constraint sensitivity is the fraction of ungapped constrained sites in a simulated alignment that are aligned to the correct base in a tool alignment. Constraint Sensitivity = (h_c_ac) / (h_c_ac + h_c_ai + h_c_g + h_c_na)

Constraint specificity is the fraction of unconstrained sites in a simulated alignment that are gapped or not included in a tool alignment. Constraint Specificity = (h_u_g + h_u_na + nh_u_g + nh_u_na) / (h_u_ac + h_u_ai + h_u_g + h_u_na + nh_u_a + nh_u_g + nh_u_na)

Local constraint sensitivity is the fraction of sites that are both, contained in a tool alignment and are ungapped constrained sites in a simulated alignment, that are aligned to the correct base in the tool alignment. Local Constraint Sensitivity = (h_c_ac) / (h_c_ac + h_c_ai + h_c_g)

For each of these six measures, a mean and standard error of the mean were calculated for up to 1000 replicates (local tools do not always return an alignment and replicates which produced no alignment were not counted toward the mean) using R.

References.
1. Celniker SE, Wheeler DA, Kronmiller B, Carlson JW, Halpern A, Patel S, Adams M, Champe M, Dugan SP, Frise E, et al: Finishing a whole genome shotgun sequence assembly: release 3 of the Drosophila euchromatic genome sequence. Genome Biology 2002, 3:research0079.

2. Misra S, Crosby MA, Mungall CJ, Matthews BB, Campbell K, Hradecky P, Huang Y, Kaminker JS, Millburn GH, Prochnik SE, et al: Annotation of the Drosophila euchromatic genome: a systematic review. Genome Biology 2002, 3:research0083.

3. Mungall CJ, Misra S, Berman BP, Carlson J, Frise E, Harris N, Marshall B, Shu S, Kaminker JS, Prochnik SE, et al: An integrated computational pipeline and database to support whole genome sequence annotation. Genome Biology 2002, 3:research0081.1-research0081.11.

4. Bergman CM, Kreitman M: Analysis of conserved noncoding DNA in Drosophila reveals similar constraints in intergenic and intronic sequences. Genome Res 2001, 11:1335-45.

5. Comprehensive R Archive Network [http://cran.r-project.org/]

6. Weir BS: Genetic Data Analysis II. Sunderland, MA: Sinauer Associates, Inc.; 1996.

7. Burge C, Campbell, A.M., Karlin, S.: Over- and under-representation of short oligonucleotides in DNA sequences. Proc Natl Acad Sci U S A 1992, 89:1358-1362.

8. Katz RW: On some criteria for estimating the order of a Markov chain. Technometrics 1981, 23:243-249.

9. Yang Z, Nielsen R: Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 2000, 17:32-43.

10. PAML (version 3.13) [http://abacus.gene.ucl.ac.uk/software/paml.html]

11. Bergman CM, Pfeiffer BD, Rinc—n-Limas DE, Hoskins RA, Gnirke A, Mungall CJ, Wang AM, Kronmiller B, Pacleb J, Park S, et al: Assessing the impact of comparative genomic sequences data on the functional annotation of the Drosophila genome. Genome Biology 2002, 3:research0086.

12. Castillo-Davis CI, Hartl DL: Genome evolution and developmental constraint in Caenorhabditis elegans. Mol Biol Evol 2002, 19:728-35.

13. Nekrutenko A, Makova KD, Li WH: The K(A)/K(S) ratio test for assessing the protein-coding potential of genomic regions: an empirical and simulation study. Genome Res 2002, 12:198-202.

14. Stoye J, Evers D, Meyer F: Rose: generating sequence families. Bioinformatics 1998, 14:157-63.

15. Hasegawa M, Kishino H, Yano T: Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol 1985, 22:160-74.

16. Moriyama EN, Hartl DL: Codon usage bias and base composition of nuclear genes in Drosophila. Genetics 1993, 134:847-858.

17. Moriyama EN, Powell JR: Intraspecific nuclear DNA variation in Drosophila. Mol Biol Evol 1996, 13:261-277.

18. Comeron JM, Kreitman M: The correlation between intron length and recombination in Drosophila. dynamic equilibrium between mutational and selective forces. Genetics 2000, 156:1175-1190.

19. Petrov DA, Lozovskaya ER, Hartl DL: High intrinsic rate of DNA loss in Drosophila. Nature 1996, 384:346-349.

20. Petrov DA, Hartl DL: High rate of DNA loss in the Drosophila melanogaster and Drosophila virilis species groups. Mol Biol Evol 1998, 15:293-302.

21. ROSE (version 1.3) [http://bibiserv.techfak.uni-bielefeld.de/rose/]

22. PHYLIP (version 3.5c) [http://evolution.genetics.washington.edu/phylip.html]

23. Zhu J, Liu JS, Lawrence CE: Bayesian adaptive sequence alignment algorithms. Bioinformatics 1998, 14:25-39.

24. Tatusova TA, Madden TL: BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol Lett 1999, 174:247-50.

25. Jareborg N, Birney E, Durbin R: Comparative analysis of noncoding regions of 77 orthologous mouse and human gene pairs. Genome Res 1999, 9:815-24.

26. Delcher AL, Phillippy A, Carlton J, Salzberg SL: Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res 2002, 30:2478-83.

27. Ogurtsov AY, Roytberg MA, Shabalina SA, Kondrashov AS: OWEN: aligning long collinear regions of genomes. Bioinformatics 2002, 18:1703-4.

28. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 1988, 85:2444-8.

29. Bray N, Dubchak I, Pachter L: AVID: a global alignment program. Genome Res 2003, 13:97-102.

30. Needleman SB, Wunsch, C.D.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48:443-53.

31. Brudno M, Chapman MA, Gottgens B, Batzoglou S, Morgenstern B: Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinformatics 2003, 4:66.

32. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Program NC, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: Efficient Tools for Large-Scale Multiple Alignment of Genomic DNA. Genome Res 2003.

33. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W: Human-mouse alignments with BLASTZ. Genome Res 2003, 13:103-7.

34. Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15:211-8.

35. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22:4673-80.

36. Rice P, Longden I, Bleasby A: EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000, 16:276-7.

37. Kent WJ, Zahler AM: Conservation, regulation, synteny, and introns in a large-scale C. briggsae-C. elegans genomic alignment. Genome Res 2000, 10:1115-25.
This page was last updated on February 23rd, 2004
Problems with the website? Please contact dpollard@socrates.berkeley.edu.