Copyright © 2007 RNA Society Predicting helical coaxial stacking in RNA multibranch loops
Reprint requests to: David H. Mathews, Department of Biostatistics and Computational Biology, University of Rochester Medical Center, 601 Elmwood Avenue, Box 712, Rochester, NY 14642, USA; e-mail: david_mathews/at/urmc.rochester.edu; fax: (585) 275-6007. Received September 18, 2006; Accepted March 21, 2007. This article has been cited by other articles in PMC. | |||||||||
Abstract The hypothesis that RNA coaxial stacking can be predicted by free energy minimization using nearest-neighbor parameters is tested. The results show 58.2% positive predictive value (PPV) and 65.7% sensitivity for accuracy of the lowest free energy configuration compared with crystal structures. The probability of each stacking configuration can be predicted using a partition function calculation. Based on the dependence of accuracy on the calculated probability of the stacks, a probability threshold of 0.7 was chosen for predicting coaxial stacks. When scoring these likely stacks, the PPV was 66.7% at a sensitivity of 51.9%. It is observed that the coaxial stacks of helices that are not separated by unpaired nucleotides can be predicted with a significantly higher accuracy (74.0% PPV, 66.1% sensitivity) than the coaxial stacks mediated by noncanonical base pairs (55.9% PPV, 36.5% sensitivity). It is also shown that the prediction accuracy does not show any obvious trend with multibranch loop complexity as measured by three different parameters. Keywords: coaxial stacking, RNA structure, multibranch loop, helical junction, dynamic programming, partition function, nearest-neighbor model | |||||||||
INTRODUCTION RNA is important in research applications and cellular processes, such as translation (Nissen et al. 2000), catalysis (Doudna and Cech 2002), genetic regulation (Miranda-Rios et al. 2001; Weilbacher et al. 2003; Tucker and Breaker 2005), and RNA interference (Fire et al. 1998; Kim 2005). Therefore, an improved understanding of the RNA structure–function relationship and the ability to predict RNA tertiary structure from the nucleotide sequence are likely to provide insights into the basic mechanisms of life and the capability of finding cures for diseases that are currently not satisfactorily understood. Considerable success has been achieved in predicting the secondary structure of an RNA molecule through comparative sequence analysis (Gutell et al. 2002) of a large set of homologous sequences and free energy minimization using a single sequence (Mathews et al. 1999; Doshi et al. 2004) or multiple sequences (Gorodkin et al. 1997; Hofacker et al. 2002; Mathews and Turner 2002). The ability to predict the secondary structure by free energy minimization using nearest-neighbor parameters (Xia et al. 1998; Mathews et al. 1999, 2004) suggests that most of the tertiary contacts of RNA molecules contribute less to the stability of the structure compared with the sum of the contributions of the canonical base-pair interactions, i.e., the cis Watson–Crick/Watson–Crick-type (Leontis and Westhof 2001) base pairs GC, AU, and GU, which define the secondary structure and the stacking interactions of these base pairs. This conclusion is also strengthened by previous experimental results showing that the secondary structure is more stable than the tertiary structure (Crothers et al. 1974; Banerjee et al. 1993; Onoa et al. 2003) and that it folds faster (Woodson 2000). Determining how the secondary structure elements interact with each other to yield the native structure is the next logical step in the ab initio prediction of RNA tertiary structure. Moreover, introduction of reliable constraints on the conformation space available to the RNA improves the three-dimensional models that can be generated by structure modeling programs such as MC-SYM (Major et al. 1991) or by molecular mechanics approaches, e.g., YAMMP (Tan et al. 2006). A simple abstraction of the tertiary structure is the set of helical axes of the molecule and their orientation with respect to each other. In this view, the loops connect these helices to each other and constrain the configuration space these helical axes can explore. In this context, the most important secondary structure elements are bulge loops, size asymmetric internal loops, and multibranch loops (MBLs). The effect of these secondary structure elements on the overall conformation of the RNA has been studied by experimental techniques such as transient electric birefringence (TEB) (Zacharias and Hagerman 1995; Hagerman 1996), native gel electrophoresis (Lilley 2004), and fluorescence resonance energy transfer (FRET) (Lilley 2004). Bulge loops and size asymmetric internal loops introduce bends in the axis of the helix, thereby spatially constraining the molecule's tertiary structure. MBLs, also called helical junctions, join three or more helical branches and constrain the conformation space available to the molecule. The branching pattern of helices in MBLs is partly determined by base-pair–base-pair stacks between the terminal base pairs of any two of its branches. These interactions, called coaxial stacking interactions (Newcomb and Gellman 1994; Friedman and Honig 1995; Gellman et al. 1996; Norberg and Nilsson 1996; Sponer et al. 1997; Florian et al. 1999), provide thermodynamic stability to the molecule (Walter et al. 1994a,b; Kim et al. 1996), and their incorporation in the secondary structure prediction algorithms that minimize free energy has been shown to improve prediction accuracy (Walter et al. 1994a). Optical melting experiments on small RNA models suggested that there is no favorable free energy increment due to coaxial stacking where more than one noncanonical base pair needs to mediate the interactions between the two canonical base pairs on the termini (Fig. 1). The stacks where the two helical branches are directly stacked on each other without any intervening noncanonical pairs are called flush stacks (“F” in the figures and tables). Coaxial stacks with one or more noncanonical pairs intercalated between the terminal base pairs of stacking helices are called mismatch-mediated (MM) stacks (Kim et al. 1996). In this context the term “mismatch” is used to refer to noncanonical base pairs that occur in the native structure. These are illustrated in Figure 1. Here, the hypothesis is that the coaxial stacking arrangement of helices in an RNA MBL can be predicted by free energy minimization using the nearest-neighbor parameters (Walter et al. 1994a; Kim et al. 1996; Mathews et al. 1999) for a given secondary structure. The free energy change of F stacks is approximated with the stacking parameter for canonical helices as though the helix were uninterrupted. For MM coaxial stacking, there are two stacks that contribute to the free energy change, one where the backbone is interrupted and one where the backbone is continuous (Fig. 2). The free energy change of the stack for the interrupted backbone is approximated as 2.1 kcal/mol, regardless of the sequence. For the stack where the backbone is continuous, the sequence-dependent terminal mismatch parameter is used to approximate the free energy change (Walter et al. 1994a; Mathews et al. 1999). The predictions based on these free energy estimates are tested against high-resolution RNA structures that are available in the Nucleic Acid Database (Berman et al. 2002). A set of criteria was developed for identifying coaxial stacks in crystal structures. The examination of the hypothesis has only recently become possible because of the increase in the number of crystal structures determined for long RNA sequences. Although crystal structures are prone to packing artifacts, the overall structures of molecules are considered representative of functional, in vivo structures. In the absence of high-resolution in vivo structures of large RNA molecules, crystal structures are the natural gold standard to which predictions can be compared. The accuracy for flush stack prediction in the lowest free energy configuration is considerably higher than that of MM stack prediction. A partition function calculation can be used to predict the probability of coaxial stacking and probable coaxial stacks have higher positive predictive value (PPV). No accuracy trend is observed based on either size or complexity of the MBLs. | |||||||||
RESULTS AND DISCUSSION Database of crystal structures The 31 structures analyzed in the current work include tRNAs, ribozymes, rRNAs, and SRP RNA (see Materials and Methods and Table 1). 16S rRNAs and 23S rRNAs, however, by virtue of being the largest molecules whose crystal structures are known, constitute the majority of the data in terms of the number of MBLs and the number of predicted stacks (Fig. 3).The MBLs in the data set are of various types in both the number of helical branches emanating from the loop (ranging from three to eight) and the total size of the loop (from eight to 56 bases). Figure 4 shows the MBL distribution according to number of branches and total loop size. Accuracy of prediction by free energy minimization The simplest hypothesis is that the stacking configuration with the lowest free energy exists in the native structure. Figure 5 shows the accuracy of stacking prediction where the stacks from the lowest free energy configuration are predicted to exist in the crystal structure. Accuracy is measured with two parameters, PPV and sensitivity. These are defined according to
Prediction accuracy is not a function of multibranch loop complexity Flush coaxial stacks are predicted more accurately than are MM coaxial stacks (Fig. 5), apparently because helices separated by unpaired bases provide many more opportunities for alternate stacking configurations compared with helices that do not have any unpaired bases separating them. This suggests that prediction accuracy might be lower for the larger MBLs and/or the MBLs with a higher number of branches because these MBLs are likely to have many possible alternate stacking configurations. Figure 7a shows the PPV and sensitivity distribution according to the number of branches emanating from the MBL, and Figure 7b shows the same distribution according to the total size of the MBL (i.e., the number of bases that form the MBL where each base in a base pair is counted separately). Contrary to expectation, there is no clear accuracy trend apparent from either of the graphs. However, there is a considerable increase in prediction accuracy for MBLs of “intermediate” sizes. For both four-way MBLs and MBLs consisting of 16–19 bases, the PPV and sensitivity are higher than the overall averages. Both these categories consist of a considerable percentage of the total number of MBLs in the data set (40.0% for four-way MBLs and 29.2% for 16–19 base-long MBLs).Another method of examining the complexity of a MBL is based on the fact that the free energy of a MBL, as predicted by the nearest-neighbor model, is independent of the number and identity of any unpaired bases in the loop that are flanked by unpaired bases on both sides; i.e., the 3′ and 5′ neighbors on their strand are not part of any helical branch. This is because unpaired bases contribute to the stacking free energy only if they can stack directly on the end of a helix, either alone or as part of a mismatch. This means that the free energy calculation for stacking configurations only depends on the sequence identities of the helix termini and their immediate neighboring unpaired bases. Hence, an “element” of the MBL is defined as either a branch (which here means the terminal base pair of a helix) or an unpaired base that has at least one helical branch as its immediate neighbor. The ratio e/b, defined as Figure 9 shows coaxial stacking in one of the three MBLs with four helical branches and e/b = 1 that were found in the database. Two of these occur in the two available crystal structures of RNase P RNA. The third is from the large subunit ribosomal RNA (LSU RNA) from Haloarcula marismortui. In all three of these MBLs, the branches form two pairs of stacking helices. As Figure 8 shows, the stacking pairs in each of these MBLs are correctly predicted by free energy minimization. Interestingly, the recently published Escherichia coli complete ribosome structures (Nucleic Acid Database [NDB] IDs RR0123 to RR0126, not included as a test structure in this study because the resolution of 3.5 Å is above the chosen cutoff) also have four-element MBLs in the same position as H. marismortui LSU RNA, whereas Deinococcus radiodurans LSU RNA does not have a four-element MBL at that position. However, there is a homologous four-way branching MBL in D. radiodurans that has the same configuration of stacking helices, albeit the coaxial stacks are mediated by mismatches. Apparently, F coaxial stacks and MM coaxial stacks can serve homologous functions. Figure 8 also shows that 20.8% of the MBLs in the database have e/b = 3. As mentioned before, free energy minimization using the current nearest-neighbor model cannot predict any coaxial stacking in this case. Coaxial stacking does, however, occur in some of these MBLs. In total, six coaxial stacks, 5.6% of the total number of coaxial stacks in the database, occur in these MBLs. Most of these stacks are mediated by more than one mismatch. Accuracy of probable coaxial stacks as predicted with partition function One drawback of considering only the lowest free energy (LFE) conformation is that it does not take into account conformations having near-optimal free energy. For example, a structure with a helix having two possible competing coaxial stacks is less likely to be correctly predicted in the case where the difference in the free energy contribution of the two stacks is very small compared with the case where one of the possible stacks is significantly more stable than the second stack. This is because the structure is in equilibrium between the two conformations (with the crystal structure capturing one of the possible conformations), tertiary contacts define a definite configuration, or the free energy resolution resulting from the current nearest-neighbor parameters is inadequate to distinguish between the sequences under consideration.This problem is solved by calculating the probability of each coaxial stack in the LFE configuration using a partition function algorithm as described in Materials and Methods. Coaxial stacks in the LFE configuration having higher calculated probability are assumed to have higher likelihood of being correctly predicted than those with lower probability. A probability threshold can be chosen for prediction of coaxial stacking. Figure 10a shows the effect of changing the prediction probability threshold on PPV, and Figure 10b shows the corresponding result for sensitivity. As higher probability thresholds are used, fewer coaxial stacks are predicted (albeit with higher confidence for each of them). Therefore, PPV is improved at the cost of sensitivity. This is why sensitivity decreases monotonously in Figure 10b. However, it is clear from the two figures that the sensitivity is not significantly affected until about P = 0.7; there is a sharp decrease at higher probability thresholds. The sensitivity at P = 0.7 is still reasonable (51.9%), which suggests that it should be taken as the probability threshold for the purpose of prediction. Figure 11 shows the accuracy results where only the stacks that have a calculated probability >0.7 are predicted. The overall PPV is 66.7% at a sensitivity of 51.9%. Again, flush stacks are predicted with greater accuracy (74.0% PPV at a sensitivity of 66.1%) compared with MM stacks (55.9% PPV, 36.5% sensitivity). It should be noted that a significant number of stacks (51.9%) that have been found in the current database of crystal structures are flush stacks. Hence, even if only flush stacks in a molecule are predicted, these constraints should help improve RNA tertiary structure modeling. The results are also significant because this level of accuracy is obtained in spite of the presence of tertiary contacts within the molecules and with proteins present in many of the structures. That a significant number of stacks can be predicted correctly based only on the local sequence information strengthens the case of secondary structure prediction as a step toward ab initio prediction of RNA tertiary structure. Comparison of predictions for large subunit rRNAs For H. marismortui 23S rRNA, 12 existing stacks are not predicted, i.e., there are 12 false negatives, but only two out of these are considered by the nearest-neighbor model-based calculations. The other existing coaxial stacks are between helices that are separated by more than one unpaired nucleotide or are not adjacent helices. This means that only these two stacks are incorrectly rejected in favor of another stacking configuration of the corresponding MBL. There are also 12 false negatives for D. radiodurans, which includes five coaxial stacks that are considered possible by nearest-neighbor calculations. Interestingly, seven of these false-negative coxial stacks are common false negatives among the two structures. Similarly, seven of the false-positive coaxial stacks are common among the two structures, out of a total of 15 false positives in H. marismortui and 22 false positives in D. radiodurans. These stacks can be identified using the information in the Supplemental Material.Comparison with the Lescoute–Westhof method In a recently published work, Lescoute and Westhof (2006) presented their analysis of the topology of three-way MBLs in RNA crystal structures. They presented a set of empirical rules for predicting coaxial stacking orientations in three-way MBLs. The advantage of their method is that it can make predictions for even the MBLs with e/b = 3, but the free energy minimization method is more easily automated and applies to MBLs of any size.The Lescoute and Westhof analysis is significantly different from that presented here. In their work, a coaxial stack is assumed to exist in all the three-way MBLs. In many cases, they count coaxial stacks that would not have been considered coaxial stacks in this work. For example, P4-P6 domain of Group I ribozyme in Tetrahymena thermophila (Cate et al. 1996) has a three-way MBL in the P5abc subdomain in which P5a and P5b helices are not considered to be coaxially stacked in this work. That is because the P5a and P5b helices, when viewed in the context of the global structure, share a helical axis. However, when viewed at the level of the MBL, there is no clear stacking from one helix to the next. Lescoute and Westhof used their rules, comparative analysis, and structural insights to predict the stacking in six MBLs that do not yet have high-resolution structures. Except for one case, the U4-U6 spliceosomal junction where e/b = 3, free energy minimization can also be used to make predictions for the same sequences. A summary of the comparison follows:
Expansion to multiple sequences In this study, predictions of coaxial stacking are made for a single secondary structure using the nearest-neighbor rules extrapolated from small model systems. The method of Lescoute and Westhof (2006) provides rules for predicting coaxial stacking that are extrapolated from comparative sequence analysis. Future work in predicting coaxial stacking for structures could take advantage of both free energy minimization and comparative analysis to more accurately predict coaxial stacking. Comparative analysis can be used to determine the set of coaxial stacks that are feasible for all sequences in an alignment. Free energy minimization can then be used to judge the most stable conformation for cases where several possible coaxial stacks could exist.Summary This work demonstrates that, in many cases, simple nearest-neighbor rules can predict the coaxial stacking configuration of helical branches. The local MBL stability can therefore be used to predict the overall topology of a branching structure. For the complete structure, it is unlikely that the stability afforded by coaxial stacking is the only force that is driving the formation of the tertiary structure topology because there are other specific structures that also contribute to the stability of tertiary structure, such as the tetraloop–tetraloop receptor motif (Cate et al. 1996; Hodak et al. 2005; Downey et al. 2006). It is more likely that the requirement for a specific topology places constraints on the sequence evolution in the junction, such that coaxial stacking evolves to reinforce the overall topology. These results also reinforce the fact that coaxial stacking is an important consideration when designing RNA nanostructures (Chworos et al. 2004).To conclude, a partition function calculation can be used to predict coaxial stacking in RNA MBLs with a PPV of 66.7% at a sensitivity of 51.9%. The PPV and sensitivity values are higher for flush coaxial stack prediction compared with mismatch-mediated coaxial stacks. There is no clear accuracy trend based on the complexity of MBLs as measured by number of branches, size of the loop, and e/b ratio. Predictions have also been made for stacking configurations in three MBLs from two RNA three-way loops whose structure is yet to be determined. | |||||||||
MATERIALS AND METHODS Data set The 31 crystal structures that formed the data set for this study are listed in Table 1 along with their NDB codes. These 31 structures constitute all the known RNA structures that have a structural resolution better than 3.3 Å and contain at least one MBL after opening of all existing pseudoknots.The accuracy estimates in this work depend directly on the reliability of the available crystal structures at helical junctions. The quality of assignment to the electron density can vary from region to region within a given crystal structure, and so, the overall resolution and R-value do not necessarily reflect the local quality of a crystal structure for each region. Therefore, electron density was examined in the MBL regions of 16 of the 31 structures in the data set (including the crystal structures for H. marismortui 23S rRNA and T. thermophila 16S rRNA). These electron density maps were examined to determine whether the electron density is poorer in the regions of helical terminal base pairs that were used to determine the accuracy of our approach. For this purpose, the electron density maps were either obtained from the Uppsala Electron Density Server (Kleywegt et al. 2004) and visualized with VMD (Humphrey et al. 1996) or generated using the CCP4 Program Suite (Collaborative Computational Project 1994) with structure factors available from the Protein Data Bank (PDB) (Berman et al. 2002) and visualized using the Coot toolkit (Emsley and Cowtan 2004). The structures for which electron densities were examined and the corresponding methods employed are indicated in the Supplemental Material in Supplemental Table S1. The structure of the relevant base pairs apparently fit the electron density well. This increases our confidence in the validity of the results. Secondary structure Most of the structures in the NDB have accompanying RNAML (Waugh et al. 2002) files that contain the information about the secondary structure of the corresponding structure. These files were parsed to determine the secondary structures of the sequences in the data set. Only the cis-WC/WC base-pair configurations (Leontis and Westhof 2001) are taken as canonical base pairs. For the structures that do not have the corresponding RNAML file(s) in the NDB, the find_pair utility of the program 3DNA (Djikeng et al. 2003) was used to determine the set of canonical base pairs from the crystal structure coordinates. These structures are indicated in Table 1.Pseudoknots Pseudoknots were opened in all structures. First, all pseudoknots were identified by scanning for helices that satisfy the pseudoknot criterion:MBL definition An MBL is defined as a loop that is closed by at least three base pairs. It should be noted that isolated base pairs, i.e., helices with only a single base pair, are counted as MBL branches when they close MBLs. This is not ideal because it compromises the intended use of this work in prediction of tertiary structure from predicted secondary structures as many currently used secondary structure prediction algorithms do not predict isolated base pairs (Zuker 2003; Mathews et al. 2004). However, this definition of branches and helices is required because there are some cases in which large MBLs are divided by one or more isolated base pairs within the loop, e.g., the MBL in Figure 12. The interpretation of coaxial stacking configurations is unclear if only nonisolated base pairs are counted as branches.Modified nucleotides In five tRNA structures, modified bases occur at the terminal base pairs of helices or at the immediately adjacent positions. These modified bases were treated as the corresponding unmodified canonical bases for the purpose of free energy calculations. Similarly, the bromo-uridines at these positions in HCV IRES RNA were treated as uridines as was a thymine in the hammerhead ribozyme.Stacking prediction A dynamic programming algorithm was used to determine the lowest free energy stacking configuration, the total partition function, and the restricted partition function for the stacks predicted in the lowest free energy configuration for each of the MBLs present in the structure.The total partition function is calculated with The restricted partition function corresponding to a stack, j, is calculated according to The probability of existence of stack j, Pj, is Three-dimensional coordinate processing and coaxial stack discovery Previously, criteria have been developed to identify base stacking (Gabb et al. 1996; Burkard et al. 1999). For this work, these criteria were extended to the identification of base-pair stacks. Two helices are considered coaxially stacked if their terminal base pairs closing the MBL satisfy three criteria:
To find coaxial stacks between helices separated by more than a single mismatch, a revised threshold is defined for conditions 1 and 2 above to consider the base pairs as potentially stacked pending “base-overlap verification.” These thresholds are
These potentially stacked base pairs are then tested using “base-overlap verification,” which means searching for a cascade of stacked bases that lead from at least one of the bases of one base pair to at least one of the bases of the other base pair. If such a cascade is found, the base pairs are considered to be coaxially stacked. Figure 15 shows one example where base-stack verification returned a positive result for coaxial stacking between two helices. Although these coaxial stacks separated by more than a single noncanonical pair cannot be predicted using the nearest-neighbor model, it is important that they be found for the calculation of coaxial stack prediction sensitivity. The Supplemental Material contains a table of all coaxial stacks determined in the database of the available crystal structures using these criteria. | |||||||||
SUPPLEMENTAL DATA All Supplemental Material can be found at http://rna.urmc.rochester.edu/publications.html. | |||||||||
ACKNOWLEDGMENTS We thank Zhi John Lu, Celeste MacElrevey, and Joseph E. Wedekind for helpful discussions and two anonymous reviewers for helpful comments. D.H.M. is an Alfred P. Sloan Research Fellow. This work was supported by National Institutes of Health Grant R01GM076485 to D.H.M. | |||||||||
Footnotes Article published online ahead of print. Article and publication date are at http://www.rnajournal.org/cgi/doi/10.1261/rna.305307. | |||||||||
REFERENCES
| |||||||||