Computational Discovery of RNA Elements with Functional Structures in Genomic Sequences


Functional RNA elements (FRE) in post-transcriptional regulation of gene expression are often correlated with distinct RNA structures since FRE must fold into the specific conformations in which cell factors can recognize and interact with them. Recent computational studies indicate that the structures of FREs are both significantly more ordered and thermodynamically stable than anticipated at random. This is because of evolutionary constraints and intrinsic structural properties. Various computational tools for discovering well-ordered RNA structures and their structural motifs have been developed and a number of functional structured RNA elements have been determined. Here, we summarize recent efforts in the discovery of structured FRE within complex genomes by computation.

Introduction of RNA Folding

Complete genomic sequence data are accumulating at an unprecedented pace. Nucleic acids sequences of pathogenic bacteria and human genomes provide the fundamental information useful for us to explore biological properties, such as regulation of gene expression.Although RNA is transcribed as a single-stranded, almost every RNA molecule has structure that includes various double helical, base paired regions formed by fold-back in the correct antiparallel orientation between complementary segments. In addition to Watson-Crick A:U and G:C base pairs, wobble G:U and other non-canonical base pairs, such as A:G, G:G and A:A also contribute to the structural constraints and they are formed especially in the distinct loop region of secondary and tertiary structures of RNA molecules.

Recent advances in studies of non-coding RNAs (ncRNAs) and RNA interference (RNAi) indicate that RNA is more than a messenger between genome and protein. The ncRNAs are involved in various regulatory mechanisms of gene expression at multiple levels. The functional structured RNAs (FSRs) can perform the regulatory activity. Among them, small miRNAs, about 22 nucleotides (nt), can control gene expression by binding to complementary sites in the 3' UTR of target mRNAs. Intriguingly, about a hundred distinct miRNA genes have been determined in Caenorhabditis elegans and estimates for the number of miRNAs may range about 0.5-1% of total protein-coding genes. It is conceivable that there are a large number of various FSRs, including various RNA regulatory elements in each genome. The FSR molecules are expected to be characterized by various structural motifs represented by specific combinations of base pairings and conserved nt in the loop regions.

RNA Higher-ordered Structure

A complete understanding of a FSR requires a knowledge of its 3-D structure. The determination of its RNA 3-D structure is a limiting step in the study of RNA structure-function relationships because it is very difficult to crystallize and/or get nuclear magnetic resonance spectrum data for large RNA molecules. Currently, a reliable prediction of RNA secondary and tertiary structure from its primary sequence is mainly derived by phylogenetic comparisons with additional enzyme probing and the sensitivity of nucleotides to chemical modification. The phylogenetic method has been demonstrated by successful predictions of RNA structures for tRNAs, 5S and 16S rRNAs, RNase P RNAs, small nuclear RNAs (snRNAs) and other RNAs, such as group I intron. Although dynamic programming and energy minimization methods for predicting RNA structure are not as successful as phylogenetic comparative methods, they can be performed fast and automatically by computer. With improvements of the dynamic programming algorithm and parameters for the free energy of formation of RNA structural elements, about 70% of known base-pairs are predicted on average in optimized structures by the widespread MFOLD program. The computed RNA secondary structures from MFOLD are often taken as working models that are further refined by multiple methods including experimental methods or phylogenetic comparisons. Moreover, computational methods for analysis and detection of FSRs have made a great progress recently. A number of tools such as tRNAscan-SE, RNAMOT, PatScan, RNAMotif, ERPIN, and Segfold, EDscan, SigED, StemEd, SigStem and HomoStRscan . Here, we emphasize our efforts in the discovery of FSRs in genomic sequences by computation.

FSRs are Uniquely Folded

RNA structure comparison and analysis from numbers of laboratories show that some specific combinations of base pairings and some conserved loop sequences in stem-loops are more abundant in FSRs. For example, analysis of a large number of ribosomal RNAs, such as 16S and 23S rRNAs, identified three classes of 4-nt terminal loops, or tetraloops of GNRA, UNCG and CUYG. In addition to rRNAs, GNRA tetraloops are also frequently found in self-splicing ribozymes and RNase P RNAs. The specific base-pairing and stacking implicated by non-canonical base-pairings G:A, A:G and R:R has also been found in loop E of eubacteria 5S rRNAs. The solution and crystal structures of E. coli 5S rRNA segments including loop E display a well-ordered structure characterized by the major groove narrowing and larger cross-strand distances in the central portion of loop E in which 4 out of 9 base-pairings are involved in non-canonical base-pairings It includes 2 G:A base-pairings as well as G:G and A:G. Also, the distinct base-pairing and stacking participated by non-canonical base-pairings and/or bulges are a common structural motif found in rRNAs, ribozymes and other various FSRs, such as IRE and HIV-1 regulatory elements RRE and TAR. On the other hand, phylogenetic conservation in FSRs is more impressive than that observed in the structural motifs of proteins. Statistical analysis indicates that there are about 15 invariant nts in a 76-nt tRNA molecule. Based on the observation of the structural features it was suggested that FSRs possess well-ordered conformations that are both thermodynamically stable and uniquely folded.

Computational Experiments of Random Sequences

To test the hypothesis, computational experiments were designed to explore the evolutionary constraints of the conformation folded in FSRs. Schultes et al. computed three quantitative measures to estimate the stability and uniqueness of RNA secondary structures based on the mean length of stems and total number of base pairs in the predicted structure from RNA folding (using mfold and VIENNA). The comparison of three scores computed from various FSRs and their randomly shuffled sequences indicates that the well-ordered conformations found in the most of FSRs are unlikely to arise from evolutionary modification only. Their results show that the well-ordered conformation of FSRs is expected to be rare in the conformation space formed from a population of the related random sequences.

It is evident that the evaluation of the structural uniqueness is expected to be more precisely if we can inspect the structure morphology in detail in the comparison of the two RNA structures. We use a novel algorithm (rna_match) to compare each base-pair in helical duplexes and each nucleotide in single strands between the two structures in detail. As a result, we can compute the maximal similarity score (MSS) between the two RNA structures. Using the quantitative measure MSS, the uniqueness of an arbitrary RNA structure can be estimated by evaluating the difference between the average MSS computed from a natural RNA and its related, randomly shuffled sequences and those MSS computed from random versus random sequences [45]. In the comparison, a standardized z-score $Stscr$ is introduced and defined as Stscr = (RR - NR)/std, where NR is the sample mean of computed MSS from the real RNA structure and a set of structures predicted from randomly shuffled sequences, RR and std are the sample mean and sample standarddeviation of those MSS computed from the previous random structures versus the additional n random structures with the same composition and size as the natural RNA. Thus, the greater the Stscr the statistically more unique is the well-ordered structure of the natural RNA.

In a computational test experiment on 100 tRNAs, the computed Stscr were high and the Stscr values averaged to 2.94 indicating that the structural conformations of the natural tRNAs was significantly different from those of corresponding random structures, the uniqueness of the common cloverleaf structure was statistically significant. Also, the random tests for other FSRs including RNase P RNAs, TAR and RRE of HIV-1, IRES of HCV and ribozyme showed that the FSRs had well-ordered conformations that were unlikely to occur by chance. These tests strongly support the hypothesis that the well-ordered structures of FSRs are both thermodynamically stable and uniquely folded. It also indicates that the measurement of thermodynamic stability alone is not enough for us to characterize the structural features folded in the FSRs. FSRs consist of a well-ordered folding sequence (WFS).

Computational Strategy and Tactics in Finding FSRs

In addition to the development of efficient algorithms for predicting RNA high-ordered structure from the primary sequence, the another major goal of RNA structure computation is to discover potential FSRs in RNA sequences, and to correlate them with known experimental properties and to suggest candidate sites for further experimental study. Currently, there is no effective computational approach to detect FSRs that lack sequence or structure homology to one of the known FSRs. In general, computational prediction of potential FSRs in genomic sequences is further verified by experimental testing of expression levels, functional assay by deletion or mutagenesis and structural analysis. Currently, our computational strategy is often to delimit the potential FSR in a RNA sequence by searching for WFSs or unusual folding regions (UFRs). From the WFSs detected by a robust statistical inference we then explore the common structure features in homologous RNAs. Once the WFS is found to be both significantly stable and phylogenetically conserved it can be selected as a candidate for potentially FSR element. The homologous FSRs can be searched from sequence databases by pattern search tools based on both the primary sequence and the high-ordered structure of the experimentally verified FSR.

WFSs can be characterized by the thermodynamic stability and distinct conformation of the structure folded in local segments within a genomic sequence. Previously, WFSs were often searched by computer programs Sigstb and Segfold. Sigstb and Segfold are used to explore a mRNA sequence by choosing successive fragments and comparing the computed free energy of the actual sequence to a number of randomly shuffled sequences of the same size and composition. The highly stable or unstable regions are statistically inferred and termed unusual folding regions (UFRs). It has reported that the detected UFRs in HIV-1, HIV-2, and other related viruses are coincident with the RRE and TAR.

In two recently developed computational tools of EDscan and StemEd, we used a quantitative measure Ediff to evaluate the quality of an arbitrary WFS. The measure Ediff(Si) of a given RNA segment (Si) is defined as the difference of free energies between the folded global minimal energy structure (E(Si) and its corresponding optimal restrained structure (ORS) in which all the previous base pairings in the global minimal structure are forbidden (Ef(Si)). We have

Ediff(Si) = Ef(Si) - E(Si)

Zscre(Si) = (Ediff(Si) - Ediff(w))/std(w)

Ediff(w) and std(w) are the sample mean and standard eviation, respectively, of the Ediff scores computed by sliding a fixed-length window in steps of a few nt from 5' to 3' along a RNA sequence. It is clear that the score Zscre(Si) is a z-score, a standardized measure of Ediff(Si) of the segment Si. We expect that the greater the Zscre(Si) of the segment Si, the more well-ordered is the folded RNA fragment Si.

Statistical Inference for Extreme WFSs

To estimate the statistical extremes of WFSs in a long sequence, we need a good statistical model to describe the distribution of Zscre in a large sample. To well estimate the statistical significance of Zscre for a given RNA segment Si we need to know what is the general behavior of Ediff(RSi) of a set of random sequences, RSi,1, RSi,2, ..., RSi,m that are made by randomly shuffling the local segment Si rather than the complete sequence. Statistical analysis indicates that the Zscre data computed from a set of randomly shuffled RNA sequences show a normal distribution with sample mean, m = 0, and sample standard deviation, std = 1.0.

In the two novel methods of SigED and SigStem, a standard z-score, SigZscre(Si) was employed to evaluate the statistical significance of the energy difference Ediff(Si) computed from the segment Si

SigZscre(Si) = (Ediff(Si) - Ediff(RSi))/ std(RSi)

where Ediff(RSi) and std(RSi) are the sample mean and standard deviation of Ediff (RSi,1), Ediff(RSi,2, ..., Ediff(RSi,m) are the m values of energy difference computed from the m randomly shuffled sequences, RSi,1, RSi,2, ..., RSi,m. It is important to note that randomizations are done by shuffling so that the same base compositions and sizes as the natural fragment Si are maintained.

Prediction of Common RNA Secondary Structures

A number of computational methods for predicting common RNA secondary structure for a set of related RNA sequences have been proposed. Most of these methods start with a set of predicted RNA structures computed from their related RNA sequences by a thermodynamic dynamic programming algorithm and their multiple sequence alignment. The predicted base-pairings are gradually refined by the analysis of sequence covariation or mutual information so that a common RNA structure for the set of RNAs is emerged.

Our recently developed program RNAGA is different from other approaches. The method RNAGA employs a genetic algorithm (GA) to search for a common secondary structure without the need for pre-aligned homologous RNA sequences. One of the remarkable features of RNAGA is that RNA secondary structures are automatically optimized by not only the free energy of the formation of the structure but also the structural similarity among homologous sequences [40]. The program is a three-step procedure. In the first stage, a GA is used to generate a population of RNA secondary structures that satisfy certain conditions of thermodynamic stability. In this step, the free energy of a folded structure is taken as a fitness criterion. Secondly, the structural similarity between any two structures within the population of RNA secondary structures is computed. With the quantitative measure of structural similarity as the fitness criterion, a GA is then used to improve the structural similarity among homologous RNAs for the structures in the population of a sequence. Finally, those structures that satisfy certain conditions of thermodynamic stability and structural conservation are selected as predicted common structures for a set of homologous RNAs. As a result, RNAGA solves the alignment problem of multiple sequences and the folding problem of common RNA structures simultaneously. The program also ranks the predicted common structures based on the structural similarity score in descending order. In a test including a set of 20 tRNA sequences, 25 5S rRNAs, 7 HIV-1 RREs and 10 RRE of HIV-2 and SIV, fairly convincing common secondary structures were obtained by RNAGA in the top 10 ranked solutions.

Database Search for RNA Structural Motifs

Over the last decade the computational search methods for distinct RNA structural motifs have made great progress. A number of database search tools have been developed and have practical applications to the search for known FSRs and their homologues. In general, these pattern search tools can be divided into two groups. Tools in the first group are designed to search for a specific FSR, such as tRNAs. Among them, the method tRNAscan-SE is very efficient and successful in finding tRNA genes in complete genomes. The methods in the second group are designed and optimized to find general RNA structural motifs. Most of these algorithms provide a descriptor that can describe the RNA structural elements of known FSRs and a pattern search algorithm to match and score the patterns found in the genomic sequence. The recently developed algorithm, RNAMotif is more powerful and efficient than others. A significant improvement in RNAMotif is that its descriptor can specify any type of base-base interaction and RNA structural element. Also RNAMotif provides a user controlled scoring system that can be used to add capabilities in the pattern matching. The main shortcoming of pattern-based search tools is a general inability to incorporate information of sequence and structural feature in detail. Holbrook and colleagues have proposed a general computational approach to identify FSRs in genomic sequences using neural network simulations.

We recently developed a novel algorithm, HomoStRscan, of searching for homologous FSRs by scanning a genomic sequence. HomoStRscan differs from other currently used approaches in considering each base and base pair in the query RNA. Among them, any type of base-base interaction is allowable. The algorithm finds the most similar structure to match the query structure in an arbitrary segment in the target sequence. The size of the arbitrary segment ranges near the length of the query RNA, and can be flexibly controlled by users. Simultaneously, the MSS between the query RNA and each computed matching structure from the target sequence is calculated. The homologous RNAs are then predicted by robust statistical inference from the MSS distribution computed by moving a window along the target sequence. Thus, HomoStRscan can be used to search in the genomic sequence for any RNA motif corresponding to an established secondary structure. Computational test experiments for several complete bacterial genomes proved to be very effective in finding ncRNAs, such as tRNAs and 5S rRNAs.

HomoStRscan provides a flexible, robust and fine search tool for any homologous structural RNAs. The program is based on the structure matching, structure comparison and dynamic programming algorithm. It is implemented in the C programming language and running on Unix.