Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Stable stem enabled Shannon entropies distinguish non-coding RNAs from random backgrounds

Stable stem enabled Shannon entropies distinguish non-coding RNAs from random backgrounds Background: The computational identification of RNAs in genomic sequences requires the identification of signals of RNA sequences. Shannon base pairing entropy is an indicator for RNA secondary structure fold certainty in detection of structural, non-coding RNAs (ncRNAs). Under the Boltzmann ensemble of secondary structures, the probability of a base pair is estimated from its frequency across all the alternative equilibrium structures. However, such an entropy has yet to deliver the desired performance for distinguishing ncRNAs from random sequences. Developing novel methods to improve the entropy measure performance may result in more effective ncRNA gene finding based on structure detection. Results: This paper shows that the measuring performance of base pairing entropy can be significantly improved with a constrained secondary structure ensemble in which only canonical base pairs are assumed to occur in energetically stable stems in a fold. This constraint actually reduces the space of the secondary structure and may lower the probabilities of base pairs unfavorable to the native fold. Indeed, base pairing entropies computed with this constrained model demonstrate substantially narrowed gaps of Z-scores between ncRNAs, as well as drastic increases in the Z-score for all 13 tested ncRNA sets, compared to shuffled sequences. Conclusions: These results suggest the viability of developing effective structure-based ncRNA gene finding methods by investigating secondary structure ensembles of ncRNAs. Background program is its secondary structure prediction mechan- Statistical signals in primary sequences for non-coding ism, for example, based on computing the minimum RNA (ncRNA) genes have been evasive [1-3]. Because free energy for the query sequence under some thermo- single strand RNA folds into a structure, the most dynamic energy model [9-12]. The hypothesis is that the exploitable feature for structural ncRNA gene finding ncRNAs’ secondary structure is thermodynamically has been the secondary structure [4-6]. The possibility stable. Nonetheless, stability measures have not per- formed as well as one might hope [13]; there is evidence that folded secondary structure may lead to successful ab initio ncRNA gene prediction methods has energized that the measures may not be effective on all categories leading groups to independently develop structure-based of ncRNAs [14]. ncRNA gene finding methods [7,8]. The core of such a A predicted secondary structure can be characterized for its fold certainty, using the Shannon base pairing entropy [15,16]. The entropy ∑p log p of base pair- * Correspondence: [email protected]; [email protected] i,j i,j Department of Computer Science, University of Georgia, Athens, Georgia ings between all bases i and j can be calculated based 30602, USA on the partition function for the Boltzmann secondary Full list of author information is available at the end of the article © 2012 Wang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 2 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 structure ensemble, which is the space of all alternative performance on these ncRNAs with the performance secondary structures of a given sequence; the probability achieved by the software NUPACK [23] and RNAfold p is calculated as the total of Boltzmann factors over [12,29] developed under the Boltzmann standard sec- i,j all equilibrium alternative structures that contain the ondary structure ensemble [17,22]. base pair (i, j) [17]. As an uncertainty measure, the base pairing Shannon entropy is maximized when base pair- Data preparation ing probabilities are uniformly distributed. A structural We downloaded the 13 ncRNA datasets previously RNAsequenceisassumed to havealow base pairing investigated in Table 1 of [18]. They are of diverse func- Shannon entropy, since the distribution of its base pair- tions, including pre-cursor microRNAs, group I and II ing probabilities is far from uniform. The entropy mea- introns, RNase P and MRP, bacterial and eukaryotic sig- sure has been scrutinized with real ncRNA data nal recognition particle (SRP), ribosomal RNAs, small revealing a strong correlation between entropy and free nuclear spliceosomal RNAs, riboswitches, tmRNAs, reg- energy [18,19]. However, there has been mixed success ulatory RNAs, tRNAs, telomerase RNAs, small nucleolar in discerning structural ncRNAs from their randomly RNAs, and Hammerhead ribozymes. shuffled counterparts. Both measures perform impress- The results from using these datasets were analyzed ively on precursor miRNAs but not as well on tRNAs with 6 different types of measures, including Z-score and some rRNAs [14,18]. and p-value of minimal free energy (MFE), and Shannon The diverse results of the entropy measuring on dif- base pairing entropy [18], in comparisons with random ferent ncRNAs suggest that the canonical RNA second- sequences. The six measures correlate to varying ary structure ensemble has yet to capture all ncRNAs degrees, hence using MFE Z-score and Shannon base structural characteristics. For example, a Boltzmann pairing entropy may be sufficient to cover the other ensemble enhanced with weighted equilibrium alterna- measures. However, these two measures, as the respec- tive structures has also resulted in higher accuracy in tive indicators for the fold stability and fold certainty of secondary structure prediction [19]. There is strong evi- ncRNA secondary structure, have varying performances dence that the thermodynamic energy model can on the 13 ncRNA datasets. improve its structure prediction accuracy by considering For our tests, we also generated random sequences as energy contributions in addition to those from the cano- control data. For every ncRNA sequence, we randomly nical free energy model [20,21]. Therefore, developing shuffled it to produce two sets of 100 random sequences ncRNA structure models that can more effectively each; one set was based upon single nucleotide shuffling, account for critical structural characteristics may the other was based upon di-nucleotide shuffling. In become necessary for accurate measurement of RNA addition, all ncRNA sequences containing nucleotides fold certainty. other than A, C, G, T, and U were removed for the rea- In this paper, we present work that computes Shannon son that NUPACK [23] doesn’t accept sequences con- base pairing entropies based on a constrained secondary taining wildcard symbols. structure model. The results show substantial improve- ments in the Z-score of base pairing Shannon entropies Shannon entropy distribution of random sequences on 13 ncRNA datasets [18] over the Z-score of entropies Two energy model based softwares, NUPACK (with the computed by existing software (e.g., NUPACK [23] and pseudoknot function turned off) and RNAfold, and our RNAfold [12,29]) with the canonical (Boltzmann) sec- program TRIPLE computed base pairing probabilities on ondarystructureensembleand theassociated partition ncRNA sequences and on random sequences. In parti- function [22]. Our limited constraint to the secondary cular, for every ncRNA sequence x and its associated structure space is to require only canonical base pairs to randomly shuffled sequence set S , the Shannon entro- occur in stable stems. The constrained secondary struc- pies of these sequences were computed. ture model is defined with a stochastic context-free A Kolmogorov-Smirnov test (KS test) [24] was applied grammar (SCFG) and entropies are computed with the to verify the normality of the entropy distributions from Inside and Outside algorithms. Our results suggest that all randomly shuffled sequence sets. The results show incorporating more constraints may further improve the that for99% of thesequencesetswefailtorejectthe effectiveness of the fold certainty measure, offering hypothesis that entropies are normally distributed with improved ab initio ncRNA gene finding. 95% confidence level. This indicates that we may use a Z-score to measure performance. Results We implemented the algorithm for Shannon base pair- Z-score scores and comparisons ing entropy calculation into a program named TRIPLE. For each ncRNA, the average and standard deviation of We tested it on ncRNA datasets and compared its Shannon entropies of the randomly shuffled sequences Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 3 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 were estimated. The Z-score of the Shannon entropy Q Table 1 Comparisons of TRIPLE and NUPACK by the percentages of sequences falling in each category of a Z- (x) of ncRNA sequence x is defined as follows: score range. μ(Q(S ))–Q(x) ncRNA Method Z ≥ 2.0 Z ≥ 1.5 Z ≥ 1.0 Z ≥ 0.5 Z(x)= (1) σ (Q(S )) Hh1 TRIPLE 26.67 40.00 53.33 73.33 NUPACK 0.00 0.00 20.00 53.33 where μ(Q(S )) and s(Q(S )) respectively denote the x x sno_guide TRIPLE 14.43 24.45 38.39 58.19 average and standard deviation of the Shannon entropies NUPACK 0.73 8.80 27.63 45.23 of the random sequences in set S .The Z-Scoremea- sn_splice TRIPLE 40.51 50.63 60.76 65.82 sures how well entropies may distinguish the real NUPACK 3.80 18.99 48.10 70.89 ncRNA sequence x from their corresponding randomly SRP TRIPLE 35.06 44.16 59.74 67.53 shuffled sequences in S . Figure 1 compares the averages NUPACK 3.90 36.36 72.73 85.71 of the Z-scores of Shannon base pairing entropies com- tRNA TRIPLE 29.56 51.33 70.97 86.02 puted by NUPACK, RNAfold, and TRIPLE on each of NUPACK 0.00 2.30 12.04 32.21 the 13 ncRNA datasets. It shows that TRIPLE signifi- intron TRIPLE 60.75 69.16 78.50 85.98 cantly improved the Z-scores over NUPACK and RNA- NUPACK 1.87 19.63 61.68 85.05 fold across all the 13 datasets. riboswitch TRIPLE 34.64 48.37 60.13 78.43 To examine how the Z-scores might have been NUPACK 1.96 18.95 45.75 69.28 improved by TRIPLE, we designated four thresholds for miRNA TRIPLE 81.48 88.89 94.07 97.04 Z-scores,which are2,1.5,1,and 0.5. Thepercentages NUPACK 0.00 12.59 68.15 97.78 of sequences of each dataset with Z-score greater than telomerase TRIPLE 29.41 35.29 41.18 58.82 or equal to the thresholds were computed. NUPACK 11.76 17.65 35.29 47.06 Table 1 shows details of the Z-score improvements RNase TRIPLE 50.70 70.42 81.69 92.25 over NUPACK when di-nucleotide shuffling was used. NUPACK 5.63 23.94 48.59 72.54 With a threshold 2 or 1.5, our method performed better regulatory TRIPLE 22.41 24.14 32.76 56.90 than NUPACK in all datasets. With the threshold 1 and NUPACK 1.72 3.45 18.97 51.72 0.5, our method improved upon NUPACK in 12 and 10 tmRNA TRIPLE 18.64 32.20 45.76 55.93 datasets, respectively. The results of TRIPLE and NUPACK 1.69 8.47 27.12 37.29 NUPACK using a single nucleotide random shuffling rRNA TRIPLE 36.16 50.62 70.87 83.06 are given in Table 2, which shows that our method also NUPACK 4.75 21.07 42.56 61.16 performs better than NUPACK in the majority of data- Random sequences were obtained with di-nucleotide shuffling of the real sets. In particular, TRIPLE performed better than ncRNA sequences. NUPACK in all datasets with threshold of 2; with threshold equal to 1.5 or 1, our method had better results than NUPACK in 12 datasets and in 9 datasets datasets with threshold equal to 2 and 1.5. With thresh- with threshold equal of 0.5. old of 1 and 0.5, TRIPLE wins 12 (tie 1) and 8 (tie 1) The results of RNAfold using the default setting are datasets, respectively. In Table 4, TRIPLE shows similar given in Table 3 and 4. Table 3 shows results on di- performanceonsinglenucleotideshufflingdatasets. It nucleotide shuffling datasets. TRIPLE works better in has better scores than RNAfold in 13, 13, 11, and 7 (tie the majority of datasets. It outperforms RNAfold in all 1) datasets with threshold of 2, 1.5, 1, and 0.5, Figure 1 Comparisons of averaged Z-score of Shannon base pairing entropies. Comparisons of averaged Z-score of Shannon base pairing entropies computed by NUPACK, RNAfold, and TRIPLE for each of the 13 ncRNA datasets downloaded from [18]. Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 4 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 Table 2 Comparisons of TRIPLE and NUPACK by the Table 3 Comparisons of TRIPLE and RNAfold by the percentages of sequences falling in each category of a Z- percentages of sequences falling in each category of a Z- score range. score range. ncRNA Method Z ≥2Z ≥ 1.5 Z ≥1Z ≥ 0.5 Dataset Method ≥2 (%) ≥1.5 (%) ≥1(%) ≥0.5 (%) Hh1 TRIPLE 6.67 33.33 53.33 73.33 Hh1 TRIPLE 26.67 40.00 53.33 73.33 NUPACK 0.00 0.00 20.00 60.00 RNAfold 0.00 0.00 20.00 53.33 sno_guide TRIPLE 14.91 25.43 41.10 57.95 sno_guide TRIPLE 14.43 24.45 38.39 58.19 NUPACK 0.98 9.05 28.85 45.72 RNAfold 1.71 7.82 23.96 43.03 sn_splice TRIPLE 31.65 43.04 56.96 65.82 sn_splice TRIPLE 40.51 50.63 60.76 65.82 NUPACK 5.06 26.58 51.90 69.62 RNAfold 6.33 21.52 54.43 69.62 SRP TRIPLE 32.47 45.45 55.84 68.83 SRP TRIPLE 35.06 44.16 59.74 67.53 NUPACK 3.90 37.66 72.73 87.01 RNAfold 5.19 24.68 58.44 71.43 tRNA TRIPLE 24.07 45.31 64.25 79.47 tRNA TRIPLE 29.56 51.33 70.97 86.02 NUPACK 0.00 2.12 14.69 33.45 RNAfold 0.18 4.25 24.78 47.96 intron TRIPLE 59.81 68.22 74.77 84.11 intron TRIPLE 60.75 69.16 78.50 85.98 NUPACK 1.87 22.43 66.36 85.98 RNAfold 2.80 17.76 60.75 84.11 riboswitch TRIPLE 32.03 44.44 56.86 71.90 riboswitch TRIPLE 34.64 48.37 60.13 78.43 NUPACK 1.96 21.57 46.41 69.28 RNAfold 0.65 17.65 47.06 70.59 miRNA TRIPLE 75.56 81.48 90.37 93.33 miRNA TRIPLE 81.48 88.89 94.07 97.04 NUPACK 0.00 9.63 70.37 98.52 RNAfold 0.00 7.41 65.93 97.78 telomerase TRIPLE 23.53 29.41 41.18 58.82 telomerase TRIPLE 29.41 35.29 41.18 58.82 NUPACK 5.88 29.41 29.41 52.94 RNAfold 0.00 23.53 41.18 58.82 RNase TRIPLE 38.03 56.34 72.54 87.32 RNase TRIPLE 50.70 70.42 81.69 92.25 NUPACK 10.56 26.06 52.11 76.06 RNAfold 1.41 12.68 34.51 59.15 regulatory TRIPLE 18.97 25.86 31.03 51.72 regulatory TRIPLE 22.41 24.14 32.76 56.90 NUPACK 0.00 1.72 24.14 50.00 RNAfold 0.00 6.90 27.59 63.79 tmRNA TRIPLE 15.25 27.12 38.98 57.63 tmRNA TRIPLE 18.64 32.20 45.76 55.93 NUPACK 3.39 6.78 27.12 42.37 RNAfold 1.69 10.17 33.90 50.85 rRNA TRIPLE 34.09 47.31 64.88 79.96 rRNA TRIPLE 36.16 50.62 70.87 83.06 NUPACK 6.40 21.69 43.19 60.74 RNAfold 1.45 15.70 35.33 56.82 Random sequences were obtained with single nucleotide shuffling of the real Random sequences were obtained with di-nucleotide shuffling of the real ncRNA sequences. ncRNA sequences. respectively. In addition, RNAfold was tested with the ncRNA data sets. In particular, for each RNA sequence available program options (tables not shown). With from 13 ncRNA data sets, 100 sequence segments of the option “noLP” on RNAfold, TRIPLE performs better in same length were sampled from each genome sequence 13, 13, 11 (tie 1), and 9 di-nucleotide shuffling datasets and used to test against the RNA sequence to calculate in terms of threshold of 2, 1.5, 1, and 0.5, respectively. base pairing entropies and Z-score. With such genome In single nucleotide shuffling datasets, TRIPLE wins 13, backgrounds, the overall performance of TRIPLE on the 13, 12 and 8 datasets separately with threshold of 2, 1.5, 13 ncRNA data sets is mixed and is close to that of 1, and 0.5. NUPACK and RNAfold (data not shown). This perfor- When we specify “noLP” and “noCloseGU” on RNA- mance of TRIPLE on real genomes indicates that there fold, TRIPLE beats RNAfold in 13, 13, 12, and 11 di- is still a gap between the ability of our method and suc- nucleotide shuffling datasets, and 13, 13, 13, and 11 sin- cessful ncRNA gene finding. Nevertheless, the test gle nucleotide shuffling datasets with threshold 2, 1.5, 1, results reveal that the constrained “triple base pairs” and 0.5, respectively. If we specify “noLP” and “noGU” model is necessary but still not sufficient enough. This on RNAfold, our method performs better on all di- suggests incorporating further structural constraints will nucleotide shuffling and single nucleotide shuffling data- improve the effectiveness for ncRNA search on real sets with all four thresholds. genomes. We also compared TRIPLE, NUPACK, and RNAfold To roughly evaluate the speed of the three tools, the on some real genome background tests. Several genome running time for 101 sequences, including 1 real sequences from bacteria, archaea, and eukaryotes were miRNA sequence and its 100 single nucleotide shuffled retrieved from the NCBI database. Using these genome sequences, was measured on a Linux machine with an sequences, we created genome backgrounds for the 13 Intel dual-core CPU (E7500 2.93 GHz). Each sequence Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 5 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 Table 4 Comparisons of TRIPLE and RNAfold by the We note that there is only one exceptional case percentages of sequences falling in each category of a Z- observed from Table 1, 2, 3, 4: SRP whose entropy Z- score range. score performance was not improved (as much as other Dataset Method ≥2 (%) ≥1.5 (%) ≥1 (%) ≥0.5 (%) ncRNAs) when Z<1.5. The problem might have been Hh1 TRIPLE 6.67 33.33 53.33 73.33 caused by the implementation technique rather than the RNAfold 0.00 0.00 20.00 53.33 methodology. Most of the tested SRP RNA sequences sno_guide TRIPLE 14.91 25.43 41.10 57.95 (Eukaryotic and archaeal 7S RNAs) are of length around RNAfold 1.47 7.33 24.21 44.01 300 and contain about a dozen stems. In many of them, consecutive base pairs are broken by internal loops into sn_splice TRIPLE 31.65 43.04 56.96 65.82 small stem pieces, some having only two consecutive RNAfold 6.33 24.05 53.16 68.35 canonical pairs; whereas, in our SCFG implementation SRP TRIPLE 32.47 45.45 55.84 68.83 we simply required three consecutive base pairs as a RNAfold 5.19 29.87 59.74 77.92 must in a stem, possibly missing the secondary structure tRNA TRIPLE 24.07 45.31 64.25 79.47 of many of these sequences. This issue with the SCFG RNAfold 0.00 6.19 26.19 48.85 can be easily fixed, e.g., by replacing the SCFG with one intron TRIPLE 59.81 68.22 74.77 84.11 that better represents the constrained Boltzmann RNAfold 1.87 16.82 58.88 85.98 ensemble in which stems are all energetically stable. riboswitch TRIPLE 32.03 44.44 56.86 71.90 To ensure that the performance difference between RNAfold 1.31 20.92 49.67 71.24 TRIPLE and energy model based software (NUPACK miRNA TRIPLE 75.56 81.48 90.37 93.33 and RNAfold) was not due to the difference in the ther- RNAfold 0.74 10.37 69.63 97.78 modynamic energy model (Boltzmann ensemble) and telomerase TRIPLE 23.53 29.41 41.18 58.82 the simple statistical model (SCFG) with stacking rules, RNAfold 5.88 17.65 35.29 58.82 RNase TRIPLE 38.03 56.34 72.54 87.32 we also constructed two additional SCFG models, one RNAfold 1.41 15.49 35.92 61.27 for unconstrained base pairs and another requiring at regulatory TRIPLE 18.97 25.86 31.03 51.72 least two consecutive canonical base pairs in stems. RNAfold 0.00 5.17 32.76 67.24 Tests on these two models over the 13 ncRNA data set tmRNA TRIPLE 15.25 27.12 38.98 57.63 resulted in entropy Z-scores (data not shown) compar- RNAfold 0.00 11.86 35.59 45.76 able to those obtained by NUPACK and RNAfold but rRNA TRIPLE 34.09 47.31 64.88 79.96 inferior to the performance of TRIPLE. We attribute the RNAfold 1.86 17.98 37.60 57.64 impressive performance by TRIPLE to the constraint of “triple base pairs” satisfied by real ncRNA sequences but Random sequences were obtained with single nucleotide shuffling of the real ncRNA sequences. which is hard to achieve for random sequences. Since the entropy Z-score improvement by our has 100 nucleotides. TRIPLE, NUPACK, and RNAfold method was not uniform across the 13 ncRNAs, one spent 20.7 seconds, 36.2 seconds and 3.4 seconds, may want to look into additional other factors that respectively. We point out that TRIPLE has the poten- might have contributed to the under-performance of tial to be optimized for each specific grammar to certain ncRNAs. For example, the averaged GC contents improves its efficiency. are different in these 13 datasets, with SRP RNAs having 58% GC and standard deviation of 10.4%. A sequence Discussion with a high GC content is more likely to produce more This work introduced a modified ensemble of ncRNA spurious, alternative structures, possibly resulting in a secondary structures with the constraint of requiring higher base pairing entropy. However, since randomly only canonical base pairs to only occur and that stems shuffled sequences would also have the same GC con- must be energetically stable in all the alternative struc- tent, it becomes very difficult to determine if the entro- tures. The comparisons of performances between our pies of these sequences have been considerably affected program TRIPLE and energy model based software by the GC bias. Indeed, previous investigations [25] (NUPACK and RNAfold) implemented based on the have revealed that, while the base composition of a canonical structure ensemble have demonstrated a sig- ncRNA is related to the phylogenetic branches on which nificant improvement in the entropy measure for the specific ncRNA may be placed, it may not fully ncRNA fold certainty by our model. In particular, an explain the diverse performances of structure measures improvement of the entropy Z-scores was shown across on various ncRNAs. Notably it has been discovered that almost all 13 tested ncRNAs datasets previously used to base compositions are distinct in different parts of test various ncRNA measures [18]. rRNA secondary structure (stems, loops, bulges, and Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 6 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 junctions) [26], suggesting that an averaged base compo- model is constrained, however, to contain a smaller sition may not suitably represent the global structural space of equilibrium alternative structures, requiring behavior of an ncRNA sequence. there are only energetically stable stems (e.g., of free Technically the TRIPLE program was implemented energy levels under a threshold) to occur in the struc- with an SCFG that assumes stems to have at least three tures. The constraint is basically to consider the effect consecutive canonical base pairs. Yet, as we pointed out of energetically stable stems on tertiary folding and to earlier, the performance results should hold for a con- remove spurious structures that may not correspond to strained Boltzmann ensemble in which stems are a tertiary fold. According to the RNA folding pathway required to be energetically stable. This constraint of theory and the hierarchical folding model [27,28,30], stable stems was intended to capture the energetic stabi- building block helices are first stabilized by canonical lity of helical structures in the native tertiary fold base pairings before being arranged to interact with [27,28]. Since the ultimate distinction between a ncRNA each other or with unpaired strands through tertiary and a random sequence lies in its function (thus tertiary motifs (non-canonical nucleotide interactions). A typical structure); additional, critical tertiary characteristics may example is the multi-loop junctions in which one or be incorporated into the structure ensemble to further more pairs of coaxially stacked helices bring three or improve the fold certainty measure. In our testing of more regions together, further stabilized by the tertiary stem stability (see section “Energetically stable stems”), motifs at the junctions [31,32]. The helices involved are ncRNA sequences from the 51 datasets demonstrated stable before the junction is formed or any possible certain sequential properties that may characterize ter- nucleotide interaction modifications are made to the tiary interactions, e.g., coaxial stacking of helices. How- helical base pairs at the junction [33]. ever, to computationally model tertiary interactions, a model beyond a context-free system would be necessary; Energetically stable stems thus it would be difficult to use an SCFG or a Boltz- A stem is the atomic, structural unit of the new second- mann ensemble for this purpose. We need to develop ary structure space. To identify the energy levels of methods to identify tertiary contributions critical to the stemssuitabletobeincludedin thismodel,wecon- Shannon base pairing entropy measure and to model ducted a survey on the 51 sets of ncRNA seed align- such contributions. Although this method and technique ments, representatives of the ncRNAs in Rfam [34], have been developed with reference to non-coding which had been used with the software Infernal [35] as RNAs, it is possible that protein-coding mRNAs would benchmarks. From each ncRNA seed structural align- display similar properties, when sufficient structural ment, we computed the thermodynamic free energy of information about them has been gathered. everyinstanceofa stem in thealignmentdatausing various functions of the Vienna Package [12,29] as fol- Conclusions lows. RNAduplex was first applied to the two strands of We present work developing structure measures that the stem marked by the annotation to predict the opti- can effectively distinguish ncRNAs from random mal base pairings within the stem, then, the minimum sequences. We compute Shannon base pairing entropies free energy of the predicted stem structure, with over- based on a constrained secondary structure model that hangs removed, was computed with RNAeval. Figures 2 favors tertiary folding. Experimental results indicate that and 3 respectively show plots of the percentages and our approach significantly improves the Z-score of base cumulative percentages of free energy levels of stems in pairing Shannon entropies on 13 ncRNA datasets [18] these 51 ncRNA seed alignments. in comparison to that computed by NUPACK [23] and The peaks (with relatively high percentages) on the RNAfold [12,29]. These results shows that investigating percentage curve of Figure 2 indicate concentrations of secondary structure ensembles of ncRNAs is helpful for certain types of stems at energies levels around -4.5, developing effective structure-based ncRNA gene finding -3.3, and -2.4 kcal/mol. Since a G-U pair is counted methods. weakly towards the free energy contribution (by the Vienna package), we identified the peak value -4.5 kcal/ Method and model mol to be the free energy of stems of three base pairs, Our method to distinguish ncRNAs from random with two G-C pairs and one A-U in the middle or two sequences is based on measuring of the base pairing A-U pairs and one G-C in the middle. The value -3.3 Shannon entropy [15,16] under a new RNA secondary kcal/mol is the free energy of stems containing exactly structure model. The building blocks of this model are two G-C pairs or stems with one G-C pair followed by stems arranged in parallel and nested patterns con- two A-U pairs. Values around -2.4 kcal/mol are stems nected by unpaired strand segments, similar to those containing one G-C and an A-U pair or simply four A- U pairs. permitted by a standard ensemble [11,17,29]. The new Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 7 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 Figure 2 Percentages of free-energy of stems. Percentages of free-energy of stems from 51 Rfam datasets (percentages of stems with free- energy less than -12 are not given in this figure). 85.6% of the total while the group having distance 1 is Based on this survey, we were able to identify two energy thresholds: -3.4 and -4.6 kcal/mol for semi-stable 10.6%. Since zero distance between two stems may stems and stable stems respectively. Both require at least reflect a contiguous strand connecting two coaxially three base pairs of which at least one is G-C pair. We stacked helices in tertiary structure, our survey suggests further observed the difference between these two cate- a semi-stable stem interacts with another stem to main- gories of stems on the 51 ncRNA datasets. In general, tain even its own local stability. In the rest of this work, although levels of energy appear to be somewhat uni- we do not distinguish between stable and semi-stable formly distributed (see Figure 3), an overwhelmingly stems. In conducting this survey, we did not directly use large percentage of stems in both categories are located the stem structures annotated in the seed alignments to in the vicinity of other stems. In particular, 79.6% of compute their energies. Due to evolution, substantial stable stems(with afreeenergy-4.6kcal/molorlower) structural variation may occur across species; one stem have 0 (number of nucleotides) distance from their clo- maybe present in onesequenceand absent in another sest neighbor stem and 16.5% of stable stems have dis- but a structural alignment algorithm may try to align all tance 1 from their closest neighbors. For semi-stable sequences to the consensus stem, giving rise to “misa- stems, the group having zero distance to other stems is lignments” which we have observed [36]. Most of such Figure 3 Cumulative percentages of free-energy of stems. Cumulative percentages of free-energy of stems from 51 Rfam datasets (cumulative percentages of stems with free-energy less than -12 are not given in this figure). Note the step at -3.4. Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 8 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 “malformed stems” mistakenly aligned to the consensus Let p be the probability associated with the produc- often contain bulges or internal loops and have higher tion rule i,for i = 1, 2,...,7, respectively. Since the sum- free energies greater than the threshold -3.4 kcal/mol. mation of probabilities of rules with the same non- terminal on theleft-hand-sideisrequiredtobe1,we The RNA secondary structure model can establish the following equations: In the present study, a secondary structure model is p + p + p + p =1 ⎨ 1 2 3 4 defined with a Stochastic Context Free Grammar (SCFG) p + p =1 5 6 [37]. Our model requires there are at least three consecu- p =1 tive base pairs in every stem; the constraint is described with the following seven generic production rules: Let (1) X ® a (2) X ® aX (3) X ® aHb q = 0.25 × 0.25 × 0.17 × 0.17 × 0.08 × 0.08 bp (4) X ® aHbX (5) H ® aHb (6) H ® aYb (7) Y ® aXb be the geometric average of the six base pair probabil- ities. According to the principle of maximum entropy, where capital letters are non-terminal symbols that given we have no prior knowledge of a probability distri- define substructures and low case letters are terminals, bution, the assumption of a distribution with the maxi- each being one of the four nucleotides A, C, G, and U. mumentropy is thebestchoice, sinceitwilltakethe The starting non-terminal, X, can generate an smallest risk [40]. If we apply this principle to our pro- unpaired nucleotide or a base pair with the first three blem, the probability contribution from a base pair rules. The fourth rule generates two parallel substruc- should be close to the contribution from unpaired bases. tures. Non-terminal H is used to generate consecutive Rule probabilities can be estimated to satisfy following base pairs with non-terminal Y to generate the closing equations: base pair. Essentially, the process of generating a stem needs to recursively call production rules with the left- p = p ⎪ 1 2 hand-side non-terminals X, H and Y each at least once. p = p 3 4 3 6 This constraint guarantees that every stem has at least ⎪ (q ) × p × p × p = (0.25 × p ) bp 3 6 7 1 4 8 three consecutive base pairs, as required by our second- (q ) × p × p × p × p = (0.25 × p ) bp 3 5 6 7 1 ary structure model. From above equations, it follows that Probability parameter calculation p = 0.499 p = 0.499 p =0.001 1 2 3 There are two sets of probability parameters associated p =0.001 p =0.103 p = 0.897 4 5 6 with the induced SCFG. First, we used a simple scheme p =1 of probability settings for the unpaired bases and base pairs, with a uniform 0.25 probability for every base. The probability distribution of {0.25, 0.25, 0.17, 0.17, Computing base pairing Shannon entropy 0.08, 0.08} is given to the six canonical base pairs G-C, Based on the new RNA secondary structure model, we C-G, A-U, U-A, G-U, and U-G; a probability of zero is can compute the fold certainty of any given RNA given to all non-canonical base pairs. Alternatively, sequence, which is defined as the Shannon entropy mea- probabilities for unpaired bases and base pairs may be sured on base pairings formed by the sequence over the estimated from available RNA datasets with known sec- specified secondary structure space Ω. Specifically, let ondary structures [34], as has been done in some of the thesequencebe x = x x ... x of n nucleotides. For 1 2 n previously work with SCFGs [38,39]. indexes i<j, the probability P of base pairing between i,j Second, we computed the probabilities for the produc- bases x and x is computed with i j tion rules of the model as follows. To allow our method to be applicable to all structural ncRNAs, we did not P (x)= p(s, x)δ(x) i,j i,j (2) estimate the probabilities based on a training data set. s∈ In fact, we believe that the probability parameter setting of an SCFG for the fold certainty measure should be dif- where p(s, x) is the probability of x being folded into δ(x) ferent from that for fold stability measure (i.e., folding). to the structure s in the space Ω and is a binary i,j Based on the principle of maximum entropy, we devel- value indicator for the occurrence of base pair (x , x)in i j oped the following approach to calculate the probabil- structure s.The Shannonentropy of P (x) is computed i,j ities for the rules in our SCFG model. as [15,16] Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 9 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 Prob(S → aRbT, a = x , b = x )γ (R, S, T, i, j, x) i j Q(x)= − P (x)log P (x) i,j i,j (3) S→aRbT (4) i<j α(S ,1, n, x) To compute the expected frequency of the base pair- where ing, P (x), with formula(2),wetakeadvantage of the i,j Inside and Outside algorithms developed for SCFG [37]. γ (R, S, T, i, j, x)= α(R, i +1, j − 1, x) Given any nonterminal symbol S in the grammar, the j<k≤n inside probability is defined as × β(S, i, k, x) × α(T, j +1, k, x) α(S, i, j, x)= Prob(S⇒ x x ··· x ) i i+1 j in which variables S, R, T are for non-terminals and variable production S ® aRbT represents rules (3)~(7) i.e., the total probability for the sequence segment x x i i which involve base pair generations. For rules where T ... x to adopt alternative substructures specified by S. +1 j is empty, the summation and term a(T, j +1, k, x)do Assume S to be the initial nonterminal symbol for the not exist and k is fixed as j. SCFG model. Then a(S ,1, n, x) is the total probability The efficiency to compute P (x) mostly depends on of the sequence x’s folding under the model. i,j computing the Inside and Outside probabilities, which The outside probability is defined as can be accomplished with dynamic programming and β(S, i, j, x)= Prob(S ⇒ x ··· x Sx ··· x ) 0 1 i−1 j+1 n has the time complexity O(mn ) for a model of m non- terminals and rules and sequence length n. i.e., the total probability for the whole sequence x ... x to adopt all alternative substructures that allow the Acknowledgements sequence segment from position i to position j to adopt This research project was supported in part by NSF MRI 0821263, NIH BISTI any substructure specified by S (see Figure 4 for R01GM072080-01A1 grant, NIH ARRA Administrative Supplement to NIH illustration). BISTI R01GM072080-01A1, and NSF IIS grant of award No: 0916250. This article has been published as part of BMC Bioinformatics Volume 13 P (x) then can be computed as the normalized prob- i,j Supplement 5, 2012: Selected articles from the First IEEE International ability of the base pair (x , x ) occurring in all valid alter- i j Conference on Computational Advances in Bio and medical Sciences native secondary structures of x: (ICCABS 2011): Bioinformatics. The full contents of the supplement are Figure 4 Illustration of the application of the generic production rule. Illustration of the application of the generic production rule S ® aRbT that produces a base pair between positions i and j for the query sequence x, provided that the start non-terminal S derives x x ... x Sx 0 1 2 i-1 k ... x . Note that given i and j, the position of k can vary. +1 n Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 10 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 available online at http://www.biomedcentral.com/bmcbioinformatics/ 18. Freyhult E, Gardner P, Moulton V: A comparison of RNA folding measures. supplements/13/S5. BMC Bioinformatics 2005, 6(241). 19. Ding Y, Lawrence C: A statistical sampling algorithm for RNA secondary Author details structure prediction. Nucl Acids Res 2003, 31(24):7280-7301. Department of Computer Science, University of Georgia, Athens, Georgia 20. Walter A, Turner D, Kim J, Matthew H, Muller P, Mathews D, Zuker M: 30602, USA. Department of Plant Biology, University of Georgia, Athens, Coaxial stacking of helices enhances binding of oligoribonucleotides Georgia 30602, USA. Institute of Bioinformatics, University of Georgia, and improves predictions of RNA folding. Proc Natl Acad Sci USA 1994, Athens, Georgia 30602, USA. Center for Simulational Physics, University of 91(20):9218-9222. Georgia, Athens, Georgia 30602, USA. 21. Tyagi R, Mathews D: Predicting helical coaxial stacking in RNA multibranch loops. RNA 2007, 13(7):939-951. Authors’ contributions 22. Dirks R, Pierce N: An algorithm for computing nucleic acid base-pairing YW contributed to grammar design, algorithm development, program probabilities including pseudoknots. J Comput Chem 2004, 25:1295-1304. implementation, data acquisition, tests, result analysis, and manuscript 23. Dirks R, Bois J, Schaeffer J, Winfree E, Pierce N: Thermodynamic analysis of drafting. AM contributed to algorithm design and program implementation. interacting nucleic acid strands. SIAM Rev 2007, 49:65-88. PS and TIS contributed to data acquisition and tests. YWL participated in 24. Kolmogorov A: Sulla determinazione empirica di una legge di model discussion. RLM contributed to the supervision, data acquisition, distribuzione. G Inst Ital Attuari 1933, 4:83-91. results analyses, biological insights, and manuscript drafting. LC conceived 25. Schultes E, Hraber P, LaBean T: Estimating the contributions of selection the overall model and algorithm and drafted the manuscript. All authors and self-organization in secondary structure. Journal of Molecular read and approved the manuscript. Evolution 1999, 49:76-83. 26. Smit S, Yarus M, Knight B: Natural selection is not required to explain Competing interests universal compositional patterns in rRNA secondary structure categories. The authors declare that they have no competing interests. RNA 2006, 12:1-14. 27. Masquida B, Westhof E: A modular and hierarchical approach for all-atom Published: 12 April 2012 RNA modeling. In The RNA World.. 3 edition. Cold Spring Harbor Laboratory Press;Gesteland RF, Cech TR, Atkins JF 2006:659-681. 28. Tinoco I, Bustamante C: How RNA folds. Journal of Molecular Biology 1999, References 293(2):271-281. 1. Eddy S: Non-coding RNA genes and the modern RNA world. Nature 29. Hofacker IL: Vienna RNA secondary structure server. Nucleic Acids Research Reviews Genetics 2001, 2(12):919-929. 2003, 31(13):3429-3431. 2. Schattner P: Computational gene-finding for noncoding RNAs. In 30. Batey RT, Rambo RP, Doudna JA: Tertiary motifs in RNA structure and Noncoding RNAs: Molecular Biology and Molecular Medicine. Springer; folding. Angew Chem Int Ed Engl 1999, 38:2326-2343. Barciszewski, Erdmann 2003:33-48. 31. Lescoute A, Westhof E: Topology of three-way junctions in folded RNAs. 3. Griffiths-Jones S: Annotating noncoding RNA genes. Annu Rev Genomics RNA 2006, 12:83-93. Hum Genet 2007, 8:279-298. 32. Laing C, T S: Analysis of four-way junctions in RNA structures. J Mol Biol 4. Eddy S: Computational genomics of noncoding RNA genes. Cell 2002, 2009, 390(3):547-559. 109(2):137-140. 33. Thirumalai D: Native secondary structure formation in RNA may be a 5. Uzilov AV, Keegan JM, Mathews DH: Detection of non-coding RNAs on slave to tertiary folding. Proc Natl Acad Sci USA 1998, 95(20):11506-11508. the basis of predicted secondary structure formation free energy 34. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy S, Bateman A: Rfam: change. BMC Bioinformatics 2006, 7:173. Annotating Non-Coding RNAs in Complete Genomes. Nucleic Acids 6. Machado-Lima A, del Portillo HA, AM D: Computational methods in Research 2005, 33:D121-D141. noncoding RNA research. Journal of Mathematical Biology 2008, 56(1- 35. Nawrocki E, Kolbe D, Eddy S: Infernal 1.0: inference of RNA alignments. 2):15-49. Bioinformatics 2009, 25(10):1335-1337. 7. Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of 36. Huang Z, M M, Malmberg R, Cai L: RNAv: Non-coding RNA secondary noncoding RNAs. Proc Natl Acad Sci USA 2005, 102(7):2454-2459. structure variation search via graph Homomorphism. Proceedings of 8. Pedersen J, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander E, Computational Systems Bioinformatics; Stanford 2010, 56-68. Rogers J, Kent J, Miller W, Haussler D: Identification and classification of 37. Durbin R, Eddy S, Krogh A, Mitchison GJ: Biological Sequence Analysis: conserved RNA secondary structures in the human genome. PLoS Probabilistic Models of Proteins and Nucleic Acids Cambridge, UK: Cambridge Computational Biology 2006, 2(4):e33. University Press; 1998. 9. Turner D, Sugimoto N, Kierzek R, Dreiker S: Free energy increments for 38. Klein RJ, Eddy SR: RSEARCH: finding homologs of single structured RNA bydrogene bounds in nucleic acid base pairs. Journal of Am Chem Soc sequences. BMC Bioinformatics 2003, 4:44. 1987, 109:3783-3785. 39. Knudsen B, Hein J: Pfold: RNA secondary structure prediction using 10. Turner DH, Sugimoto N, Freier SM: RNA structure prediction. Annu Rev stochastic context-free grammars. Nucl Acids Res 2003, 31(13):3423-3428. Biophys Biophys Chem 1988, 17:167-192. 40. Jaynes ET: Prior probabilities. IEEE Transactions on Systems Science and 11. Zuker M, Steigler P: Optimal computer folding of larger RNA sequences Cybernetics 1968, 4(3):227-241. using thermodynamics and auxiliary information. Nucleic Acids Res 1981, 9:133-148. doi:10.1186/1471-2105-13-S5-S1 12. Hofacker I, Fontana W, Stadler P, Bonhoeffer L, tacker M, Schuster P: Fast Cite this article as: Wang et al.: Stable stem enabled Shannon entropies folding and comparison of RNA sequence structures. Monatsh Chem distinguish non-coding RNAs from random backgrounds. BMC 1994, 125:167-168. Bioinformatics 2012 13(Suppl 5):S1. 13. Moulton V: Tracking down noncoding RNAs. Proc Natl Acad Sci USA 2005, 102(7):2269-2270. 14. Bonnet E, Wuyts J, Rouzé P, Van de Peer Y: Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics 2004, 20(17):2911-2917. 15. Mathews D: Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 2004, 10(8):1178-1190. 16. Huynen M, Gutell R, Konings D: Assessing the reliability of RNA folding using statistical mechanics. Journal of Molecular Biology 1997, 267:1104-1112. 17. McCaskill J: The equilibrium partition function and base pair probabilities for RNA secondary structure. Biopolymers 1990, 29(6-7):1105-1119. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Stable stem enabled Shannon entropies distinguish non-coding RNAs from random backgrounds

Loading next page...
 
/lp/springer-journals/stable-stem-enabled-shannon-entropies-distinguish-non-coding-rnas-from-pB3pZIr0X2

References (89)

Publisher
Springer Journals
Copyright
Copyright © 2012 by Wang et al.; licensee BioMed Central Ltd.
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN
1471-2105
DOI
10.1186/1471-2105-13-S5-S1
pmid
22537005
Publisher site
See Article on Publisher Site

Abstract

Background: The computational identification of RNAs in genomic sequences requires the identification of signals of RNA sequences. Shannon base pairing entropy is an indicator for RNA secondary structure fold certainty in detection of structural, non-coding RNAs (ncRNAs). Under the Boltzmann ensemble of secondary structures, the probability of a base pair is estimated from its frequency across all the alternative equilibrium structures. However, such an entropy has yet to deliver the desired performance for distinguishing ncRNAs from random sequences. Developing novel methods to improve the entropy measure performance may result in more effective ncRNA gene finding based on structure detection. Results: This paper shows that the measuring performance of base pairing entropy can be significantly improved with a constrained secondary structure ensemble in which only canonical base pairs are assumed to occur in energetically stable stems in a fold. This constraint actually reduces the space of the secondary structure and may lower the probabilities of base pairs unfavorable to the native fold. Indeed, base pairing entropies computed with this constrained model demonstrate substantially narrowed gaps of Z-scores between ncRNAs, as well as drastic increases in the Z-score for all 13 tested ncRNA sets, compared to shuffled sequences. Conclusions: These results suggest the viability of developing effective structure-based ncRNA gene finding methods by investigating secondary structure ensembles of ncRNAs. Background program is its secondary structure prediction mechan- Statistical signals in primary sequences for non-coding ism, for example, based on computing the minimum RNA (ncRNA) genes have been evasive [1-3]. Because free energy for the query sequence under some thermo- single strand RNA folds into a structure, the most dynamic energy model [9-12]. The hypothesis is that the exploitable feature for structural ncRNA gene finding ncRNAs’ secondary structure is thermodynamically has been the secondary structure [4-6]. The possibility stable. Nonetheless, stability measures have not per- formed as well as one might hope [13]; there is evidence that folded secondary structure may lead to successful ab initio ncRNA gene prediction methods has energized that the measures may not be effective on all categories leading groups to independently develop structure-based of ncRNAs [14]. ncRNA gene finding methods [7,8]. The core of such a A predicted secondary structure can be characterized for its fold certainty, using the Shannon base pairing entropy [15,16]. The entropy ∑p log p of base pair- * Correspondence: [email protected]; [email protected] i,j i,j Department of Computer Science, University of Georgia, Athens, Georgia ings between all bases i and j can be calculated based 30602, USA on the partition function for the Boltzmann secondary Full list of author information is available at the end of the article © 2012 Wang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 2 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 structure ensemble, which is the space of all alternative performance on these ncRNAs with the performance secondary structures of a given sequence; the probability achieved by the software NUPACK [23] and RNAfold p is calculated as the total of Boltzmann factors over [12,29] developed under the Boltzmann standard sec- i,j all equilibrium alternative structures that contain the ondary structure ensemble [17,22]. base pair (i, j) [17]. As an uncertainty measure, the base pairing Shannon entropy is maximized when base pair- Data preparation ing probabilities are uniformly distributed. A structural We downloaded the 13 ncRNA datasets previously RNAsequenceisassumed to havealow base pairing investigated in Table 1 of [18]. They are of diverse func- Shannon entropy, since the distribution of its base pair- tions, including pre-cursor microRNAs, group I and II ing probabilities is far from uniform. The entropy mea- introns, RNase P and MRP, bacterial and eukaryotic sig- sure has been scrutinized with real ncRNA data nal recognition particle (SRP), ribosomal RNAs, small revealing a strong correlation between entropy and free nuclear spliceosomal RNAs, riboswitches, tmRNAs, reg- energy [18,19]. However, there has been mixed success ulatory RNAs, tRNAs, telomerase RNAs, small nucleolar in discerning structural ncRNAs from their randomly RNAs, and Hammerhead ribozymes. shuffled counterparts. Both measures perform impress- The results from using these datasets were analyzed ively on precursor miRNAs but not as well on tRNAs with 6 different types of measures, including Z-score and some rRNAs [14,18]. and p-value of minimal free energy (MFE), and Shannon The diverse results of the entropy measuring on dif- base pairing entropy [18], in comparisons with random ferent ncRNAs suggest that the canonical RNA second- sequences. The six measures correlate to varying ary structure ensemble has yet to capture all ncRNAs degrees, hence using MFE Z-score and Shannon base structural characteristics. For example, a Boltzmann pairing entropy may be sufficient to cover the other ensemble enhanced with weighted equilibrium alterna- measures. However, these two measures, as the respec- tive structures has also resulted in higher accuracy in tive indicators for the fold stability and fold certainty of secondary structure prediction [19]. There is strong evi- ncRNA secondary structure, have varying performances dence that the thermodynamic energy model can on the 13 ncRNA datasets. improve its structure prediction accuracy by considering For our tests, we also generated random sequences as energy contributions in addition to those from the cano- control data. For every ncRNA sequence, we randomly nical free energy model [20,21]. Therefore, developing shuffled it to produce two sets of 100 random sequences ncRNA structure models that can more effectively each; one set was based upon single nucleotide shuffling, account for critical structural characteristics may the other was based upon di-nucleotide shuffling. In become necessary for accurate measurement of RNA addition, all ncRNA sequences containing nucleotides fold certainty. other than A, C, G, T, and U were removed for the rea- In this paper, we present work that computes Shannon son that NUPACK [23] doesn’t accept sequences con- base pairing entropies based on a constrained secondary taining wildcard symbols. structure model. The results show substantial improve- ments in the Z-score of base pairing Shannon entropies Shannon entropy distribution of random sequences on 13 ncRNA datasets [18] over the Z-score of entropies Two energy model based softwares, NUPACK (with the computed by existing software (e.g., NUPACK [23] and pseudoknot function turned off) and RNAfold, and our RNAfold [12,29]) with the canonical (Boltzmann) sec- program TRIPLE computed base pairing probabilities on ondarystructureensembleand theassociated partition ncRNA sequences and on random sequences. In parti- function [22]. Our limited constraint to the secondary cular, for every ncRNA sequence x and its associated structure space is to require only canonical base pairs to randomly shuffled sequence set S , the Shannon entro- occur in stable stems. The constrained secondary struc- pies of these sequences were computed. ture model is defined with a stochastic context-free A Kolmogorov-Smirnov test (KS test) [24] was applied grammar (SCFG) and entropies are computed with the to verify the normality of the entropy distributions from Inside and Outside algorithms. Our results suggest that all randomly shuffled sequence sets. The results show incorporating more constraints may further improve the that for99% of thesequencesetswefailtorejectthe effectiveness of the fold certainty measure, offering hypothesis that entropies are normally distributed with improved ab initio ncRNA gene finding. 95% confidence level. This indicates that we may use a Z-score to measure performance. Results We implemented the algorithm for Shannon base pair- Z-score scores and comparisons ing entropy calculation into a program named TRIPLE. For each ncRNA, the average and standard deviation of We tested it on ncRNA datasets and compared its Shannon entropies of the randomly shuffled sequences Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 3 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 were estimated. The Z-score of the Shannon entropy Q Table 1 Comparisons of TRIPLE and NUPACK by the percentages of sequences falling in each category of a Z- (x) of ncRNA sequence x is defined as follows: score range. μ(Q(S ))–Q(x) ncRNA Method Z ≥ 2.0 Z ≥ 1.5 Z ≥ 1.0 Z ≥ 0.5 Z(x)= (1) σ (Q(S )) Hh1 TRIPLE 26.67 40.00 53.33 73.33 NUPACK 0.00 0.00 20.00 53.33 where μ(Q(S )) and s(Q(S )) respectively denote the x x sno_guide TRIPLE 14.43 24.45 38.39 58.19 average and standard deviation of the Shannon entropies NUPACK 0.73 8.80 27.63 45.23 of the random sequences in set S .The Z-Scoremea- sn_splice TRIPLE 40.51 50.63 60.76 65.82 sures how well entropies may distinguish the real NUPACK 3.80 18.99 48.10 70.89 ncRNA sequence x from their corresponding randomly SRP TRIPLE 35.06 44.16 59.74 67.53 shuffled sequences in S . Figure 1 compares the averages NUPACK 3.90 36.36 72.73 85.71 of the Z-scores of Shannon base pairing entropies com- tRNA TRIPLE 29.56 51.33 70.97 86.02 puted by NUPACK, RNAfold, and TRIPLE on each of NUPACK 0.00 2.30 12.04 32.21 the 13 ncRNA datasets. It shows that TRIPLE signifi- intron TRIPLE 60.75 69.16 78.50 85.98 cantly improved the Z-scores over NUPACK and RNA- NUPACK 1.87 19.63 61.68 85.05 fold across all the 13 datasets. riboswitch TRIPLE 34.64 48.37 60.13 78.43 To examine how the Z-scores might have been NUPACK 1.96 18.95 45.75 69.28 improved by TRIPLE, we designated four thresholds for miRNA TRIPLE 81.48 88.89 94.07 97.04 Z-scores,which are2,1.5,1,and 0.5. Thepercentages NUPACK 0.00 12.59 68.15 97.78 of sequences of each dataset with Z-score greater than telomerase TRIPLE 29.41 35.29 41.18 58.82 or equal to the thresholds were computed. NUPACK 11.76 17.65 35.29 47.06 Table 1 shows details of the Z-score improvements RNase TRIPLE 50.70 70.42 81.69 92.25 over NUPACK when di-nucleotide shuffling was used. NUPACK 5.63 23.94 48.59 72.54 With a threshold 2 or 1.5, our method performed better regulatory TRIPLE 22.41 24.14 32.76 56.90 than NUPACK in all datasets. With the threshold 1 and NUPACK 1.72 3.45 18.97 51.72 0.5, our method improved upon NUPACK in 12 and 10 tmRNA TRIPLE 18.64 32.20 45.76 55.93 datasets, respectively. The results of TRIPLE and NUPACK 1.69 8.47 27.12 37.29 NUPACK using a single nucleotide random shuffling rRNA TRIPLE 36.16 50.62 70.87 83.06 are given in Table 2, which shows that our method also NUPACK 4.75 21.07 42.56 61.16 performs better than NUPACK in the majority of data- Random sequences were obtained with di-nucleotide shuffling of the real sets. In particular, TRIPLE performed better than ncRNA sequences. NUPACK in all datasets with threshold of 2; with threshold equal to 1.5 or 1, our method had better results than NUPACK in 12 datasets and in 9 datasets datasets with threshold equal to 2 and 1.5. With thresh- with threshold equal of 0.5. old of 1 and 0.5, TRIPLE wins 12 (tie 1) and 8 (tie 1) The results of RNAfold using the default setting are datasets, respectively. In Table 4, TRIPLE shows similar given in Table 3 and 4. Table 3 shows results on di- performanceonsinglenucleotideshufflingdatasets. It nucleotide shuffling datasets. TRIPLE works better in has better scores than RNAfold in 13, 13, 11, and 7 (tie the majority of datasets. It outperforms RNAfold in all 1) datasets with threshold of 2, 1.5, 1, and 0.5, Figure 1 Comparisons of averaged Z-score of Shannon base pairing entropies. Comparisons of averaged Z-score of Shannon base pairing entropies computed by NUPACK, RNAfold, and TRIPLE for each of the 13 ncRNA datasets downloaded from [18]. Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 4 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 Table 2 Comparisons of TRIPLE and NUPACK by the Table 3 Comparisons of TRIPLE and RNAfold by the percentages of sequences falling in each category of a Z- percentages of sequences falling in each category of a Z- score range. score range. ncRNA Method Z ≥2Z ≥ 1.5 Z ≥1Z ≥ 0.5 Dataset Method ≥2 (%) ≥1.5 (%) ≥1(%) ≥0.5 (%) Hh1 TRIPLE 6.67 33.33 53.33 73.33 Hh1 TRIPLE 26.67 40.00 53.33 73.33 NUPACK 0.00 0.00 20.00 60.00 RNAfold 0.00 0.00 20.00 53.33 sno_guide TRIPLE 14.91 25.43 41.10 57.95 sno_guide TRIPLE 14.43 24.45 38.39 58.19 NUPACK 0.98 9.05 28.85 45.72 RNAfold 1.71 7.82 23.96 43.03 sn_splice TRIPLE 31.65 43.04 56.96 65.82 sn_splice TRIPLE 40.51 50.63 60.76 65.82 NUPACK 5.06 26.58 51.90 69.62 RNAfold 6.33 21.52 54.43 69.62 SRP TRIPLE 32.47 45.45 55.84 68.83 SRP TRIPLE 35.06 44.16 59.74 67.53 NUPACK 3.90 37.66 72.73 87.01 RNAfold 5.19 24.68 58.44 71.43 tRNA TRIPLE 24.07 45.31 64.25 79.47 tRNA TRIPLE 29.56 51.33 70.97 86.02 NUPACK 0.00 2.12 14.69 33.45 RNAfold 0.18 4.25 24.78 47.96 intron TRIPLE 59.81 68.22 74.77 84.11 intron TRIPLE 60.75 69.16 78.50 85.98 NUPACK 1.87 22.43 66.36 85.98 RNAfold 2.80 17.76 60.75 84.11 riboswitch TRIPLE 32.03 44.44 56.86 71.90 riboswitch TRIPLE 34.64 48.37 60.13 78.43 NUPACK 1.96 21.57 46.41 69.28 RNAfold 0.65 17.65 47.06 70.59 miRNA TRIPLE 75.56 81.48 90.37 93.33 miRNA TRIPLE 81.48 88.89 94.07 97.04 NUPACK 0.00 9.63 70.37 98.52 RNAfold 0.00 7.41 65.93 97.78 telomerase TRIPLE 23.53 29.41 41.18 58.82 telomerase TRIPLE 29.41 35.29 41.18 58.82 NUPACK 5.88 29.41 29.41 52.94 RNAfold 0.00 23.53 41.18 58.82 RNase TRIPLE 38.03 56.34 72.54 87.32 RNase TRIPLE 50.70 70.42 81.69 92.25 NUPACK 10.56 26.06 52.11 76.06 RNAfold 1.41 12.68 34.51 59.15 regulatory TRIPLE 18.97 25.86 31.03 51.72 regulatory TRIPLE 22.41 24.14 32.76 56.90 NUPACK 0.00 1.72 24.14 50.00 RNAfold 0.00 6.90 27.59 63.79 tmRNA TRIPLE 15.25 27.12 38.98 57.63 tmRNA TRIPLE 18.64 32.20 45.76 55.93 NUPACK 3.39 6.78 27.12 42.37 RNAfold 1.69 10.17 33.90 50.85 rRNA TRIPLE 34.09 47.31 64.88 79.96 rRNA TRIPLE 36.16 50.62 70.87 83.06 NUPACK 6.40 21.69 43.19 60.74 RNAfold 1.45 15.70 35.33 56.82 Random sequences were obtained with single nucleotide shuffling of the real Random sequences were obtained with di-nucleotide shuffling of the real ncRNA sequences. ncRNA sequences. respectively. In addition, RNAfold was tested with the ncRNA data sets. In particular, for each RNA sequence available program options (tables not shown). With from 13 ncRNA data sets, 100 sequence segments of the option “noLP” on RNAfold, TRIPLE performs better in same length were sampled from each genome sequence 13, 13, 11 (tie 1), and 9 di-nucleotide shuffling datasets and used to test against the RNA sequence to calculate in terms of threshold of 2, 1.5, 1, and 0.5, respectively. base pairing entropies and Z-score. With such genome In single nucleotide shuffling datasets, TRIPLE wins 13, backgrounds, the overall performance of TRIPLE on the 13, 12 and 8 datasets separately with threshold of 2, 1.5, 13 ncRNA data sets is mixed and is close to that of 1, and 0.5. NUPACK and RNAfold (data not shown). This perfor- When we specify “noLP” and “noCloseGU” on RNA- mance of TRIPLE on real genomes indicates that there fold, TRIPLE beats RNAfold in 13, 13, 12, and 11 di- is still a gap between the ability of our method and suc- nucleotide shuffling datasets, and 13, 13, 13, and 11 sin- cessful ncRNA gene finding. Nevertheless, the test gle nucleotide shuffling datasets with threshold 2, 1.5, 1, results reveal that the constrained “triple base pairs” and 0.5, respectively. If we specify “noLP” and “noGU” model is necessary but still not sufficient enough. This on RNAfold, our method performs better on all di- suggests incorporating further structural constraints will nucleotide shuffling and single nucleotide shuffling data- improve the effectiveness for ncRNA search on real sets with all four thresholds. genomes. We also compared TRIPLE, NUPACK, and RNAfold To roughly evaluate the speed of the three tools, the on some real genome background tests. Several genome running time for 101 sequences, including 1 real sequences from bacteria, archaea, and eukaryotes were miRNA sequence and its 100 single nucleotide shuffled retrieved from the NCBI database. Using these genome sequences, was measured on a Linux machine with an sequences, we created genome backgrounds for the 13 Intel dual-core CPU (E7500 2.93 GHz). Each sequence Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 5 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 Table 4 Comparisons of TRIPLE and RNAfold by the We note that there is only one exceptional case percentages of sequences falling in each category of a Z- observed from Table 1, 2, 3, 4: SRP whose entropy Z- score range. score performance was not improved (as much as other Dataset Method ≥2 (%) ≥1.5 (%) ≥1 (%) ≥0.5 (%) ncRNAs) when Z<1.5. The problem might have been Hh1 TRIPLE 6.67 33.33 53.33 73.33 caused by the implementation technique rather than the RNAfold 0.00 0.00 20.00 53.33 methodology. Most of the tested SRP RNA sequences sno_guide TRIPLE 14.91 25.43 41.10 57.95 (Eukaryotic and archaeal 7S RNAs) are of length around RNAfold 1.47 7.33 24.21 44.01 300 and contain about a dozen stems. In many of them, consecutive base pairs are broken by internal loops into sn_splice TRIPLE 31.65 43.04 56.96 65.82 small stem pieces, some having only two consecutive RNAfold 6.33 24.05 53.16 68.35 canonical pairs; whereas, in our SCFG implementation SRP TRIPLE 32.47 45.45 55.84 68.83 we simply required three consecutive base pairs as a RNAfold 5.19 29.87 59.74 77.92 must in a stem, possibly missing the secondary structure tRNA TRIPLE 24.07 45.31 64.25 79.47 of many of these sequences. This issue with the SCFG RNAfold 0.00 6.19 26.19 48.85 can be easily fixed, e.g., by replacing the SCFG with one intron TRIPLE 59.81 68.22 74.77 84.11 that better represents the constrained Boltzmann RNAfold 1.87 16.82 58.88 85.98 ensemble in which stems are all energetically stable. riboswitch TRIPLE 32.03 44.44 56.86 71.90 To ensure that the performance difference between RNAfold 1.31 20.92 49.67 71.24 TRIPLE and energy model based software (NUPACK miRNA TRIPLE 75.56 81.48 90.37 93.33 and RNAfold) was not due to the difference in the ther- RNAfold 0.74 10.37 69.63 97.78 modynamic energy model (Boltzmann ensemble) and telomerase TRIPLE 23.53 29.41 41.18 58.82 the simple statistical model (SCFG) with stacking rules, RNAfold 5.88 17.65 35.29 58.82 RNase TRIPLE 38.03 56.34 72.54 87.32 we also constructed two additional SCFG models, one RNAfold 1.41 15.49 35.92 61.27 for unconstrained base pairs and another requiring at regulatory TRIPLE 18.97 25.86 31.03 51.72 least two consecutive canonical base pairs in stems. RNAfold 0.00 5.17 32.76 67.24 Tests on these two models over the 13 ncRNA data set tmRNA TRIPLE 15.25 27.12 38.98 57.63 resulted in entropy Z-scores (data not shown) compar- RNAfold 0.00 11.86 35.59 45.76 able to those obtained by NUPACK and RNAfold but rRNA TRIPLE 34.09 47.31 64.88 79.96 inferior to the performance of TRIPLE. We attribute the RNAfold 1.86 17.98 37.60 57.64 impressive performance by TRIPLE to the constraint of “triple base pairs” satisfied by real ncRNA sequences but Random sequences were obtained with single nucleotide shuffling of the real ncRNA sequences. which is hard to achieve for random sequences. Since the entropy Z-score improvement by our has 100 nucleotides. TRIPLE, NUPACK, and RNAfold method was not uniform across the 13 ncRNAs, one spent 20.7 seconds, 36.2 seconds and 3.4 seconds, may want to look into additional other factors that respectively. We point out that TRIPLE has the poten- might have contributed to the under-performance of tial to be optimized for each specific grammar to certain ncRNAs. For example, the averaged GC contents improves its efficiency. are different in these 13 datasets, with SRP RNAs having 58% GC and standard deviation of 10.4%. A sequence Discussion with a high GC content is more likely to produce more This work introduced a modified ensemble of ncRNA spurious, alternative structures, possibly resulting in a secondary structures with the constraint of requiring higher base pairing entropy. However, since randomly only canonical base pairs to only occur and that stems shuffled sequences would also have the same GC con- must be energetically stable in all the alternative struc- tent, it becomes very difficult to determine if the entro- tures. The comparisons of performances between our pies of these sequences have been considerably affected program TRIPLE and energy model based software by the GC bias. Indeed, previous investigations [25] (NUPACK and RNAfold) implemented based on the have revealed that, while the base composition of a canonical structure ensemble have demonstrated a sig- ncRNA is related to the phylogenetic branches on which nificant improvement in the entropy measure for the specific ncRNA may be placed, it may not fully ncRNA fold certainty by our model. In particular, an explain the diverse performances of structure measures improvement of the entropy Z-scores was shown across on various ncRNAs. Notably it has been discovered that almost all 13 tested ncRNAs datasets previously used to base compositions are distinct in different parts of test various ncRNA measures [18]. rRNA secondary structure (stems, loops, bulges, and Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 6 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 junctions) [26], suggesting that an averaged base compo- model is constrained, however, to contain a smaller sition may not suitably represent the global structural space of equilibrium alternative structures, requiring behavior of an ncRNA sequence. there are only energetically stable stems (e.g., of free Technically the TRIPLE program was implemented energy levels under a threshold) to occur in the struc- with an SCFG that assumes stems to have at least three tures. The constraint is basically to consider the effect consecutive canonical base pairs. Yet, as we pointed out of energetically stable stems on tertiary folding and to earlier, the performance results should hold for a con- remove spurious structures that may not correspond to strained Boltzmann ensemble in which stems are a tertiary fold. According to the RNA folding pathway required to be energetically stable. This constraint of theory and the hierarchical folding model [27,28,30], stable stems was intended to capture the energetic stabi- building block helices are first stabilized by canonical lity of helical structures in the native tertiary fold base pairings before being arranged to interact with [27,28]. Since the ultimate distinction between a ncRNA each other or with unpaired strands through tertiary and a random sequence lies in its function (thus tertiary motifs (non-canonical nucleotide interactions). A typical structure); additional, critical tertiary characteristics may example is the multi-loop junctions in which one or be incorporated into the structure ensemble to further more pairs of coaxially stacked helices bring three or improve the fold certainty measure. In our testing of more regions together, further stabilized by the tertiary stem stability (see section “Energetically stable stems”), motifs at the junctions [31,32]. The helices involved are ncRNA sequences from the 51 datasets demonstrated stable before the junction is formed or any possible certain sequential properties that may characterize ter- nucleotide interaction modifications are made to the tiary interactions, e.g., coaxial stacking of helices. How- helical base pairs at the junction [33]. ever, to computationally model tertiary interactions, a model beyond a context-free system would be necessary; Energetically stable stems thus it would be difficult to use an SCFG or a Boltz- A stem is the atomic, structural unit of the new second- mann ensemble for this purpose. We need to develop ary structure space. To identify the energy levels of methods to identify tertiary contributions critical to the stemssuitabletobeincludedin thismodel,wecon- Shannon base pairing entropy measure and to model ducted a survey on the 51 sets of ncRNA seed align- such contributions. Although this method and technique ments, representatives of the ncRNAs in Rfam [34], have been developed with reference to non-coding which had been used with the software Infernal [35] as RNAs, it is possible that protein-coding mRNAs would benchmarks. From each ncRNA seed structural align- display similar properties, when sufficient structural ment, we computed the thermodynamic free energy of information about them has been gathered. everyinstanceofa stem in thealignmentdatausing various functions of the Vienna Package [12,29] as fol- Conclusions lows. RNAduplex was first applied to the two strands of We present work developing structure measures that the stem marked by the annotation to predict the opti- can effectively distinguish ncRNAs from random mal base pairings within the stem, then, the minimum sequences. We compute Shannon base pairing entropies free energy of the predicted stem structure, with over- based on a constrained secondary structure model that hangs removed, was computed with RNAeval. Figures 2 favors tertiary folding. Experimental results indicate that and 3 respectively show plots of the percentages and our approach significantly improves the Z-score of base cumulative percentages of free energy levels of stems in pairing Shannon entropies on 13 ncRNA datasets [18] these 51 ncRNA seed alignments. in comparison to that computed by NUPACK [23] and The peaks (with relatively high percentages) on the RNAfold [12,29]. These results shows that investigating percentage curve of Figure 2 indicate concentrations of secondary structure ensembles of ncRNAs is helpful for certain types of stems at energies levels around -4.5, developing effective structure-based ncRNA gene finding -3.3, and -2.4 kcal/mol. Since a G-U pair is counted methods. weakly towards the free energy contribution (by the Vienna package), we identified the peak value -4.5 kcal/ Method and model mol to be the free energy of stems of three base pairs, Our method to distinguish ncRNAs from random with two G-C pairs and one A-U in the middle or two sequences is based on measuring of the base pairing A-U pairs and one G-C in the middle. The value -3.3 Shannon entropy [15,16] under a new RNA secondary kcal/mol is the free energy of stems containing exactly structure model. The building blocks of this model are two G-C pairs or stems with one G-C pair followed by stems arranged in parallel and nested patterns con- two A-U pairs. Values around -2.4 kcal/mol are stems nected by unpaired strand segments, similar to those containing one G-C and an A-U pair or simply four A- U pairs. permitted by a standard ensemble [11,17,29]. The new Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 7 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 Figure 2 Percentages of free-energy of stems. Percentages of free-energy of stems from 51 Rfam datasets (percentages of stems with free- energy less than -12 are not given in this figure). 85.6% of the total while the group having distance 1 is Based on this survey, we were able to identify two energy thresholds: -3.4 and -4.6 kcal/mol for semi-stable 10.6%. Since zero distance between two stems may stems and stable stems respectively. Both require at least reflect a contiguous strand connecting two coaxially three base pairs of which at least one is G-C pair. We stacked helices in tertiary structure, our survey suggests further observed the difference between these two cate- a semi-stable stem interacts with another stem to main- gories of stems on the 51 ncRNA datasets. In general, tain even its own local stability. In the rest of this work, although levels of energy appear to be somewhat uni- we do not distinguish between stable and semi-stable formly distributed (see Figure 3), an overwhelmingly stems. In conducting this survey, we did not directly use large percentage of stems in both categories are located the stem structures annotated in the seed alignments to in the vicinity of other stems. In particular, 79.6% of compute their energies. Due to evolution, substantial stable stems(with afreeenergy-4.6kcal/molorlower) structural variation may occur across species; one stem have 0 (number of nucleotides) distance from their clo- maybe present in onesequenceand absent in another sest neighbor stem and 16.5% of stable stems have dis- but a structural alignment algorithm may try to align all tance 1 from their closest neighbors. For semi-stable sequences to the consensus stem, giving rise to “misa- stems, the group having zero distance to other stems is lignments” which we have observed [36]. Most of such Figure 3 Cumulative percentages of free-energy of stems. Cumulative percentages of free-energy of stems from 51 Rfam datasets (cumulative percentages of stems with free-energy less than -12 are not given in this figure). Note the step at -3.4. Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 8 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 “malformed stems” mistakenly aligned to the consensus Let p be the probability associated with the produc- often contain bulges or internal loops and have higher tion rule i,for i = 1, 2,...,7, respectively. Since the sum- free energies greater than the threshold -3.4 kcal/mol. mation of probabilities of rules with the same non- terminal on theleft-hand-sideisrequiredtobe1,we The RNA secondary structure model can establish the following equations: In the present study, a secondary structure model is p + p + p + p =1 ⎨ 1 2 3 4 defined with a Stochastic Context Free Grammar (SCFG) p + p =1 5 6 [37]. Our model requires there are at least three consecu- p =1 tive base pairs in every stem; the constraint is described with the following seven generic production rules: Let (1) X ® a (2) X ® aX (3) X ® aHb q = 0.25 × 0.25 × 0.17 × 0.17 × 0.08 × 0.08 bp (4) X ® aHbX (5) H ® aHb (6) H ® aYb (7) Y ® aXb be the geometric average of the six base pair probabil- ities. According to the principle of maximum entropy, where capital letters are non-terminal symbols that given we have no prior knowledge of a probability distri- define substructures and low case letters are terminals, bution, the assumption of a distribution with the maxi- each being one of the four nucleotides A, C, G, and U. mumentropy is thebestchoice, sinceitwilltakethe The starting non-terminal, X, can generate an smallest risk [40]. If we apply this principle to our pro- unpaired nucleotide or a base pair with the first three blem, the probability contribution from a base pair rules. The fourth rule generates two parallel substruc- should be close to the contribution from unpaired bases. tures. Non-terminal H is used to generate consecutive Rule probabilities can be estimated to satisfy following base pairs with non-terminal Y to generate the closing equations: base pair. Essentially, the process of generating a stem needs to recursively call production rules with the left- p = p ⎪ 1 2 hand-side non-terminals X, H and Y each at least once. p = p 3 4 3 6 This constraint guarantees that every stem has at least ⎪ (q ) × p × p × p = (0.25 × p ) bp 3 6 7 1 4 8 three consecutive base pairs, as required by our second- (q ) × p × p × p × p = (0.25 × p ) bp 3 5 6 7 1 ary structure model. From above equations, it follows that Probability parameter calculation p = 0.499 p = 0.499 p =0.001 1 2 3 There are two sets of probability parameters associated p =0.001 p =0.103 p = 0.897 4 5 6 with the induced SCFG. First, we used a simple scheme p =1 of probability settings for the unpaired bases and base pairs, with a uniform 0.25 probability for every base. The probability distribution of {0.25, 0.25, 0.17, 0.17, Computing base pairing Shannon entropy 0.08, 0.08} is given to the six canonical base pairs G-C, Based on the new RNA secondary structure model, we C-G, A-U, U-A, G-U, and U-G; a probability of zero is can compute the fold certainty of any given RNA given to all non-canonical base pairs. Alternatively, sequence, which is defined as the Shannon entropy mea- probabilities for unpaired bases and base pairs may be sured on base pairings formed by the sequence over the estimated from available RNA datasets with known sec- specified secondary structure space Ω. Specifically, let ondary structures [34], as has been done in some of the thesequencebe x = x x ... x of n nucleotides. For 1 2 n previously work with SCFGs [38,39]. indexes i<j, the probability P of base pairing between i,j Second, we computed the probabilities for the produc- bases x and x is computed with i j tion rules of the model as follows. To allow our method to be applicable to all structural ncRNAs, we did not P (x)= p(s, x)δ(x) i,j i,j (2) estimate the probabilities based on a training data set. s∈ In fact, we believe that the probability parameter setting of an SCFG for the fold certainty measure should be dif- where p(s, x) is the probability of x being folded into δ(x) ferent from that for fold stability measure (i.e., folding). to the structure s in the space Ω and is a binary i,j Based on the principle of maximum entropy, we devel- value indicator for the occurrence of base pair (x , x)in i j oped the following approach to calculate the probabil- structure s.The Shannonentropy of P (x) is computed i,j ities for the rules in our SCFG model. as [15,16] Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 9 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 Prob(S → aRbT, a = x , b = x )γ (R, S, T, i, j, x) i j Q(x)= − P (x)log P (x) i,j i,j (3) S→aRbT (4) i<j α(S ,1, n, x) To compute the expected frequency of the base pair- where ing, P (x), with formula(2),wetakeadvantage of the i,j Inside and Outside algorithms developed for SCFG [37]. γ (R, S, T, i, j, x)= α(R, i +1, j − 1, x) Given any nonterminal symbol S in the grammar, the j<k≤n inside probability is defined as × β(S, i, k, x) × α(T, j +1, k, x) α(S, i, j, x)= Prob(S⇒ x x ··· x ) i i+1 j in which variables S, R, T are for non-terminals and variable production S ® aRbT represents rules (3)~(7) i.e., the total probability for the sequence segment x x i i which involve base pair generations. For rules where T ... x to adopt alternative substructures specified by S. +1 j is empty, the summation and term a(T, j +1, k, x)do Assume S to be the initial nonterminal symbol for the not exist and k is fixed as j. SCFG model. Then a(S ,1, n, x) is the total probability The efficiency to compute P (x) mostly depends on of the sequence x’s folding under the model. i,j computing the Inside and Outside probabilities, which The outside probability is defined as can be accomplished with dynamic programming and β(S, i, j, x)= Prob(S ⇒ x ··· x Sx ··· x ) 0 1 i−1 j+1 n has the time complexity O(mn ) for a model of m non- terminals and rules and sequence length n. i.e., the total probability for the whole sequence x ... x to adopt all alternative substructures that allow the Acknowledgements sequence segment from position i to position j to adopt This research project was supported in part by NSF MRI 0821263, NIH BISTI any substructure specified by S (see Figure 4 for R01GM072080-01A1 grant, NIH ARRA Administrative Supplement to NIH illustration). BISTI R01GM072080-01A1, and NSF IIS grant of award No: 0916250. This article has been published as part of BMC Bioinformatics Volume 13 P (x) then can be computed as the normalized prob- i,j Supplement 5, 2012: Selected articles from the First IEEE International ability of the base pair (x , x ) occurring in all valid alter- i j Conference on Computational Advances in Bio and medical Sciences native secondary structures of x: (ICCABS 2011): Bioinformatics. The full contents of the supplement are Figure 4 Illustration of the application of the generic production rule. Illustration of the application of the generic production rule S ® aRbT that produces a base pair between positions i and j for the query sequence x, provided that the start non-terminal S derives x x ... x Sx 0 1 2 i-1 k ... x . Note that given i and j, the position of k can vary. +1 n Wang et al. BMC Bioinformatics 2012, 13(Suppl 5):S1 Page 10 of 10 http://www.biomedcentral.com/1471-2105/13/S5/S1 available online at http://www.biomedcentral.com/bmcbioinformatics/ 18. Freyhult E, Gardner P, Moulton V: A comparison of RNA folding measures. supplements/13/S5. BMC Bioinformatics 2005, 6(241). 19. Ding Y, Lawrence C: A statistical sampling algorithm for RNA secondary Author details structure prediction. Nucl Acids Res 2003, 31(24):7280-7301. Department of Computer Science, University of Georgia, Athens, Georgia 20. Walter A, Turner D, Kim J, Matthew H, Muller P, Mathews D, Zuker M: 30602, USA. Department of Plant Biology, University of Georgia, Athens, Coaxial stacking of helices enhances binding of oligoribonucleotides Georgia 30602, USA. Institute of Bioinformatics, University of Georgia, and improves predictions of RNA folding. Proc Natl Acad Sci USA 1994, Athens, Georgia 30602, USA. Center for Simulational Physics, University of 91(20):9218-9222. Georgia, Athens, Georgia 30602, USA. 21. Tyagi R, Mathews D: Predicting helical coaxial stacking in RNA multibranch loops. RNA 2007, 13(7):939-951. Authors’ contributions 22. Dirks R, Pierce N: An algorithm for computing nucleic acid base-pairing YW contributed to grammar design, algorithm development, program probabilities including pseudoknots. J Comput Chem 2004, 25:1295-1304. implementation, data acquisition, tests, result analysis, and manuscript 23. Dirks R, Bois J, Schaeffer J, Winfree E, Pierce N: Thermodynamic analysis of drafting. AM contributed to algorithm design and program implementation. interacting nucleic acid strands. SIAM Rev 2007, 49:65-88. PS and TIS contributed to data acquisition and tests. YWL participated in 24. Kolmogorov A: Sulla determinazione empirica di una legge di model discussion. RLM contributed to the supervision, data acquisition, distribuzione. G Inst Ital Attuari 1933, 4:83-91. results analyses, biological insights, and manuscript drafting. LC conceived 25. Schultes E, Hraber P, LaBean T: Estimating the contributions of selection the overall model and algorithm and drafted the manuscript. All authors and self-organization in secondary structure. Journal of Molecular read and approved the manuscript. Evolution 1999, 49:76-83. 26. Smit S, Yarus M, Knight B: Natural selection is not required to explain Competing interests universal compositional patterns in rRNA secondary structure categories. The authors declare that they have no competing interests. RNA 2006, 12:1-14. 27. Masquida B, Westhof E: A modular and hierarchical approach for all-atom Published: 12 April 2012 RNA modeling. In The RNA World.. 3 edition. Cold Spring Harbor Laboratory Press;Gesteland RF, Cech TR, Atkins JF 2006:659-681. 28. Tinoco I, Bustamante C: How RNA folds. Journal of Molecular Biology 1999, References 293(2):271-281. 1. Eddy S: Non-coding RNA genes and the modern RNA world. Nature 29. Hofacker IL: Vienna RNA secondary structure server. Nucleic Acids Research Reviews Genetics 2001, 2(12):919-929. 2003, 31(13):3429-3431. 2. Schattner P: Computational gene-finding for noncoding RNAs. In 30. Batey RT, Rambo RP, Doudna JA: Tertiary motifs in RNA structure and Noncoding RNAs: Molecular Biology and Molecular Medicine. Springer; folding. Angew Chem Int Ed Engl 1999, 38:2326-2343. Barciszewski, Erdmann 2003:33-48. 31. Lescoute A, Westhof E: Topology of three-way junctions in folded RNAs. 3. Griffiths-Jones S: Annotating noncoding RNA genes. Annu Rev Genomics RNA 2006, 12:83-93. Hum Genet 2007, 8:279-298. 32. Laing C, T S: Analysis of four-way junctions in RNA structures. J Mol Biol 4. Eddy S: Computational genomics of noncoding RNA genes. Cell 2002, 2009, 390(3):547-559. 109(2):137-140. 33. Thirumalai D: Native secondary structure formation in RNA may be a 5. Uzilov AV, Keegan JM, Mathews DH: Detection of non-coding RNAs on slave to tertiary folding. Proc Natl Acad Sci USA 1998, 95(20):11506-11508. the basis of predicted secondary structure formation free energy 34. Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy S, Bateman A: Rfam: change. BMC Bioinformatics 2006, 7:173. Annotating Non-Coding RNAs in Complete Genomes. Nucleic Acids 6. Machado-Lima A, del Portillo HA, AM D: Computational methods in Research 2005, 33:D121-D141. noncoding RNA research. Journal of Mathematical Biology 2008, 56(1- 35. Nawrocki E, Kolbe D, Eddy S: Infernal 1.0: inference of RNA alignments. 2):15-49. Bioinformatics 2009, 25(10):1335-1337. 7. Washietl S, Hofacker IL, Stadler PF: Fast and reliable prediction of 36. Huang Z, M M, Malmberg R, Cai L: RNAv: Non-coding RNA secondary noncoding RNAs. Proc Natl Acad Sci USA 2005, 102(7):2454-2459. structure variation search via graph Homomorphism. Proceedings of 8. Pedersen J, Bejerano G, Siepel A, Rosenbloom K, Lindblad-Toh K, Lander E, Computational Systems Bioinformatics; Stanford 2010, 56-68. Rogers J, Kent J, Miller W, Haussler D: Identification and classification of 37. Durbin R, Eddy S, Krogh A, Mitchison GJ: Biological Sequence Analysis: conserved RNA secondary structures in the human genome. PLoS Probabilistic Models of Proteins and Nucleic Acids Cambridge, UK: Cambridge Computational Biology 2006, 2(4):e33. University Press; 1998. 9. Turner D, Sugimoto N, Kierzek R, Dreiker S: Free energy increments for 38. Klein RJ, Eddy SR: RSEARCH: finding homologs of single structured RNA bydrogene bounds in nucleic acid base pairs. Journal of Am Chem Soc sequences. BMC Bioinformatics 2003, 4:44. 1987, 109:3783-3785. 39. Knudsen B, Hein J: Pfold: RNA secondary structure prediction using 10. Turner DH, Sugimoto N, Freier SM: RNA structure prediction. Annu Rev stochastic context-free grammars. Nucl Acids Res 2003, 31(13):3423-3428. Biophys Biophys Chem 1988, 17:167-192. 40. Jaynes ET: Prior probabilities. IEEE Transactions on Systems Science and 11. Zuker M, Steigler P: Optimal computer folding of larger RNA sequences Cybernetics 1968, 4(3):227-241. using thermodynamics and auxiliary information. Nucleic Acids Res 1981, 9:133-148. doi:10.1186/1471-2105-13-S5-S1 12. Hofacker I, Fontana W, Stadler P, Bonhoeffer L, tacker M, Schuster P: Fast Cite this article as: Wang et al.: Stable stem enabled Shannon entropies folding and comparison of RNA sequence structures. Monatsh Chem distinguish non-coding RNAs from random backgrounds. BMC 1994, 125:167-168. Bioinformatics 2012 13(Suppl 5):S1. 13. Moulton V: Tracking down noncoding RNAs. Proc Natl Acad Sci USA 2005, 102(7):2269-2270. 14. Bonnet E, Wuyts J, Rouzé P, Van de Peer Y: Evidence that microRNA precursors, unlike other non-coding RNAs, have lower folding free energies than random sequences. Bioinformatics 2004, 20(17):2911-2917. 15. Mathews D: Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA 2004, 10(8):1178-1190. 16. Huynen M, Gutell R, Konings D: Assessing the reliability of RNA folding using statistical mechanics. Journal of Molecular Biology 1997, 267:1104-1112. 17. McCaskill J: The equilibrium partition function and base pair probabilities for RNA secondary structure. Biopolymers 1990, 29(6-7):1105-1119.

Journal

BMC BioinformaticsSpringer Journals

Published: Apr 12, 2012

There are no references for this article.