Maximum expected accuracy structural neighbors of an RNA secondary structure

Peter Clote; Feng Lou; William Lorenz

doi:10.1186/1471-2105-13-S5-S6

Maximum expected accuracy structural neighbors of an RNA secondary structure

Clote, Peter; Lou, Feng; Lorenz, William 2012-04-12 00:00:00 Background: Since RNA molecules regulate genes and control alternative splicing by allostery, it is important to develop algorithms to predict RNA conformational switches. Some tools, such as paRNAss, RNAshapes and RNAbor, can be used to predict potential conformational switches; nevertheless, no existent tool can detect general (i.e., not family specific) entire riboswitches (both aptamer and expression platform) with accuracy. Thus, the development of additional algorithms to detect conformational switches seems important, especially since the difference in free energy between the two metastable secondary structures may be as large as 15-20 kcal/mol. It has recently emerged that RNA secondary structure can be more accurately predicted by computing the maximum expected accuracy (MEA) structure, rather than the minimum free energy (MFE) structure. Results: Given an arbitrary RNA secondary structure S for an RNA nucleotide sequence a = a ,..., a , we say that 0 1 n another secondary structure S of a is a k-neighbor of S , if the base pair distance between S and S is k. In this 0 0 paper, we prove that the Boltzmann probability of all k-neighbors of the minimum free energy structure S can be approximated with accuracy ε and confidence 1 - p, simultaneously for all 0 ≤ k<K, by a relative frequency count −1 over N sampled structures, provided that 2K , where F(z) is the cumulative distribution N > N(ε, p, K)= 4ε function (CDF) for the standard normal distribution. We go on to describe the algorithm RNAborMEA, which for an arbitrary initial structure S and for all values 0 ≤ k<K, computes the secondary structure MEA(k), having 3 2 maximum expected accuracy over all k-neighbors of S . Computation time is O(n · K ), and memory requirements are O(n · K). We analyze a sample TPP riboswitch, and apply our algorithm to the class of purine riboswitches. Conclusions: The approximation of RNAbor by sampling, with rigorous bound on accuracy, together with the computation of maximum expected accuracy k-neighbors by RNAborMEA, provide additional tools toward conformational switch detection. Results from RNAborMEA are quite distinct from other tools, such as RNAbor, RNAshapes and paRNAss, hence may provide orthogonal information when looking for suboptimal structures or conformational switches. Source code for RNAborMEA can be downloaded from http://sourceforge.net/ projects/rnabormea/ or http://bioinformatics.bc.edu/clotelab/RNAborMEA/. * Correspondence: [email protected]; [email protected] Department of Biology, Boston College, Chestnut Hill, MA 02467, USA Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud XI, 91405 Orsay cedex, France Full list of author information is available at the end of the article © 2012 Clote et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 2 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 In previous work [20,21], we described a novel pro- Background gram RNAbor to predict RNA conformational switches. RNA secondary structure conformational switches play For a given secondary structure S of a given RNA an essential role in a number of biological processes, sequence s, the secondary structure T of s is said to be such as regulation of viral replication [1] and of viroid a k-neighbor of S, if the base pair distance between S replication [2], regulation of R1 plasmid copy number in and T is k. (Base pair distance is the minimum number E. coli by hok/sok system [3], transcriptional and transla- of base pairs that must be either added or removed, in tional gene regulation in prokaryotes by riboswitches order to transform the structure S into T.) Given an [4], regulation of alternative splicing in eukaryotes [5], arbitrary initial structure S and stress-responsive gene regulation in humans [6], etc. , for all values 0 ≤ k<K,the program RNAbor[20], computes the secondary structure Due to the biological importance of conformational MFE(k), having minimum free energy over all k-neigh- switches, several groups have developed algorithms that bors of S . (Note that K ≤ 2· n,sincethebasepairdis- attempt to recognize switches - in particular, thermody- tance between any two secondary structures of a length namics-based methods such as paRNAss[7], RNA- n RNAsequenceisatmost2· n.) As well, RNAbor shapes[8], RNAbor[9], as well as an approach using computes for each value 0 ≤ k ≤ K, the Boltzmann prob- the second eigenvalue of the Laplacian matrix [10]. Z(k) ability ,where Z(k)isthe sumofall Boltzmann Riboswitches are portions of the 5’ untranslated region p = factors exp(-E(S)/RT) of all structures S having base pair (UTR) of messenger RNAs, experimentally known to distance k from an initially given structure S ,and regulate genes in bacteria by allostery [4], and to regu- where the partition function Z is the sum of all Boltz- late alternative splicing of the gene NMT1 in the eukar- mann factors of all secondary structures of the given yote Neurospora crassa [5]. Riboswitches are composed RNA sequence. Here E(S) is the free energy of second- of a 5’ aptamer and a 3’ expression platform.Since the ary structure S, with respect to the Turner energy aptamer binds to a specific ligand with high affinity (K -1 -1 model [22,23], R = 0.001987 kcal mol K is the uni- ≈ 5 nM), thus triggering the conformational change of versal gas constant, and T is absolute temperature. In the expression platform upon ligand binding [11], its thecasethat S is the minimum free energy structure, sequence and secondary structure tend to be highly con- 0 the existence of one or more ‘peaks’,orvalues k ≫ 0, served. In contrast, there is lower sequence for the where p is relatively large, suggests that there are two expression platform, which forms a bistable switch, k or more low energy structures having large base pair effecting gene regulation by premature abortion of tran- distance k from S - i.e., a potentially distinct meta- scription (as in guanine riboswitches [12]), or by seques- 0 stable structure, as shown in Figure 1. tering the Shine-Dalgarno sequence (as in thiamine pyrophosphate riboswitches [13]). Due to the conserved In [24], Do et al. introduced the notion of maximum expected accuracy (MEA) secondary structure, determined sequence and secondary structure within the aptamer, as follows: (i) compute base pairing probabilities p(i, j) all existent algorithms (to the best of our knowledge), using a trained stochastic context free grammar; (ii) com- such as [14-16], attempt only to detect riboswitch apta- q(i)= 1 − p(i, j) − p(j, i) pute probabilities mers, without the expression platform. In addition to i<j j<i that position i does not pair; (iii) using a dynamic pro- these specific algorithmic approaches, more general gramming algorithm similar to that of Nussinov and computational tools that rely on stochastic context free Jacobson [25], determine that secondary structure S having grammars,suchas Infernal[17] and CMFinder[18], 2α · p(i, j)+ βq maximum score i , where have been trained to recognize riboswitch aptamers; in (i,j)∈S iunpaired the first sum is over paired positions (i, j)of S and the sec- particular, Infernal was used to create the Rfam data- ond sum is over positions i located in loop regions of S, base [19], which includes 14 families of riboswitch and where a, b >0 are parameters with default values 1. aptamers. Subsequently Kiryu et al. [26] computed the MEA struc- Since current riboswitch detection algorithms do not ture by replacing the stochastic context free grammar attempt to predict the location of the expression plat- computation of base pairs in (i) by using McCaskill’s algo- form, we have developed tools, RNAbor-Sample and rithm [27], which computes the Boltzmann base pairing RNAborMEA, described in this paper, which yield infor- probabilities mation concerning alternative, or suboptimal, structures of a given RNA sequence. These tools can suggest the exp(−E(S)/RT) {S:(i,j)∈S} presence of a conformational switch; however, much p(i, j)= (1) exp(−E(S)/RT) more work must be done to actually produce a ribos- witch gene finder, part of thedifficultydueto thefact The sum in the numerator is taken over all secondary that riboswitch aptamers contain pseudoknots that can- structures of the given RNA sequence, that contain base not be captured by secondary structure. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 3 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 1 Output of RNAboron the 27 nt bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial structure S , the minimum free energy (MFE) structure....... ((((((((....)))))))) with free energy -10.3 kcal/mol. The 16-neighbor of S is the 0 0 metastable structure ((((((((....))))))))....... with free energy -9.9 kcal/mol. The MFE structure appears above the leftmost peak, while the MFE(16) structure appears above the rightmost peak. The output of RNAbor includes a graph of the Boltzmann probabilities p = , and MFE(k) structures, for all 0 ≤ k ≤ 2n. The existence of distinct ‘peaks’ suggests the presence of a conformational switch. pair (i, j), while the sum in the denominator is taken structure, in light of previously mentioned results over all secondary structures of the given RNA [26,28]. Surprisingly, it turns out that MEA(k) structures sequence. Thus p(i, j) is the sum of the Boltzmann fac- are quite different from MFE(k) structures, as shown tors of all secondary structures that contain the fixed later in one of the figures. base pair (i, j), divided by the partition function, which In this paper, we begin by showing rigorously how to latter is the sum of Boltzmann factors of all secondary approximate the output of RNAbor by frequency counts structures. In fact, Kiryu et al. [26] describe an algo- from sampling, using Sfold[30]. We then extend the rithm to compute the MEA structure common to all MEA technique to compute the maximum expected RNAs in a given alignment. Later, Lu et al. [28] redis- accuracy k-neighbor of a given RNA secondary structure covered Kiryu’s method; in addition, Lu et al. computed S ; i.e., that secondary structure which has maximum suboptimal MEA structures by implementing an analo- expected accuracy over all structures that differ from S gue of Zuker’s method [29]. by exactly k base pairs. By analyzing the family of purine Our motivation in developing both RNAbor-Sample riboswitches, obtained by retrieving full riboswitch and RNAborMEA, was to simplify and improve our pre- sequences (aptamer and expression platform) from cor- vious software, RNAbor, in detecting conformational responding EMBL genomic data, by extending the apta- switches. Since RNAbor computes the minimum free mers from the seed alignment of Rfam family RF00167 energy structure, MFE(k), over all structures having base [31], we show that our software RNAborMEA produces pair distance k from an initially given structure S ,a strikingly different results from other software that pro- complex portion of the code in RNAbor concerns the duce suboptimal structures (RNAbor, RNAbor-Sam- retrieval of free energy parameters from the Turner ple, RNAlocopt, RNAshapes, UNAFold). model [22,23]. The idea of RNAborMEA was to compute Since the detection of computational switches remains the base pairing probabilities p(i, j)-seeequation(1) - an open problem, despite the success of some tools by McCaskill’s algorithm using RNAfold, then to com- such as RNAshapes and RNAbor, we feel the addition pute the maximum expected accuracy structure, MEA of the tool RNAborMEA could prove useful, since it (k), which needs no retrieval of energy parameters, and appears to be orthogonal to all other methods of gener- which we hoped would be very similar to the MFE(k) ating suboptimal secondary structures. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 4 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 RNAbor-Sample and RNAborMEA.These tables and Results and discussion figures are only briefly described, and we refer the In this paper, we describe the following new results, dis- reader to the captions of the figures and tables, which cussed in the ‘Methods’ section in greater detail with explain the results in greater detail. attendant definitions of unexplained concepts. Figure 1 illustrates the presence of two peaks, corre- sponding to the Boltzmann probability of each of the 1. WedescribeaPythonscript RNAbor-Sample metastable structures for a 27 nt bistable switch pre- that approximates the output p = of RNAbor by ˆ viously considered by Hofacker et al. Figure 2 displays frequency counts p from sampled structures, for all the Boltzmann probabilities p 0 ≤ k ≤ 2n,using Sfold[30], or RNAsubopt -p from RNAbor,Boltz- mann probabilities estimates ˆ from RNAbor-Sample [32]. p for the SAM riboswitch aptamer with GenBank acces- 2. We prove that for any desired accuracy 0 < ε and sion code AP004597.1/11894-11904. Clearly, probability probability 0 < a <1, if at least estimates ˆ are close to actual values p .The figure k k −1 additionally shows probabilities r from our software 2K (2) Z (LO) N(ε, p, K)= k 2 RNAlocopt[34], computed by r = ,where Z(LO) 4ε Z(LO) is the sum of Boltzmann factors of all locally optimal secondary structures, and Z (LO)isthe sumofall locally optimal k-neighbors of S . A secondary structure structures are sampled, then S is said to be locally optimal, if its energy does not decrease by the addition or removal of a single (valid) P(|p − p | <ε) > 1 − α (3) k k base pair; i.e., E(S ∪ {(x, y)}) ≥ E(S), and E(S -{(x, y)}) ≥ E(S). Figure 3 displays the experimentally determined GENE ON and GENE OFF structures of an XPT gua- nine riboswitch from B. subtilis, taken from [35]. Figure for all 0 ≤ k<K; i.e., RNAbor-Sample furnishes 4 shows the outputs of RNAborMEA, RNAbor,and estimates p of p , for all 0 ≤ k<K, which with con- k k RNAshapes, which are most similar to the GENE ON fidence 1 - a are within ε of the actual values p . structure from the previous Figure 3. Figures 5 and 6 Here, F(z) is the cumulative distribution function determine the structural simlarity, as measured by the (CDF) for the standard normal distribution. program NestedAlign[36], between that structure 3. We develop an algorithm, RNAborMEA,running 3 2 2 output by RNAborMEA (as well as structures output by in time O(n · K )and space O(n · K), which com- RNAbor, RNAbor-Sample, RNAlocopt, RNAshapes, putes simultaneously for all 0 ≤ k ≤ K, the maximum and UNAFold), which are most similar to the XPT pur- expected accuracy k-neighbors of a given RNA sec- ine riboswitch, displayed in Figure 3. Figure 5 deter- ondary structure S ; i.e., for each 0 ≤ k ≤ K, RNA- mines the structural similarity to the GENE ON borMEA computes that structure S which has structure (left panel of Figure 3), while Figure 6 deter- maximum expected accuracy over all structures that mines the structural similarity to the GENE OFF struc- differ from S by exactly k base pairs. The algorithm ture (right panel of Figure 3). None of the structural RNAborMEA additionally computes, for each 0 ≤ k ≤ neighbors, or sampled structures, are identical to the K, the pseudo partition function GENE ON or GENE OFF structures; however, there are some candidates that bear some resemblance to those Z = exp(MEA(S)/RT). structures. At this point, we can say that RNAbor-Sam- {S:d (S,S )=k} BP 0 ple and RNAborMEA are methods that generate subop- timal structures, some of which may be similar to the metastable structures of a conformational switch; how- Moreover, RNAborMEA allows the user to stipulate ever, much additional work is necessary before a robust (partial) hard constraints, that stipulate whether par- method can be developed to detect conformational ticular nucleotides are unpaired, or base-pair with switches. certain other nucleotides. The implementation of Figure 7 shows that the MEA(k) structural neighbors, hard constraints follows ideas from Mathews [33], as computed by RNAborMEA, are very different than the albeit suitably modified to simultaneously consider MFE(k) structural neighbors, as computed by RNAbor. all k-neighbors, for 0 ≤ k ≤ K. At present, such computational experiments show RNA- borMEA computes suboptimal structures, which seem We now describe the 13 figures and 4 tables, corre- to share (chimeric) similarities between parts of low energy structures, but which themselves do not have sponding to computational experiments performed with Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 5 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 2 Boltzmann density plot for RNAbor, along with approximating relative frequency plots for RNAborMEAandRNAlocoptfor the 101 nt RNA sequence UACUUAUCAA GAGAGGUGGA GGGACUGGCC CGCUGAAACC UCAGCAACAG AACGCAUCUG UCUGUGCUAA AUCCUGCAAG CAAUAGCUUG AAAGAUAAGU U for the SAM riboswitch aptamter with GenBank accession code AP004597.1/118941- 119041. The program RNAbor computes the Boltzmann probability , where Z = exp(−E(S)/RT), where S is p = k {S:d (S,S )=k} 0 k BP 0 the initial structure (taken as the minimum free energy here). The script RNAbor-Sample calls Sfold on 1000 structures, in order to compute a relative frequence f ≈ p of all k-neighbors of S . Finally, we compute relative frequency of RNAlocopt[34], a program that samples only k k 0 locally optimal secondary structures, having the property that one cannot obtain a lower energy structure by adding or removing a single base pair. very low energies. Such suboptimal structures appear to algorithm Sfold[30]. Figure 13 displays the pseudo- be ‘orthogonal’ to those output by all other methods, Boltzmann probabilities p˜ = for two small RNA such as Sfold, RNAbor, RNAbor-Sample, RNAlo- sequences. While temperature T has a natural signifi- copt, RNAshapes, UNAFold). Figure 8 displays the cance, when computing Boltzmann probabilities p = , output of RNAborMEA, given the sequence of a TPP there is no natural meaning of temperature T,when riboswitch with EMBL accession code AF269819/1811- computing pseudo Boltzmann factors exp(MEA(S)/RT), 1669. In this instance, RNAborMEA found two low and indeed very different curves can be obtained with energy structures having large base pair distance from different temperatures. each other. (Other computational experiments did not We now briefly describe Tables 1, 2, 3, 4. Table 1 pro- yield such a good example.) Figure 9 displays the free vides some sample sizes N, computed by the formula energy and maximum expected accuracy scores, for from equation (2), for an ε approximation of Boltzmann each of the k-neighbors of the given TPP riboswitch probabilities p ,0 ≤ k<K,with1- a confidence level. sequence, just described in Figure 8. Figures 10 and 11 Tables 2 and 3 provide the numerical values for the ear- present the pseudocode for the RNAborMEA algorithm, lier described Figures 5 and 6, where the NestedA- which given an RNA sequence a , .. .,a and initial lign structural similarity is computed for the most 1 n structure S , computes the MEA(k) structure and pseudo similar k-neighbor, determined by RNAborMEA, RNA- partition function , for each 0 ≤ k ≤ K in time O(n · bor-Sample and RNAlocopt.Table 4presentsthe 2 2 K ) and space O(n · K). Figure 12 presents pseudocode number of times that each of the methods RNAborMEA, for the O(n ) algorithm to sample structures from the RNAbor, RNAbor-Sample, RNAlocopt, RNA- ensemble of structures having high MEA scores - a shapes, UNAFold output the most similar structure maximum expected accuracy analogue of the sampling to the GENE ON resp. GENE OFF structure for the Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 6 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 3 GENE ON (left) and GENE OFF (right) secondary structures for the 148 nt. XPT guanine riboswitch from B. subtilis with sequence CACUCAUAUA AUCGCGUGGA UAUGGCACGC AAGUUUCUAC CGGGCACCGU AAAUGUCCGA CUAUGGGUGA GCAAUGGAAC CGCACGUGUA CGGUUUUUUG UGAUAUCAGC AUUGCUUGCU CUUUAUUUGA GCGGGCAAUG CUUUUUUU. Sequence and secondary structure taken from [35]. The free energy of GENE ON resp. GENE OFF secondary structrure is -16.46 kcal/mol resp. -22.6 kcal/mol. Free energies were determined using RNAeval and figures produced using RNAplot, both from the Vienna RNA Package [40]. XPT purine riboswitch described in Figure 3. This com- low energy structures, but rather for maximum expected accuracy structures. putational experiment was performed for all RNA sequences in the seed alignment of the Rfam purine The figures and tables show, in summary, that RNA- riboswitch family RF00167 [31]. This table shows that borMEA provides useful suboptimal structures, which RNAborMEA and RNAbor both outperform any other may be closer to metastable structures of a conforma- method in determining structures similar to the GENE tional switch than more traditional methods, which rely OFF XPT structure; however, RNAborMEA uniquely on searching for low energy structures. outperforms all methods, including RNAbor, in deter- mining structures similar to the GENE ON XPT struc- Conclusions ture. One of the reasons for this excellent result is that We have applied the notion of maximum expected accu- unlike other methods, RNAborMEA does not look for racy within the context of structural neighbors of a Figure 4 Given riboswitch sequence X83878/168-267 and initial structure S , the minimum free energy structure, a structure output by RNAborMEAis most structurally similar to the XPT GENE ONstructure, as measured by NestedAlign[36]. The NestedAlign score for RNAborMEA is 87.5, while optimal score for RNAbor is 60.0, and for RNAshapes is 64.0. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 7 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 5 For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved downstream flanking residues from the appropriate EMBL files, in order to ensure likelihood that the expression platform was included. Then the following six programs were run: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. Each program outputs a number of near-optimal secondary structures, each according to different criteria. Taking RNAbor and RNAborMEA as examples, the programs RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S . Subsequently, we applied the program NestedAlign described in [36] to compute the structural similarity between the experimentally determined GENE ON structure for XPT guanine riboswitch of B. subtilis; i.e. the left panel of Figure 3. (Similar structures have positive scores; dissimilar structures have negative scores.) For each RNA in the seed alignment of RF00167, we determined the value k , such the MEA(k ) structure for that RNA has the 1 1 greatest structural similarity with the XPT GENE ON structure, as determined by NestedAlign. (See the left panel of Figure 3 for the experimentally determined GENE ON structure of XPT.) As earlier explained, we performed similar computations for the programs RNAshapes [39] and UNAFold [41], the programs RNAborMEA and RNAbor-Sample, described in this paper, and programs RNAbor[9] and RNAlocopt [34], developed by our lab. In 21 out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE OFF structure, as measured by NestedAlign. given RNA sequence a , .. ., a and structure S .Our Here, the expected accuracy score s is defined by 1 n 0 software RNAborMEA not only computes the structures σ (S)=2 · p + q MEA(k) having maximum expected accuracy over all i,j i (i,j)∈S iunpaired structures S, whose base pair distance d (S , S) is equal BP 0 to k.Inaddition, RNAborMEA allows the user to enter where first sum is taken over all base pairs (i, j) structural constraints, which specify partial secondary belonging to S, and the second sum is taken over all structures required of all MEA(k) structures, if so unpaired positions in S,and where p [resp. q ]isthe i,j i desired. Additionally, RNAborMEA computes an analo- probability that i, j are paired [resp. i is unpaired] in the gue of the temperature-dependent partition function, ensemble of low energy structures, as computed by defined by McCaskill’s algorithm [27]. Finally, RNAborMEA allows the user to sample structures from the maximum Z (T)= exp(σ (S))/RT expected accuracy ensemble, in a fashion analogous to {S:d (S ,S)=k} BP 0 Ding-Lawrence sampling from the low energy Boltz- and mann ensemble, as in Sfold[30]. Our preliminary investigations have not indicated a Z(T)= Z = exp(σ (S))/RT. clear application of the partition function analogue, k S though it maybeconstrued to providearepresentation Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 8 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 6 For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved downstream flanking residues from the appropriate EMBL files, in order to ensure likelihood that the expression platform was included. Then the following six programs were run: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. Each program outputs a number of near-optimal secondary structures, each according to different criteria. Taking RNAbor and RNAborMEA as examples, the programs RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S . Subsequently, we applied the program NestedAlign described in [36] to compute the structural similarity between the experimentally determined GENE OFF structure for XPT guanine riboswitch of B. subtilis; i.e. the right panel of Figure 3. (Similar structures have positive scores; dissimilar structures have negative scores.) For each RNA of the seed alignment of RF00167, we determined the value k , such the MEA(k ) structure for that RNA has the 1 1 greatest structural similarity with the XPT GENE OFF structure, as determined by NestedAlign. (See the right panel of Figure 3 for the experimentally determined GENE OFF structure of XPT.) As earlier explained, we performed similar computations for the programs RNAshapes [39] and UNAFold [41], the programs RNAborMEA and RNAbor-Sample, described in this paper, and RNAbor[9] and RNAlocopt[34]. In 22 out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE OFF structure, as measured by NestedAlign. of the temperature-dependent mixing of various struc- sourceforge.net/projects/rnabormea/ and http://bioinfor- matics.bc.edu/clotelab/RNAborMEA/. tures having large score s. On the other hand, in com- putational experiments reported in the Results Section, it appears that RNAborMEA produces near-optimal Methods structures that are closer to the biologically functional Preliminaries structures, in the case of conformational switches that Recall the definition of RNA secondary structure. are difficult to predict by any method. Definition 1 A secondary structure S on RNA Indeed, in 18 [resp. 11] out of 34 instances, RNAbor- sequence a , .. ., a is defined to be a set of ordered pairs 1 n MEA produced the secondary structure most structurally (i, j), such that 1 ≤ i<j ≤ n and the following are satis- similar to the experimentally determined XPT GENE fied. ON [resp. GENE OFF] structure, as measured by Nes- tedAlign[36]. See Table 4. Since there appears to be 1. Watson-Crick or GU wobble pairs: If (i, j) belongs little to no correlation between the structures MFE(k) to S, then pair (a , a ) must be one of the following i j output by RNAbor[20] and the structures MEA(k)out- canonical base pairs: (A, U), (U, A), (G, C), (C, G), put by our current program RNAborMEA, it appears (G, U), (U, G). that RNAborMEA yields a signal that is orthogonal and 2. Threshold requirement: If (i, j) belongs to S, then j complementary to that provided by state-of-the-art ther- - i> θ,where θ, generally taken to be equal to 3,is modynamics software, such as UNAFold, RNAfold, the minimum number of unpaired bases in a hairpin RNAstructure, Sfold, RNAshapes, RNAbor,etc. loop; i.e., there must be at least θ unpaired bases in For thesereasons,wefeelthat RNAborMEA has a cer- a hairpin loop. tain value, along with the programs UNAFold, RNA- 3. Nonexistence of pseudoknots: If (i, j) and (k, ℓ) fold, RNAstructure, Sfold, RNAshapes, belong to S, then it is not the case that i < k < j <ℓ. RNAbor, etc. when producing suboptimal structures. 4. No base triples: If (i, j) and (i, k) belong to S, then RNAborMEA is written in C and available at http:// j = k; if (i, j) and (k, j) belong to S, then i = k. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 9 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 7 Figure depicting the increasing divergence between RNAborand RNAborMEA. For each RNA sequence in the seed alignment from Rfam family RF00066 of U7 small nuclear RNAs, both RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S . We computed the base pair distance between the MFE(k) structure and the MEA(k) structure over all sequences in the seed alignment of RF00066. The figure displays the average ± one standard deviation of base pair distance. The preceding definition provides for an inductive as follows. If 0 ≤ d ≤ θ,then S = {∅};i.e., theonly i,i+d construction of the set of all secondary structures for a secondary structure for a , .. ., a is the empty struc- i i+d given RNA sequence a , .. ., a . For all values of d =0, .. ture containing no base pairs (due to the requirement 1 n ., n and all values of i =1, .. ., n - d, the collection that all hairpins contain at least θ unpaired bases). If d S of all secondary structures for a , .. ., a is defined > θ and has been defined by recursion for all i ≤ j< i,i+d i,j i i+d Figure 8 Sample outputs from RNAborMEAon a 143 nt TPP-riboswitch, AF269819/1811-1669 with sequence CUACUAGGGG AGCCAAAAGG CUGAGAUGAA UGUAUUCAGA CCCUUAUAAC CUGAUUUGGU UAAUACCAAC GUAGGAAAGU AGUUAUUAAC UAUUCGUCAU UGAGAUGUCU UGGUCUAACU ACUUUCUUCG CUGGGAAGUA GUU. We took the TPP riboswitch aptamer from the Rfam database [19], then extracted right-flanking nucleotides from the corresponding EMBL file, in order to include the expression platform. Displayed from left to right are the structures MEA(0) and MEA(61) (the structure MEA(52) is similar to that of MEA(61) and corresponds to a free energy local minimum in the left figure.) The structure MEA(61) had the highest MEA score over all structural neighbors, including the original structure S = MEA(0), and had free energy, -46.0 kcal/mol, that was equal to that of the initial structure S = MEA(0), which is the minimum free energy 0 0 structure for the given sequence. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 10 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 9 (Left) Free energy for all MEA(k) structural neighbors, 0 ≤ k ≤ 99, of the TPP-riboswitch, AF269819/1811-1669, described in the previous figure. Clearly, MEA(0) and MEA(61) have the least energy, - 46.0 kcal/mol, and MEA(61) has the largest MEA score, 134.555, of all secondary structures for the given RNA sequence. It is more common that the free energy of the MEA(k) structure is monotonically increasing as a function of k. (Right) MEA score for all MEA(k) structural neighbors, 0 ≤ k ≤ 99, of the TPP-riboswitch, AF269819/1811-1669, described in the previous figure. Clearly, MEA(61) has the largest MEA score, 134.555, of all secondary structures for the given RNA sequence. Figure 10 Initial portion of pseudocode for RNAborMEAalgorithm, which continues in Figure 11. Given RNA sequence s = s , .. .,s of 1 n length n, initial secondary structure S of s, RNAborMEA computes for all values of 0 ≤ k ≤ n that structure S with base pair distance k from S , 0 0 M(i, j, k)= 2αp + βq i,j i which maximizes the value . The pseudocode actually computes only values M(i, j, k) for all i, j, k; (i,j)∈S iunpaired in s 5 3 the MEA structures are obtained by backtracing. This algorithm clearly runs in O(n ) time with O(n ) space. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 11 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 11 Pseudocode for RNAborMEAalgorithm. Given RNA sequence s = s , .. ., s of length n, initial secondary structure S of s, 1 n 0 RNAborMEA computes for all values of 0 ≤ k ≤ n that structure S with base pair distance k from S , which maximizes the value M(i, j, k)= 2αp + βq i,j i . The pseudocode actually computes only values M(i, j, k) for all i, j, k; the MEA structures are (i,j)∈S iunpaired in s 3 3 obtained by backtracing. This algorithm clearly runs in O(n ) time with O(n ) space. Figure 12 Pseudocode for the O(n ) traceback computed by our RNAborMEAalgorithm. Note that run time could be reduced to O(n ln n) by applying the boustrephedonic method described in [42]. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 12 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 13 (Left) Pseudo-Boltzmann and uniform probabilities of structural neighbors MEA(k) for the 49 nt SECIS sequence fdhA, with nucleotide sequence CGCCACCCUG CGAACCCAAU AAUAAAAUAU ACAAGGGAGC AAGGUGGCG and where S is (((((((.(((... (((.................))).))).))))))). Here, the (formal) parameter RT taken to be 49 (length of sequence), in order to uniformize MEA scores to range (k) (k) between 0 and 1. The pseudo-Boltzmann probability is defined by , where (i) Z = Σexp(MEA(S)/RT), the sum being taken over all P (k)= Z (k) (k) (k) S such that d (S , S)= k, and (ii) Z = Σ Z . The uniform probability is defined by , where N is the number of k-neighbors of S BP 0 k P (k)= 0 and N is the total number of secondary structures. (Right) Pseudo-Boltzmann probabilities for MEA(k) structural neighbors of the 27 nt Vienna bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial (minimum free energy) structure.......((((((((....)))))))). The left curve is when RT = 0.6, the approximate value obtained by multiplying the universal gas constant 0.00198 kcal/mol times 310 Kelvin. In contrast, the right curve is when RT = 27 (length of sequence). Though not shown in this graph, the pseudo-Boltzmann distribution is identical with the uniform distribution, when RT = n, where n is sequence length. i + d, then then for any secondary structure S for a , .. .,a and i r-1 any secondary structure T for a , ..., a , the struc- r+1 j-1 � Any secondary structure of a , .. ., a is a sec- ture S ∪ T ∪ {(r, j)} is a secondary structure for a , .. i i+d-1 i ondary structure for a , .. ., a ,inwhich a is ., a . i i+d i+d i+d unpaired. � If a , a can form a Watson-Crick or wobble base Given two secondary structures S, T,wedefinethe i j pair, then for any secondary structure S for a , .. ., base pair distance between S, T, denoted by d (S, T), i+1 BP a , the structure S ∪ {(i, j)} is a secondary struc- to be the cardinality of the symmetric difference of S, T; i+d-1 ture for a , ..., a . i.e., d (S, T)= |(S - T) ∪ (T - S)|. i i+d BP � For any intermediate value i +1 ≤ r ≤ j - θ -1,if RNAbor-Sample a , a can form a Watson-Crick or wobble base pair, In this section, we describe how sampling from the r j Boltzmann ensemble, using Sfold[30], leads to a sim- ple and fast approximation of RNAbor computations, Table 1 Number of samples needed for high-confidence provided that the input consists of an RNA sequence approximation of Boltzmann probabilities and the minimum free energy (MFE) secondary struc- PK ε zN ture for that sequence. The work of this section is 0.05 1 0.01 1.45 9506 inspired by sampling approximations of the number of 0.05 100 0.01 3.48 30276 structural motifs, such as hairpins, multiloops, etc. of 0.05 1000000 0.01 5.45 74256 Ding and Lawrence [30], as well as the sampling 0.001 100 0.01 3.89 37830 approximation used in RNAshapes[8] for large 0.000001 100 0.01 5.73 82082 sequences. A novelty of our work is that we provide a 0.05 1 0.001 1.45 950600 rigorous justification for the accuracy of the approxima- 0.05 100 0.001 3.48 3027600 tion, depending on sample size. 2 Let RNAbor-Sample denote the protocol, where we −1 The number of samples sufficient to 2K apply Sfold[30] to sample N secondary structures S of N = N(ε, p, K)= 4ε an input RNA sequence a , .. .,a ,thensubsequently guarantee that |f - p |< ε with confidence 1 - p, for all 0 ≤ k<K, in the 1 n k k compute the relative frequencies f for 0 ≤ k<K,where application RNAbor-Sample.Here , the Boltzmann probability, as k p = f = is defined to be the number N of sampled computed exactly by RNAbor, for a k-neighbor of S , and f is the relative k k 0 k frequency of k-neighbors among N sampled structures. structures S, whose base pair distance with S is k, 0 Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 13 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Table 2 Comparison of NestedAlign similarity scores for the GENE ON structure of the XPT guanine riboswitch index EMBL RNAbor RNAborMEA RNAbor-Sample RNAlocopt RNAshapes UNAFold 0 AL591981/205922-205823 -9.0 5.0 -9.0 -8.5 -9.0 -9.0 1 CP000764/271074-271175 -43.5 5.0 -37.5 -44.5 -23.0 -53.0 2 CP000764/308099-308200 -27.0 -18.0 -24.5 -31.5 -25.5 -22.0 3 BA000028/760473-760574 -25.5 -0.5 -36.0 -38.5 -24.5 -31.0 4 CP000557/252200-252301 -9.5 8.5 -9.5 8.5 -10.0 -12.0 5 X83878/168-267 60.0 87.5 57.0 66.0 64.0 59.0 6 BA000004/1593074-1592973 35.0 16.5 -13.5 -21.5 -19.0 -13.5 7 AAOX01000023/19446-19345 -15.0 -2.0 -13.0 -18.5 -13.5 -15.5 8 CP000416/1798040-1798138 5.5 1.5 1.5 12.0 4.5 -4.5 9 CP000721/398929-399026 26.0 24.5 16.5 -20.0 21.5 -32.0 10 BA000028/1103943-1104044 1.0 1.5 2.0 -0.5 0.5 0.5 11 ABDQ01000002/251055-251152 -16.0 -2.5 -16.5 -21.5 -17.5 -22.5 12 AAXV01000026/31334-31233 11.5 6.0 -1.5 -8.5 22.0 -3.0 13 AE016877/298774-298875 -18.5 14.0 -17.5 -34.0 -12.0 -26.5 14 BA000004/676475-676576 -28.5 -31.0 -28.0 -69.0 -21.0 -29.5 15 AE017333/692981-693082 -1.5 2.5 -11.5 -9.5 -5.5 -53.0 16 AM180355/256217-256318 -17.0 -45.0 -45.5 -49.0 -48.0 -49.0 17 AM406671/1321062-1320965 -25.5 -15.0 -22.0 -28.5 -23.5 -23.5 18 CP000612/2598111-2598012 -42.0 -39.5 -42.0 -47.5 -39.0 -38.5 19 CP000002/697032-697134 -8.0 -11.0 -10.5 -10.0 -4.5 -7.5 20 CP000002/2295936-2295837 23.5 47.0 31.5 21.0 30.0 22.5 21 AL596170/223345-223246 -0.5 7.0 0.5 -8.5 -10.0 -10.0 22 ABDQ01000005/131908-131807 -33.0 -15.5 -31.5 -31.5 -19.0 -50.0 23 AAOX01000052/9069-8968 -13.5 1.5 -14.0 -21.0 -15.5 -14.5 24 AE017333/4024324-4024425 -29.5 -26.5 -33.5 -24.0 -23.5 -36.0 25 AP006627/1554717-1554818 -31.5 -1.5 -37.0 -44.5 -28.5 -43.5 26 CP000024/1182948-1183043 -0.5 -18.5 -9.0 4.0 2.0 -19.0 27 BA000028/786767-786867 -18.0 -41.5 -48.0 -46.5 -49.0 -44.5 28 ABDP01000002/29688-29587 -34.5 -42.5 -34.5 -37.0 -35.0 -50.0 29 BA000043/272473-272574 -9.5 4.0 -9.5 -10.0 -3.0 -12.5 30 CP000724/944285-944386 -30.5 -21.5 -30.5 -28.5 -26.5 -31.5 31 CP000764/1409725-1409826 14.0 -3.0 -18.0 -24.0 -11.5 -20.0 32 AAEK01000017/86437-86538 -44.5 -44.0 -41.5 -52.0 -35.0 -49.0 33 CP000764/357645-357544 11.0 -13.5 -33.0 -26.0 -18.5 -36.0 NestedAlign similarity scores between the GENE ON structure of the XPT guanine riboswitch of B. subtilis, experimentally determined using in-line probing (see [35]), and the structurally most similar secondary structure among near-optimal structures generated by each of the following six methods: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. These values are plotted in Figure 5, where more details on the computational experiment are given. divided by N.Since Sfold appears to be only available As usual, let Z denote the partition function, repre- as a web server, where the user can not stipulate the senting the sum of Boltzmann factors of all secondary number of sampled structures, we instead use the structures of a , .. ., a ; i.e., 1 n Vienna RNA Package implementation of Sfold,given Z = exp(−E(S)/RT) in RNAsubopt -p[32]. Let a , .. .,a be an arbitrary RNA sequence having 1 n MFE structure of S . Following [9], let Z denote the 0 k Z and let p = denote the Boltzmann probability of all sum of Boltzmann factors of all k-neighbors of S ; i.e., k-neighbors. Given a desired approximation accuracy of ε, a prob- Z = exp(−E(S)/RT). ability p, and an upper bound K, we wish to compute a d (S ,S)=k BP 0 value N = N(ε, p, K), such that whenever we sample at Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 14 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Table 3 Comparison of NestedAlign similarity scores for the GENE OFF structure of the XPT guanine riboswitch Index EMBL RNAbor RNAborMEA RNAbor-Sample RNAlocopt RNAshapes UNAFold 0 AL591981/205922-205823 27.5 28.5 28.5 25.5 25.5 25.5 1 CP000764/271074-271175 13.0 12.5 11.0 6.5 12.0 5.5 2 CP000764/308099-308200 24.0 26.0 26.5 23.0 24.5 26.5 3 BA000028/760473-760574 18.5 22.0 13.0 20.5 23.5 23.0 4 CP000557/252200-252301 7.0 8.0 7.0 10.0 6.5 4.5 5 X83878/168-267 143.0 143.5 143.0 141.0 143.0 141.0 6 BA000004/1593074-1592973 41.0 39.0 41.0 36.0 38.0 41.0 7 AAOX01000023/19446-19345 47.5 45.5 46.0 42.5 34.0 43.5 8 CP000416/1798040-1798138 17.5 12.5 12.5 13.0 11.5 12.5 9 CP000721/398929-399026 36.5 20.5 23.0 -38.5 34.5 -52.5 10 BA000028/1103943-1104044 32.0 29.5 32.0 27.5 30.5 30.0 11 ABDQ01000002/251055-251152 27.0 26.0 26.5 24.0 25.5 7.5 12 AAXV01000026/31334-31233 37.5 38.5 38.0 32.5 35.0 36.0 13 AE016877/298774-298875 24.0 25.5 23.0 19.0 23.0 22.5 14 BA000004/676475-676576 9.0 4.5 6.5 -35.5 5.0 9.0 15 AE017333/692981-693082 -30.0 -9.5 -23.5 -25.5 -17.0 -70.5 16 AM180355/256217-256318 -23.5 -24.0 -25.0 -27.0 -23.5 -27.0 17 AM406671/1321062-1320965 -0.5 3.5 1.0 -10.0 1.0 0.5 18 CP000612/2598111-2598012 -12.0 -9.0 -8.0 -8.5 -9.5 -9.0 19 CP000002/697032-697134 16.5 7.0 12.0 14.0 16.5 7.5 20 CP000002/2295936-2295837 75.0 73.0 75.5 71.0 72.0 69.5 21 AL596170/223345-223246 30.5 31.5 30.5 28.5 29.5 29.5 22 ABDQ01000005/131908-131807 12.5 3.0 13.0 10.5 13.5 4.5 23 AAOX01000052/9069-8968 12.5 14.5 13.5 11.0 12.0 12.0 24 AE017333/4024324-4024425 -3.5 2.5 3.5 6.0 -2.5 -1.5 25 AP006627/1554717-1554818 22.5 18.0 22.5 14.5 25.5 12.5 26 CP000024/1182948-1183043 6.0 7.0 6.5 6.0 5.0 6.0 27 BA000028/786767-786867 -23.5 -19.5 -23.0 -24.5 -21.0 -24.0 28 ABDP01000002/29688-29587 3.0 1.0 2.5 1.0 4.5 0.5 29 BA000043/272473-272574 17.5 12.5 12.5 13.5 12.5 11.5 30 CP000724/944285-944386 10.0 11.0 10.5 7.0 12.0 9.5 31 CP000764/1409725-1409826 32.5 36.0 32.0 26.5 35.0 30.5 32 AAEK01000017/86437-86538 11.5 11.5 13.0 8.0 13.0 11.0 33 CP000764/357645-357544 23.5 22.0 24.5 24.0 22.0 22.5 NestedAlign similarity scores between the GENE OFF structure of the XPT guanine riboswitch of B. subtilis, experimentally determined using in-line probing (see [35]), and the structurally most similar secondary structure among near-optimal structures generated by each of the following six methods: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. These values are plotted in Figure 6, where more details on the computational experiment are given. Table 4 Number of times that the most similar structure was produced Method greatest similarity to gene on greatest similarity to gene off RNAborMEA 18 11 RNAbor 711 RNAbor-Sample 28 RNAlocopt 32 RNAshapes 58 UNAFold 13 Number of times that the most similar structure to the GENE ON resp. GENE OFF structure of the B. subtilis XPT riboswitch was produced by each of the six methods investigated. Although the test was made with 34 sequences from the seed alignment of Rfam family RF00167 [31], the sums of each column may exceed 34; this is because If two or more methods produced the maximum score, then each was counted. Structural similarity was measured using the NestedAlign structural alignment algorithm [36]. While the GENE OFF structure involves a terminator loop, that is often correctly found by thermodynamics- based software, the GENE ON secondary structure, having higher free energy (hence less stable thermodynamically) is less likely to be found using thermodynamics-based approaches. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 15 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 least N secondary structures from the Boltzmann p (1 − p ) k k ε = z · . ensemble using Sfold,the relative frequency f of k- α/2 neighbors sampled is within ε of the probability p of k- neighbors, for all 0 ≤ k< K, with confidence level of (1 It follows that - p). Formally, this means that for each 0 ≤ k<K, 2 2 z z α/2 α/2 P(|f − p | <ε) ≥ 1 − p. N = N(α, ε)= ≥ · p (1 − p ) (4) k k k k 2 2 4ε ε Consider the value k as bin number.For afixed bin k, provides a sufficient lower bound on number of sam- let us denote the exact value of by p . If we sample N ples necessary to guarantee margin of error ε.Let α = structures, each falling in the kth bin with probability p , and define then the number of structures in the kth bin is given by the random variable X having binomial distribution p −1 p 2K 2K (6) with mean N · p and variance N · p (1 - p ). It follows N = N(ε, p, K)= = . k k k 4ε 4ε that the proportion of structures in the kth bin has mean µ = p and standard deviation We have just shown that for N ≥ N(ε, p, K), p (1−p ) −1 P k k 1 √ . To determine minimum sample , hence σ = < P(|z| > | |) < 2K K 2 N size sufficient to ensure a certain approximation accu- ⎛ ⎞ racy with certain confidence interval, we adapt a stan- |f − p | p p ⎜ k k ⎟ −1 dard argument from statistics [37] (see equation (24.35) P > | | < . ⎝ ⎠ p (1−p ) 2K K k k on p. 529), by approximating the binomial distribution by the standard normal distribution using Z-scores. Before starting, we mention that it will suffice for our The following is now a key step. If we have K bins, intended application of RNAbor-Sample to have a pre- and we desire to have a small probability p that we are cise approximation of those p which exceed some mod- off by more than ε in our estimate of the probability of est lower bound, such as δ =0.01or δ = 0.0001. Thus any bin (in our intended application, the kth bin, for 0 ≤ we intend to prove that for all 0 ≤ k<K, if p ≥ δ,then k<K, is the collection of k-neighbors of S , i.e., those Equation (4) holds. structures S, whose base pair distance with S is k)then Temporarily, we fix k.Let X be a Bernouilli random it sufficesthatwehaveaprobability that we are off variable with success probability p , corresponding to by more than ε in any single bin. Indeed, let Y denote the indicator random variable that returns 1 if a single the indicator random variable, with value 1 provided sampled secondary structure is a k-neighbor of S .Pro- that |f - p |> ε,where f is the relative frequency of k k k vided that we sample a number N of structures, which sampling a k-neighbor of S , after sampling N secondary satisfies N · p ≥ 30, then the standard normal distribu- structures, where by Equation (5), N is chosen so that tion can be used to approximate the left and right tail of the distribution of Z-scores of sampled proportions N(|f − p |) p k k P(|z| >ε)= P >ε < i=1 , defined by f = K p (1 − p ) k k k x − μ f − p N(f − p ) k k k k then z = = = . (5) σ p (1−p ) k k p (1 − p ) k k N P(Y ∨ ··· ∨ Y ) < K · p/K = p. 0 K−1 1 2 Let (z)= exp(−x /2)dx denote the cumu- Putting everything together, we have shown that for −∞ 2π lative distribution function (CDF) for the standard nor- given ε, p, K, we can define by defining N mal distribution. Given desired confidence interval of C −1 =1- a, recall that the critical value z is defined by ( ) a/2 2k (7) N = N(ε, p, K)= 4ε −1 −1 z = (1 − α/2) = | (α/2)|. α/2 we have If ε is the margin of error in the left tail (-∞,-z ) and a/2 ⎛ ⎡ ⎤ ⎞ right tail (z ,+∞) for the normal approximation of the a/2 |f − p | p ⎜ ⎢ k k ⎥ ⎟ −1 binomial distribution, then by a well-known argument P [∃0 ≤ k < K] > < p ⎝ ⎣ ⎦ ⎠ p (1−p ) 2K k k (e.g. equation (24.35) on p. 529 of [37]), we have N Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 16 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 We have completed a more rigorous argument using previous section. At the same time, the value M(j, i, k), Chernoff bounds, but prefer the exposition given here stored in the lower triangular portion of matrix M,will for simplicity. Note that the same argument, given ver- consist of a triple r, k , k of numbers, such that the fol- 0 1 batim, can be used for any binning procedure. In parti- lowing approximately holds (in this section, we provide cular, this approach provides information on sufficient the motivating idea; the actual algorithm description, number of samples in order to approximate the result of which deviates slightly from the description here, is RNAshapes [8,38,39] by means of sampling. given inthe next sectionand in Figures10and 11). (i) We can make some basic conclusions from the above If r =0then M(i, j, k) is maximized by a k-neighbor S analysis. The number of samples sufficient to ensure of S [i, j] for the subsequence a , .. ., a in which a is 0 i j j that |f - p |< ε for 0 ≤ k< K with confidence 1 - p is unpaired. In this case, k = k and k =0. (ii) If r = i, k k 0 1 reasonable, and only slightly increases for a higher num- then M(i, j, k) is maximized by a k-neighbor S of S [i, j] ber K of bins or to ensure greater confidence 1 - p. for the subsequence a , ...,a in which base pair (i, j) Î S. i j However, the number of samples increases greatly when In this case, k = 0 and k = k -1. (i) If i<r ≤ j - θ -1 0 1 higher precision estimates (smaller ε values) are needed, then M(i, j, k) is maximized by a k-neighbor S of S [i, j] even for one bin. for the subsequence a , .. .,a in which base pair (r, j) Î i j In the case of one bin, it is important to remember S.The left portionof S,which is S[i, r -1]willbea k that the value N is a conservative estimate, sufficient to neighbor of S[i, r - 1], while the right portion of S, ensure our conclusion. This estimate uses the worst which is S[r, j] must contain the base pair (r, j) and itself case of 50% probability of being in a bin, which maxi- be a k neighbor of S[r, j]. In summary, the values r, k , 1 0 mizes the standard deviation. For bins with small prob- k will be used in computing the traceback, where the ability, one can re-estimate what is needed. However, maximum expected accurate structure that is a k-neigh- for bins with smaller probability, higher precision is bor of S[i, j] will be constructed by one of the following: needed as well, unless all that is required is to verify (i) MEA k-neighbor of S[i, j - 1], in the event that a is that the bin has small probability. Also, clearly if a bin unpaired in [i, j]; (ii) MEA k -1-neighbor of S[i +1, j - -6 has probability of 10 , then at least on the order of one 1], in the event that a , a form abasepair; (iii) MEA i j million samples are needed, just for a reasonable prob- k -neighbor of S[i, r -1] and the MEA k -neighbor of S 0 1 ability of sampling the bin once. [r, j], where k + k = k, in the event that a , a form a 0 1 r j base pair. Algorithm description Pseudocode for the algorithm RNAborMEA is given in GivenanRNA sequence a = a , .. ., a , a secondary Figures 10 and 11. An array M of size n× n×Kmax is 1 n structure S of a, and a maximum desired value Kmax ≤ required to store the MEA scores in M(i, j, k)for all n,the RNAborMEA algorithm computes, for each 1 ≤ i subsequences [i, j] and all base pair distances 0 ≤ k ≤ <j ≤ n and each 0 ≤ k <Kmax ≤ n, the maximum score Kmax between structures S[i, j] and initially given struc- M(i, j, k) ture S [i, j]. For 1 ≤ i ≤ j ≤ n and all 0 ≤ k ≤ Kmax,the pseudocode in Figure 11 stores a value of the form (x, y, 2αp + βq z) in the lower triangular portion, M(j, i, k), of the array. i,j i (i,j)∈S iunpaired Here, x = 0 indicates that the optimal structure on [i, j], i.e., having maximum MEA score over all k-neighbors of where the first sum is taken over all base pairs (i, j) S [i, j], is obtained by not pairing j with any nucleotide belonging to S, the second sum is taken over all in [i, j]; for values x>0, hence x Î [i, j - θ - 1], the opti- unpaired positions in S,and where p [resp. q ]isthe i,j i mal k-neighbor of S [i, j] is obtained by pairing x with j. probability that i, j are paired [resp. i is unpaired] in the The values y, z correspond to the values k , k ,such 0 1 ensemble of low energy structures, and a, b >0are that: (i) if x = 0, then the optimal k-neighbor of S [i, j] weights. Our computational experiments, as in Figure 9, is obtained by first computing the optimal k -neighbor were carried out with default values of 1 for a, b.(See of S [i, j - 1], where k = k - b , then leaving j unpaired; 0 0 0 Equation 1 for the formal definition of Boltzmann base (ii) if x = i, then the optimal k-neighbor of S [i, j]is pairing probability p .) i,j obtained by first computing the optimal k -neighbor of The dynamic programming computation of M(i, j, k) S [i +1, j -1], where k = k - b , then adding the 0 1 1 is performed by recursion on increasing values of j - i enclosing base pair (i, j); (iii) if x = r Î [i +1, j - θ -1], for all values 1 ≤ i ≤ j ≤ n and 0 ≤ k ≤ Kmax. The value then the optimal k-neighbor of S [i, j] is obtained by of M(i, j, k), stored in the upper triangular portion of first computing the optimal k -neighbor of S [i, r -1]as 0 0 matrix M, will involve taking the maximum over three well as the optimal k -neighbor of S [r +1, j - 1], then 1 0 cases, which correspond to the inductive construction of adding the base pair (r, j). This last calculation must be all secondary structures on a , .. ., a , as described in the i j done over all values k , k such that k + k = k.Using 0 1 0 1 Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 17 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 3. Franch T, Gultyaev AP, Gerdes K: Programmed Cell Death by hok/sok of the values M(j, i, k)=(x, y, z), the traceback can be Plasmid R1: Processing at the hok mRNA 3H-end Triggers Structural easily computed by recursion; see Figure 12 for pseudo- Rearrangements that Allow Translation and Antisense RNA Binding. J code of traceback. Mol Biol 1997, 273:38-51. 4. Mandal M, Boese B, Barrick J, Winkler W, Breaker R: Riboswitches control Inamannersimilar to thepseudocodeofFigures 10 fundamental biochemical pathways in Bacillus subtilis and other and 11 (essentially, one replaces the operation of taking bacteria. Cell 2003, 113(5):577-586. the maximum by the a summation, and one replaces the 5. Cheah MT, Wachter A, Sudarsan N, Breaker RR: Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature 2007, MEA score by the pseudo-Boltzmann factor exp(MEA 447(7143):497-500. (S)/RT)), RNAborMEA also computes the pseudo-Boltz- 6. Ray PS, Jia J, Yao P, Majumder M, Hatzoglou M, Fox PL: A stress-responsive mann partition function values RNA switch regulates VEGFA expression. Nature 2009, 457(7231):915-919. 7. Voss B, Meyer C, Giegerich R: Evaluating the predictability of (k) conformational switching in RNA. Bioinformatics 2004, 20(10):1573-1582. Z = exp(MEA(S/RT)). i,j 8. Voss B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of {S∈S :d (S ,S)=k) i,j BP 0 RNA shapes. BMC Biol 2006, 4(5). 9. Freyhult E, Moulton V, Clote P: Boltzmann probability of RNA structural (k) 1,n We have graphed the Boltzmann probabilities as neighbors and riboswitch detection. Bioinformatics 2007, 23(16):2054-2062, (k) 1,n N (k) Doi: 10.1093/bioinformatics/btm314. 1,n well as the uniform probabilities ,where is the 1,n 1,n 10. Barash D: Second eigenvalue of the Laplacian matrix for predicting RNA number of k-neighbors, and N is the total number of 1,n conformational switch by mutation. Bioinformatics 2004, 20(12):1861-1869. secondary structures. When RT = n, which normalizes 11. Mandal M, Breaker RR: Adenine riboswitches and gene activation by disruption of a transcription terminator. Nat Struct Mol Biol 2004, 11:29-35. the MEA score to a maximum of 1, it appears that the 12. Serganov A, Yuan Y, Pikovskaya O, Polonskaia A, Malinina L, Phan A, Boltzmann distribution is the same as the uniform dis- Hobartner C, Micura R, Breaker R, Patel D: Structural Basis for tribution, as shown in Figure 13. Discriminative Regulation of Gene Expression by Adenine- and Guanine- Sensing mRNAs. Chem Biol 2004, 11(12):1729-1741. 13. Serganov A, Polonskaia A, Phan AT, Breaker RR, Patel DJ: Structural basis for gene regulation by a thiamine pyrophosphate-sensing riboswitch. Acknowledgements Nature 2006, 441(7097):1167-1171. Funding for the research of P. Clote and F. Lou was provided by the Digiteo 14. Abreu-Goodger C, Merino E: RibEx: A web server for locating riboswitches Foundation for the project RNAomics. Additional funding was provided to P. and other conserved bacterial regulatory elements. Nucleic Acids Res Clote by National Science Foundation grants DMS-1016618 and DMS- 2005, 33:W690-W692. 0817971. Any opinions, findings, and conclusions or recommendations 15. Bengert P, Dandekar T: Riboswitch finder - A tool for identification of expressed in this material are those of the authors and do not necessarily riboswitch RNAs. Nucleic Acids Res 2004, 32:W154-W159. reflect the views of the National Science Foundation. 16. Chang TH, Huang HD, Wu LC, Yeh CT, Liu BJ, Horng JT: Computational This article has been published as part of BMC Bioinformatics Volume 13 identification of riboswitches based on RNA conserved functional Supplement 5, 2012: Selected articles from the First IEEE International sequences and conformations. RNA 2009, 15(7). Conference on Computational Advances in Bio and medical Sciences 17. Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA (ICCABS 2011): Bioinformatics. The full contents of the supplement are alignments. Bioinformatics 2009, 25(10):1335-1337. available online at http://www.biomedcentral.com/bmcbioinformatics/ 18. Weinberg Z, Barrick JE, Yao Z, Roth A, Kim JN, Gore J, Wang JX, Lee ER, supplements/13/S5. Block KF, Sudarsan N, Neph S, Tompa M, Ruzzo WL, Breaker RR: Identification of 22 candidate structured RNAs in bacteria using the Author details 1 CMfinder comparative genomics pipeline. Nucleic Acids Res 2007, Department of Biology, Boston College, Chestnut Hill, MA 02467, USA. 2 35(14):4809-4819. Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud XI, 3 19. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, 91405 Orsay cedex, France. Laboratoire d’Informatique (LIX), Ecole 4 Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR, Bateman A: Rfam: Polytechnique, 91128 Palaiseau, France. Department of Mathematics and updates to the RNA families database. Nucleic Acids Res 2009, Computer Science, Denison University, Granville, OH 43023-0810, USA. 37(Database):D136-D140. 20. Freyhult E, Moulton V, Clote P: Boltzmann probability of RNA structural Authors’ contributions neighbors and riboswitch detection. Bioinformatics 2007, 23(16):2054-2062. RNAbor-Sample was conceived by W.A.L., who provided a proof for 21. Freyhult E, Moulton V, Clote P: RNAbor: a web server for RNA structural sample size sufficient to ensure a desired degree of accuracy with a desired neighbors. Nucleic Acids Res 2007, 35(Web):W305-W309. level of confidence. RNAborMEA was conceived by P.C., then designed and 22. Matthews D, Sabina J, Zuker M, Turner D: Expanded sequence programmed by P.C. (prototype implementation in Python) and F.L. dependence of thermodynamic parameters improves prediction of RNA (implementation in C). Computational experiments were carried out by F.L., secondary structure. J Mol Biol 1999, 288:911-940. figures were created by F.L. and P.C. and the manuscript was written by P.C. 23. Xia T, SantaLucia JJ, Burkard M, Kierzek R, Schroeder S, Jiao X, Cox C, Turner D: Thermodynamic parameters for an expanded nearest-neighbor Competing interests model for formation of RNA duplexes with Watson-Crick base pairs. The authors declare that they have no competing interests. Biochemistry 1999, 37:14719-35. 24. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic Published: 12 April 2012 consistency-based multiple sequence alignment. Genome Res 2005, 15(2):330-340. References 25. Nussinov R, Jacobson AB: Fast Algorithm for Predicting the Secondary 1. Olsthoorn R, Mertens S, Brederode F, Bol J: A conformational switch at the Structure of Single Stranded RNA. Proceedings of the National Academy of 3’ end of a plant virus RNA regulates viral replication. EMBO J 1999, Sciences, USA 1980, 77(11):6309-6313. 18:4856-4864. 26. Kiryu H, Kin T, Asai K: Robust prediction of consensus secondary 2. Repsilber D, Wiese S, Rachen M, Schroder A, Riesner D, Steger G: Formation structures using averaged base pairing probability matrices. of metastable RNA structures by sequential folding during transcription: Bioinformatics 2007, 23(4):434-441. time-resolved structural analysis of potato spindle tuber viroid 27. McCaskill J: The equilibrium partition function and base pair binding (-)-stranded RNA by temperature-gradient gel. RNA 1999, 5:574-584. probabilities for RNA secondary structure. Biopolymers 1990, 29:1105-1119. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 18 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 28. Lu ZJ, Gloor JW, Mathews DH: Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA 2009, 15(10):1805-1813. 29. Zuker M: On finding all suboptimal foldings of an RNA molecule. Science 1989, 244(7):48-52. 30. Ding Y, Lawrence CE: A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 2003, 31:7280-7301. 31. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res 2011, 39(Database):D141-D145. 32. Gruber A, Lorenz R, Bernhart S, Neubock R, Hofacker I: The Vienna RNA websuite. Nucleic Acids Research 2008, 36:70-74. 33. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 2004, 101(19):7287-7292. 34. Lorenz W, Clote P: Computing the partition function for kinetically trapped RNA secondary structures. Public Library of Science One (PLoS ONE) 2011, 6:316178, Doi:10.1371/journal.pone.0016178. 35. Wakeman CA, Winkler WC, Dann C: Structural features of metabolite- sensing riboswitches. Trends Biochem Sci 2007, 32(9):415-424. 36. Blin G, Denise A, Dulucq S, Herrbach C, Touz H: Alignments of RNA structures. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010. 37. Zar J: Biostatistical Analysis Prentice-Hall, Inc; 1999. 38. Giegerich R, Voss B, Rehmsmeier M: Abstract shapes of RNA. Nucleic Acids Res 2004, 32(16):4843-4851. 39. Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 2006, 22(4):500-503. 40. Hofacker I: Vienna RNA secondary structure server. Nucleic Acids Res 2003, 31:3429-3431. 41. Markham NR, Zuker M: UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 2008, 453:3-31. 42. Ponty Y: Efficient sampling of RNA secondary structures from the Boltzmann ensemble of low-energy: The boustrophedon method. J Math Biol 2008, 56(1-2):107-127. doi:10.1186/1471-2105-13-S5-S6 Cite this article as: Clote et al.: Maximum expected accuracy structural neighbors of an RNA secondary structure. BMC Bioinformatics 2012 13 (Suppl 5):S6. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color ﬁgure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals http://www.deepdyve.com/lp/springer-journals/maximum-expected-accuracy-structural-neighbors-of-an-rna-secondary-PPaQq60Syo

Loading next page...

References (53)

Eric Nawrocki, D. Kolbe, S. Eddy (2009)
Infernal 1.0: inference of RNA alignments
Bioinformatics, 25 10
Chuong Do, Mahathi Mahabhashyam, M. Brudno, S. Batzoglou (2005)
ProbCons: Probabilistic consistency-based multiple sequence alignment.
Genome research, 15 2
Tzu-Hao Chang, Hsien-Da Huang, Li-Ching Wu, Chi-Ta Yeh, Baw-Jhiune Liu, Jorng-Tzong Horng (2009)
Computational identification of riboswitches based on RNA conserved functional sequences and conformations.
RNA, 15 7
P. Gardner, J. Daub, J. Tate, B. Moore, Isabelle Osuch, S. Griffiths-Jones, R. Finn, Eric Nawrocki, D. Kolbe, S. Eddy, A. Bateman (2010)
Rfam: Wikipedia, clans and the “decimal” release
Nucleic Acids Research, 39
D. Mathews, M. Disney, Jessica Childs, S. Schroeder, M. Zuker, D. Turner (2004)
Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure.
Proceedings of the National Academy of Sciences of the United States of America, 101 19
(1999)
Biostatistical Analysis Prentice-Hall, Inc
R. Nussinov, A. Jacobson (1980)
Fast algorithm for predicting the secondary structure of single-stranded RNA.
Proceedings of the National Academy of Sciences of the United States of America, 77 11
Maumita Mandal, R. Breaker (2004)
Adenine riboswitches and gene activation by disruption of a transcription terminator
Nature Structural &Molecular Biology, 11
I. Hofacker (2003)
Vienna RNA secondary structure server
Nucleic acids research, 31 13
D. Mathews, Jeffrey Sabina, M. Zuker, D. Turner (1999)
Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure.
Journal of molecular biology, 288 5
H. Kiryu, Taishin Kin, K. Asai (2007)
Robust prediction of consensus secondary structures using averaged base pairing probability matrices
Bioinformatics, 23 4
D. Barash (2004)
Second eigenvalue of the Laplacian matrix for predicting RNA conformational switch by mutation
Bioinformatics, 20 12
E Freyhult, V Moulton, P Clote (2007)
RNAbor: a web server for RNA structural neighbors
Nucleic Acids Res, 35
Ye Ding, C. Lawrence (2003)
A statistical sampling algorithm for RNA secondary structure prediction.
Nucleic acids research, 31 24
(2012)
Maximum expected accuracy structural neighbors of an RNA secondary structure
Catherine Wakeman, W. Winkler, C. Dann (2007)
Structural features of metabolite-sensing riboswitches.
Trends in biochemical sciences, 32 9
M. Cheah, A. Wachter, Narasimhan Sudarsan, R. Breaker (2007)
Control of alternative RNA splicing and gene expression by eukaryotic riboswitches
Nature, 447
Ye Ding, C. Chan, C. Lawrence (2005)
RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble.
RNA, 11 8
J. McCaskill (1990)
The equilibrium partition function and base pair binding probabilities for RNA secondary structure
Biopolymers, 29
Björn Voß, R. Giegerich, Marc Rehmsmeier (2006)
Complete probabilistic analysis of RNA shapes
BMC Biology, 4
Y. Ponty (2007)
Efficient sampling of RNA secondary structures from the Boltzmann ensemble of low-energy
Journal of Mathematical Biology, 56
Nicholas Markham, M. Zuker (2008)
UNAFold: software for nucleic acid folding and hybridization.
Methods in molecular biology, 453
Université Paris-Sud XI, 91405 Orsay cedex, France. 3 Laboratoire d'Informatique (LIX), Ecole Polytechnique
P. Ray, J. Jia, P. Yao, Mithu Majumder, M. Hatzoglou, P. Fox (2008)
A stress-responsive RNA switch regulates VEGF expression
Nature, 457
A. Serganov, Yu-Ren Yuan, O. Pikovskaya, A. Polonskaia, L. Malinina, A. Phan, Claudia Hobartner, R. Micura, R. Breaker, D. Patel (2004)
Structural basis for discriminative regulation of gene expression by adenine- and guanine-sensing mRNAs.
Chemistry & biology, 11 12
R. Olsthoorn, S. Mertens, F. Brederode, J. Bol (1999)
A conformational switch at the 3′ end of a plant virus RNA regulates viral replication
The EMBO Journal, 18
S. Miyazawa (1995)
A reliable sequence alignment method based on probabilities of residue correspondences.
Protein engineering, 8 10
Z. Weinberg, Jeffrey Barrick, Zizhen Yao, Adam Roth, Jane Kim, J. Gore, J. Wang, Elaine Lee, Kirsten Block, Narasimhan Sudarsan, Shane Neph, M. Tompa, W. Ruzzo, R. Breaker (2007)
Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline
Nucleic Acids Research, 35
R. Giegerich, B. Voß, Marc Rehmsmeier (2004)
Abstract shapes of RNA.
Nucleic acids research, 32 16
B. Voß, Carsten Meyer, R. Giegerich (2004)
Evaluating the predictability of conformational switching in RNA
Bioinformatics, 20 10
Author details 1
P. Gardner, J. Daub, J. Tate, Eric Nawrocki, D. Kolbe, S. Lindgreen, A. Wilkinson, R. Finn, S. Griffiths-Jones, S. Eddy, A. Bateman (2008)
Rfam: updates to the RNA families database
Nucleic Acids Research, 37
P Steffen, B Voss, M Rehmsmeier, J Reeder, R Giegerich (2006)
RNAshapes: an integrated RNA analysis package based on abstract shapes
Bioinformatics, 22
J Zar (1999)
Biostatistical Analysis
BIOINFORMATICS APPLICATIONS NOTE doi:10.1093/bioinformatics/btk010 Sequence analysis RNAshapes: an integrated RNA analysis package based on abstract shapes
T. Franch, A. Gultyaev, Kenn Gerdes (1997)
Programmed cell death by hok/sok of plasmid R1: processing at the hok mRNA 3'-end triggers structural rearrangements that allow translation and antisense RNA binding.
Journal of molecular biology, 273 1
E. Freyhult, V. Moulton, P. Clote (1996)
Deep sea microbes and DNA cloning.
Environmental Health Perspectives, 104
G Blin, A Denise, S Dulucq, C Herrbach, H Touz (2010)
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Y Ponty (2008)
Efficient sampling of RNA secondary structures from the Boltzmann ensemble of low-energy: The boustrophedon method
J Math Biol, 56
Peter Bengert, T. Dandekar (2004)
Riboswitch finder tool for identification of riboswitch RNAs
Nucleic acids research, 32 Web Server issue
G. Blin, A. Denise, S. Dulucq, Claire Herrbach, H. Touzet (2010)
Alignments of RNA Structures
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 7
D. Repsilber, S. Wiese, M. Rachen, A. Schröder, D. Riesner, G. Steger (1999)
Formation of metastable RNA structures by sequential folding during transcription: time-resolved structural analysis of potato spindle tuber viroid (-)-stranded RNA by temperature-gradient gel electrophoresis.
RNA, 5 4
(1999)
Zar J: Biostatistical Analysis
T. Xia, J. SantaLucia, M. Burkard, R. Kierzek, S. Schroeder, Xiaoqi Jiao, C. Cox, D. Turner (1998)
Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson-Crick base pairs.
Biochemistry, 37 42
Maumita Mandal, B. Boese, Jeffrey Barrick, W. Winkler, R. Breaker (2003)
Riboswitches Control Fundamental Biochemical Pathways in Bacillus subtilis and Other Bacteria
Cell, 113
C. Abreu-Goodger, E. Merino (2005)
RibEx: a web server for locating riboswitches and other conserved bacterial regulatory elements
Nucleic Acids Research, 33
A. Gruber, R. Lorenz, S. Bernhart, R. Neuböck, I. Hofacker (2008)
The Vienna RNA Websuite
Nucleic Acids Research, 36
Z. Lu, Jason Gloor, D. Mathews (2009)
Improved RNA secondary structure prediction by maximizing expected pair accuracy.
RNA, 15 10
E. Freyhult, V. Moulton, P. Clote (2007)
Boltzmann probability of RNA structural neighbors and riboswitch detection
Bioinformatics, 23 16
5 -15.0 -22.0 -28.5 -23.5 -23.5 18 CP000612/2598111-2598012 -42.0 -39.5 -42.0 -47.5 -39.0 -38.5 19 CP000002/697032-697134 -8.0 -11.0 -10.5 -10
A. Serganov, A. Polonskaia, A. Phan, R. Breaker, D. Patel (2006)
Structural basis for gene regulation by a thiamine pyrophosphate-sensing riboswitch
Nature, 441
W. Lorenz, P. Clote (2011)
Computing the Partition Function for Kinetically Trapped RNA Secondary Structures
PLoS ONE, 6
M. Zuker (1989)
On finding all suboptimal foldings of an RNA molecule.
Science, 244 4900

Publisher: Springer Journals
Copyright: Copyright © 2012 by Clote et al.; licensee BioMed Central Ltd.
Subject: Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN: 1471-2105
DOI: 10.1186/1471-2105-13-S5-S6
pmid: 22537010
Publisher site: See Article on Publisher Site

Abstract

Background: Since RNA molecules regulate genes and control alternative splicing by allostery, it is important to develop algorithms to predict RNA conformational switches. Some tools, such as paRNAss, RNAshapes and RNAbor, can be used to predict potential conformational switches; nevertheless, no existent tool can detect general (i.e., not family specific) entire riboswitches (both aptamer and expression platform) with accuracy. Thus, the development of additional algorithms to detect conformational switches seems important, especially since the difference in free energy between the two metastable secondary structures may be as large as 15-20 kcal/mol. It has recently emerged that RNA secondary structure can be more accurately predicted by computing the maximum expected accuracy (MEA) structure, rather than the minimum free energy (MFE) structure. Results: Given an arbitrary RNA secondary structure S for an RNA nucleotide sequence a = a ,..., a , we say that 0 1 n another secondary structure S of a is a k-neighbor of S , if the base pair distance between S and S is k. In this 0 0 paper, we prove that the Boltzmann probability of all k-neighbors of the minimum free energy structure S can be approximated with accuracy ε and confidence 1 - p, simultaneously for all 0 ≤ k<K, by a relative frequency count −1 over N sampled structures, provided that 2K , where F(z) is the cumulative distribution N > N(ε, p, K)= 4ε function (CDF) for the standard normal distribution. We go on to describe the algorithm RNAborMEA, which for an arbitrary initial structure S and for all values 0 ≤ k<K, computes the secondary structure MEA(k), having 3 2 maximum expected accuracy over all k-neighbors of S . Computation time is O(n · K ), and memory requirements are O(n · K). We analyze a sample TPP riboswitch, and apply our algorithm to the class of purine riboswitches. Conclusions: The approximation of RNAbor by sampling, with rigorous bound on accuracy, together with the computation of maximum expected accuracy k-neighbors by RNAborMEA, provide additional tools toward conformational switch detection. Results from RNAborMEA are quite distinct from other tools, such as RNAbor, RNAshapes and paRNAss, hence may provide orthogonal information when looking for suboptimal structures or conformational switches. Source code for RNAborMEA can be downloaded from http://sourceforge.net/ projects/rnabormea/ or http://bioinformatics.bc.edu/clotelab/RNAborMEA/. * Correspondence: [email protected]; [email protected] Department of Biology, Boston College, Chestnut Hill, MA 02467, USA Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud XI, 91405 Orsay cedex, France Full list of author information is available at the end of the article © 2012 Clote et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 2 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 In previous work [20,21], we described a novel pro- Background gram RNAbor to predict RNA conformational switches. RNA secondary structure conformational switches play For a given secondary structure S of a given RNA an essential role in a number of biological processes, sequence s, the secondary structure T of s is said to be such as regulation of viral replication [1] and of viroid a k-neighbor of S, if the base pair distance between S replication [2], regulation of R1 plasmid copy number in and T is k. (Base pair distance is the minimum number E. coli by hok/sok system [3], transcriptional and transla- of base pairs that must be either added or removed, in tional gene regulation in prokaryotes by riboswitches order to transform the structure S into T.) Given an [4], regulation of alternative splicing in eukaryotes [5], arbitrary initial structure S and stress-responsive gene regulation in humans [6], etc. , for all values 0 ≤ k<K,the program RNAbor[20], computes the secondary structure Due to the biological importance of conformational MFE(k), having minimum free energy over all k-neigh- switches, several groups have developed algorithms that bors of S . (Note that K ≤ 2· n,sincethebasepairdis- attempt to recognize switches - in particular, thermody- tance between any two secondary structures of a length namics-based methods such as paRNAss[7], RNA- n RNAsequenceisatmost2· n.) As well, RNAbor shapes[8], RNAbor[9], as well as an approach using computes for each value 0 ≤ k ≤ K, the Boltzmann prob- the second eigenvalue of the Laplacian matrix [10]. Z(k) ability ,where Z(k)isthe sumofall Boltzmann Riboswitches are portions of the 5’ untranslated region p = factors exp(-E(S)/RT) of all structures S having base pair (UTR) of messenger RNAs, experimentally known to distance k from an initially given structure S ,and regulate genes in bacteria by allostery [4], and to regu- where the partition function Z is the sum of all Boltz- late alternative splicing of the gene NMT1 in the eukar- mann factors of all secondary structures of the given yote Neurospora crassa [5]. Riboswitches are composed RNA sequence. Here E(S) is the free energy of second- of a 5’ aptamer and a 3’ expression platform.Since the ary structure S, with respect to the Turner energy aptamer binds to a specific ligand with high affinity (K -1 -1 model [22,23], R = 0.001987 kcal mol K is the uni- ≈ 5 nM), thus triggering the conformational change of versal gas constant, and T is absolute temperature. In the expression platform upon ligand binding [11], its thecasethat S is the minimum free energy structure, sequence and secondary structure tend to be highly con- 0 the existence of one or more ‘peaks’,orvalues k ≫ 0, served. In contrast, there is lower sequence for the where p is relatively large, suggests that there are two expression platform, which forms a bistable switch, k or more low energy structures having large base pair effecting gene regulation by premature abortion of tran- distance k from S - i.e., a potentially distinct meta- scription (as in guanine riboswitches [12]), or by seques- 0 stable structure, as shown in Figure 1. tering the Shine-Dalgarno sequence (as in thiamine pyrophosphate riboswitches [13]). Due to the conserved In [24], Do et al. introduced the notion of maximum expected accuracy (MEA) secondary structure, determined sequence and secondary structure within the aptamer, as follows: (i) compute base pairing probabilities p(i, j) all existent algorithms (to the best of our knowledge), using a trained stochastic context free grammar; (ii) com- such as [14-16], attempt only to detect riboswitch apta- q(i)= 1 − p(i, j) − p(j, i) pute probabilities mers, without the expression platform. In addition to i<j j<i that position i does not pair; (iii) using a dynamic pro- these specific algorithmic approaches, more general gramming algorithm similar to that of Nussinov and computational tools that rely on stochastic context free Jacobson [25], determine that secondary structure S having grammars,suchas Infernal[17] and CMFinder[18], 2α · p(i, j)+ βq maximum score i , where have been trained to recognize riboswitch aptamers; in (i,j)∈S iunpaired the first sum is over paired positions (i, j)of S and the sec- particular, Infernal was used to create the Rfam data- ond sum is over positions i located in loop regions of S, base [19], which includes 14 families of riboswitch and where a, b >0 are parameters with default values 1. aptamers. Subsequently Kiryu et al. [26] computed the MEA struc- Since current riboswitch detection algorithms do not ture by replacing the stochastic context free grammar attempt to predict the location of the expression plat- computation of base pairs in (i) by using McCaskill’s algo- form, we have developed tools, RNAbor-Sample and rithm [27], which computes the Boltzmann base pairing RNAborMEA, described in this paper, which yield infor- probabilities mation concerning alternative, or suboptimal, structures of a given RNA sequence. These tools can suggest the exp(−E(S)/RT) {S:(i,j)∈S} presence of a conformational switch; however, much p(i, j)= (1) exp(−E(S)/RT) more work must be done to actually produce a ribos- witch gene finder, part of thedifficultydueto thefact The sum in the numerator is taken over all secondary that riboswitch aptamers contain pseudoknots that can- structures of the given RNA sequence, that contain base not be captured by secondary structure. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 3 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 1 Output of RNAboron the 27 nt bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial structure S , the minimum free energy (MFE) structure....... ((((((((....)))))))) with free energy -10.3 kcal/mol. The 16-neighbor of S is the 0 0 metastable structure ((((((((....))))))))....... with free energy -9.9 kcal/mol. The MFE structure appears above the leftmost peak, while the MFE(16) structure appears above the rightmost peak. The output of RNAbor includes a graph of the Boltzmann probabilities p = , and MFE(k) structures, for all 0 ≤ k ≤ 2n. The existence of distinct ‘peaks’ suggests the presence of a conformational switch. pair (i, j), while the sum in the denominator is taken structure, in light of previously mentioned results over all secondary structures of the given RNA [26,28]. Surprisingly, it turns out that MEA(k) structures sequence. Thus p(i, j) is the sum of the Boltzmann fac- are quite different from MFE(k) structures, as shown tors of all secondary structures that contain the fixed later in one of the figures. base pair (i, j), divided by the partition function, which In this paper, we begin by showing rigorously how to latter is the sum of Boltzmann factors of all secondary approximate the output of RNAbor by frequency counts structures. In fact, Kiryu et al. [26] describe an algo- from sampling, using Sfold[30]. We then extend the rithm to compute the MEA structure common to all MEA technique to compute the maximum expected RNAs in a given alignment. Later, Lu et al. [28] redis- accuracy k-neighbor of a given RNA secondary structure covered Kiryu’s method; in addition, Lu et al. computed S ; i.e., that secondary structure which has maximum suboptimal MEA structures by implementing an analo- expected accuracy over all structures that differ from S gue of Zuker’s method [29]. by exactly k base pairs. By analyzing the family of purine Our motivation in developing both RNAbor-Sample riboswitches, obtained by retrieving full riboswitch and RNAborMEA, was to simplify and improve our pre- sequences (aptamer and expression platform) from cor- vious software, RNAbor, in detecting conformational responding EMBL genomic data, by extending the apta- switches. Since RNAbor computes the minimum free mers from the seed alignment of Rfam family RF00167 energy structure, MFE(k), over all structures having base [31], we show that our software RNAborMEA produces pair distance k from an initially given structure S ,a strikingly different results from other software that pro- complex portion of the code in RNAbor concerns the duce suboptimal structures (RNAbor, RNAbor-Sam- retrieval of free energy parameters from the Turner ple, RNAlocopt, RNAshapes, UNAFold). model [22,23]. The idea of RNAborMEA was to compute Since the detection of computational switches remains the base pairing probabilities p(i, j)-seeequation(1) - an open problem, despite the success of some tools by McCaskill’s algorithm using RNAfold, then to com- such as RNAshapes and RNAbor, we feel the addition pute the maximum expected accuracy structure, MEA of the tool RNAborMEA could prove useful, since it (k), which needs no retrieval of energy parameters, and appears to be orthogonal to all other methods of gener- which we hoped would be very similar to the MFE(k) ating suboptimal secondary structures. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 4 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 RNAbor-Sample and RNAborMEA.These tables and Results and discussion figures are only briefly described, and we refer the In this paper, we describe the following new results, dis- reader to the captions of the figures and tables, which cussed in the ‘Methods’ section in greater detail with explain the results in greater detail. attendant definitions of unexplained concepts. Figure 1 illustrates the presence of two peaks, corre- sponding to the Boltzmann probability of each of the 1. WedescribeaPythonscript RNAbor-Sample metastable structures for a 27 nt bistable switch pre- that approximates the output p = of RNAbor by ˆ viously considered by Hofacker et al. Figure 2 displays frequency counts p from sampled structures, for all the Boltzmann probabilities p 0 ≤ k ≤ 2n,using Sfold[30], or RNAsubopt -p from RNAbor,Boltz- mann probabilities estimates ˆ from RNAbor-Sample [32]. p for the SAM riboswitch aptamer with GenBank acces- 2. We prove that for any desired accuracy 0 < ε and sion code AP004597.1/11894-11904. Clearly, probability probability 0 < a <1, if at least estimates ˆ are close to actual values p .The figure k k −1 additionally shows probabilities r from our software 2K (2) Z (LO) N(ε, p, K)= k 2 RNAlocopt[34], computed by r = ,where Z(LO) 4ε Z(LO) is the sum of Boltzmann factors of all locally optimal secondary structures, and Z (LO)isthe sumofall locally optimal k-neighbors of S . A secondary structure structures are sampled, then S is said to be locally optimal, if its energy does not decrease by the addition or removal of a single (valid) P(|p − p | <ε) > 1 − α (3) k k base pair; i.e., E(S ∪ {(x, y)}) ≥ E(S), and E(S -{(x, y)}) ≥ E(S). Figure 3 displays the experimentally determined GENE ON and GENE OFF structures of an XPT gua- nine riboswitch from B. subtilis, taken from [35]. Figure for all 0 ≤ k<K; i.e., RNAbor-Sample furnishes 4 shows the outputs of RNAborMEA, RNAbor,and estimates p of p , for all 0 ≤ k<K, which with con- k k RNAshapes, which are most similar to the GENE ON fidence 1 - a are within ε of the actual values p . structure from the previous Figure 3. Figures 5 and 6 Here, F(z) is the cumulative distribution function determine the structural simlarity, as measured by the (CDF) for the standard normal distribution. program NestedAlign[36], between that structure 3. We develop an algorithm, RNAborMEA,running 3 2 2 output by RNAborMEA (as well as structures output by in time O(n · K )and space O(n · K), which com- RNAbor, RNAbor-Sample, RNAlocopt, RNAshapes, putes simultaneously for all 0 ≤ k ≤ K, the maximum and UNAFold), which are most similar to the XPT pur- expected accuracy k-neighbors of a given RNA sec- ine riboswitch, displayed in Figure 3. Figure 5 deter- ondary structure S ; i.e., for each 0 ≤ k ≤ K, RNA- mines the structural similarity to the GENE ON borMEA computes that structure S which has structure (left panel of Figure 3), while Figure 6 deter- maximum expected accuracy over all structures that mines the structural similarity to the GENE OFF struc- differ from S by exactly k base pairs. The algorithm ture (right panel of Figure 3). None of the structural RNAborMEA additionally computes, for each 0 ≤ k ≤ neighbors, or sampled structures, are identical to the K, the pseudo partition function GENE ON or GENE OFF structures; however, there are some candidates that bear some resemblance to those Z = exp(MEA(S)/RT). structures. At this point, we can say that RNAbor-Sam- {S:d (S,S )=k} BP 0 ple and RNAborMEA are methods that generate subop- timal structures, some of which may be similar to the metastable structures of a conformational switch; how- Moreover, RNAborMEA allows the user to stipulate ever, much additional work is necessary before a robust (partial) hard constraints, that stipulate whether par- method can be developed to detect conformational ticular nucleotides are unpaired, or base-pair with switches. certain other nucleotides. The implementation of Figure 7 shows that the MEA(k) structural neighbors, hard constraints follows ideas from Mathews [33], as computed by RNAborMEA, are very different than the albeit suitably modified to simultaneously consider MFE(k) structural neighbors, as computed by RNAbor. all k-neighbors, for 0 ≤ k ≤ K. At present, such computational experiments show RNA- borMEA computes suboptimal structures, which seem We now describe the 13 figures and 4 tables, corre- to share (chimeric) similarities between parts of low energy structures, but which themselves do not have sponding to computational experiments performed with Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 5 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 2 Boltzmann density plot for RNAbor, along with approximating relative frequency plots for RNAborMEAandRNAlocoptfor the 101 nt RNA sequence UACUUAUCAA GAGAGGUGGA GGGACUGGCC CGCUGAAACC UCAGCAACAG AACGCAUCUG UCUGUGCUAA AUCCUGCAAG CAAUAGCUUG AAAGAUAAGU U for the SAM riboswitch aptamter with GenBank accession code AP004597.1/118941- 119041. The program RNAbor computes the Boltzmann probability , where Z = exp(−E(S)/RT), where S is p = k {S:d (S,S )=k} 0 k BP 0 the initial structure (taken as the minimum free energy here). The script RNAbor-Sample calls Sfold on 1000 structures, in order to compute a relative frequence f ≈ p of all k-neighbors of S . Finally, we compute relative frequency of RNAlocopt[34], a program that samples only k k 0 locally optimal secondary structures, having the property that one cannot obtain a lower energy structure by adding or removing a single base pair. very low energies. Such suboptimal structures appear to algorithm Sfold[30]. Figure 13 displays the pseudo- be ‘orthogonal’ to those output by all other methods, Boltzmann probabilities p˜ = for two small RNA such as Sfold, RNAbor, RNAbor-Sample, RNAlo- sequences. While temperature T has a natural signifi- copt, RNAshapes, UNAFold). Figure 8 displays the cance, when computing Boltzmann probabilities p = , output of RNAborMEA, given the sequence of a TPP there is no natural meaning of temperature T,when riboswitch with EMBL accession code AF269819/1811- computing pseudo Boltzmann factors exp(MEA(S)/RT), 1669. In this instance, RNAborMEA found two low and indeed very different curves can be obtained with energy structures having large base pair distance from different temperatures. each other. (Other computational experiments did not We now briefly describe Tables 1, 2, 3, 4. Table 1 pro- yield such a good example.) Figure 9 displays the free vides some sample sizes N, computed by the formula energy and maximum expected accuracy scores, for from equation (2), for an ε approximation of Boltzmann each of the k-neighbors of the given TPP riboswitch probabilities p ,0 ≤ k<K,with1- a confidence level. sequence, just described in Figure 8. Figures 10 and 11 Tables 2 and 3 provide the numerical values for the ear- present the pseudocode for the RNAborMEA algorithm, lier described Figures 5 and 6, where the NestedA- which given an RNA sequence a , .. .,a and initial lign structural similarity is computed for the most 1 n structure S , computes the MEA(k) structure and pseudo similar k-neighbor, determined by RNAborMEA, RNA- partition function , for each 0 ≤ k ≤ K in time O(n · bor-Sample and RNAlocopt.Table 4presentsthe 2 2 K ) and space O(n · K). Figure 12 presents pseudocode number of times that each of the methods RNAborMEA, for the O(n ) algorithm to sample structures from the RNAbor, RNAbor-Sample, RNAlocopt, RNA- ensemble of structures having high MEA scores - a shapes, UNAFold output the most similar structure maximum expected accuracy analogue of the sampling to the GENE ON resp. GENE OFF structure for the Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 6 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 3 GENE ON (left) and GENE OFF (right) secondary structures for the 148 nt. XPT guanine riboswitch from B. subtilis with sequence CACUCAUAUA AUCGCGUGGA UAUGGCACGC AAGUUUCUAC CGGGCACCGU AAAUGUCCGA CUAUGGGUGA GCAAUGGAAC CGCACGUGUA CGGUUUUUUG UGAUAUCAGC AUUGCUUGCU CUUUAUUUGA GCGGGCAAUG CUUUUUUU. Sequence and secondary structure taken from [35]. The free energy of GENE ON resp. GENE OFF secondary structrure is -16.46 kcal/mol resp. -22.6 kcal/mol. Free energies were determined using RNAeval and figures produced using RNAplot, both from the Vienna RNA Package [40]. XPT purine riboswitch described in Figure 3. This com- low energy structures, but rather for maximum expected accuracy structures. putational experiment was performed for all RNA sequences in the seed alignment of the Rfam purine The figures and tables show, in summary, that RNA- riboswitch family RF00167 [31]. This table shows that borMEA provides useful suboptimal structures, which RNAborMEA and RNAbor both outperform any other may be closer to metastable structures of a conforma- method in determining structures similar to the GENE tional switch than more traditional methods, which rely OFF XPT structure; however, RNAborMEA uniquely on searching for low energy structures. outperforms all methods, including RNAbor, in deter- mining structures similar to the GENE ON XPT struc- Conclusions ture. One of the reasons for this excellent result is that We have applied the notion of maximum expected accu- unlike other methods, RNAborMEA does not look for racy within the context of structural neighbors of a Figure 4 Given riboswitch sequence X83878/168-267 and initial structure S , the minimum free energy structure, a structure output by RNAborMEAis most structurally similar to the XPT GENE ONstructure, as measured by NestedAlign[36]. The NestedAlign score for RNAborMEA is 87.5, while optimal score for RNAbor is 60.0, and for RNAshapes is 64.0. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 7 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 5 For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved downstream flanking residues from the appropriate EMBL files, in order to ensure likelihood that the expression platform was included. Then the following six programs were run: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. Each program outputs a number of near-optimal secondary structures, each according to different criteria. Taking RNAbor and RNAborMEA as examples, the programs RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S . Subsequently, we applied the program NestedAlign described in [36] to compute the structural similarity between the experimentally determined GENE ON structure for XPT guanine riboswitch of B. subtilis; i.e. the left panel of Figure 3. (Similar structures have positive scores; dissimilar structures have negative scores.) For each RNA in the seed alignment of RF00167, we determined the value k , such the MEA(k ) structure for that RNA has the 1 1 greatest structural similarity with the XPT GENE ON structure, as determined by NestedAlign. (See the left panel of Figure 3 for the experimentally determined GENE ON structure of XPT.) As earlier explained, we performed similar computations for the programs RNAshapes [39] and UNAFold [41], the programs RNAborMEA and RNAbor-Sample, described in this paper, and programs RNAbor[9] and RNAlocopt [34], developed by our lab. In 21 out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE OFF structure, as measured by NestedAlign. given RNA sequence a , .. ., a and structure S .Our Here, the expected accuracy score s is defined by 1 n 0 software RNAborMEA not only computes the structures σ (S)=2 · p + q MEA(k) having maximum expected accuracy over all i,j i (i,j)∈S iunpaired structures S, whose base pair distance d (S , S) is equal BP 0 to k.Inaddition, RNAborMEA allows the user to enter where first sum is taken over all base pairs (i, j) structural constraints, which specify partial secondary belonging to S, and the second sum is taken over all structures required of all MEA(k) structures, if so unpaired positions in S,and where p [resp. q ]isthe i,j i desired. Additionally, RNAborMEA computes an analo- probability that i, j are paired [resp. i is unpaired] in the gue of the temperature-dependent partition function, ensemble of low energy structures, as computed by defined by McCaskill’s algorithm [27]. Finally, RNAborMEA allows the user to sample structures from the maximum Z (T)= exp(σ (S))/RT expected accuracy ensemble, in a fashion analogous to {S:d (S ,S)=k} BP 0 Ding-Lawrence sampling from the low energy Boltz- and mann ensemble, as in Sfold[30]. Our preliminary investigations have not indicated a Z(T)= Z = exp(σ (S))/RT. clear application of the partition function analogue, k S though it maybeconstrued to providearepresentation Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 8 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 6 For each RNA sequence in the seed alignment from Rfam family RF00167 of purine riboswitch aptamers, we retrieved downstream flanking residues from the appropriate EMBL files, in order to ensure likelihood that the expression platform was included. Then the following six programs were run: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. Each program outputs a number of near-optimal secondary structures, each according to different criteria. Taking RNAbor and RNAborMEA as examples, the programs RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S . Subsequently, we applied the program NestedAlign described in [36] to compute the structural similarity between the experimentally determined GENE OFF structure for XPT guanine riboswitch of B. subtilis; i.e. the right panel of Figure 3. (Similar structures have positive scores; dissimilar structures have negative scores.) For each RNA of the seed alignment of RF00167, we determined the value k , such the MEA(k ) structure for that RNA has the 1 1 greatest structural similarity with the XPT GENE OFF structure, as determined by NestedAlign. (See the right panel of Figure 3 for the experimentally determined GENE OFF structure of XPT.) As earlier explained, we performed similar computations for the programs RNAshapes [39] and UNAFold [41], the programs RNAborMEA and RNAbor-Sample, described in this paper, and RNAbor[9] and RNAlocopt[34]. In 22 out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE OFF structure, as measured by NestedAlign. of the temperature-dependent mixing of various struc- sourceforge.net/projects/rnabormea/ and http://bioinfor- matics.bc.edu/clotelab/RNAborMEA/. tures having large score s. On the other hand, in com- putational experiments reported in the Results Section, it appears that RNAborMEA produces near-optimal Methods structures that are closer to the biologically functional Preliminaries structures, in the case of conformational switches that Recall the definition of RNA secondary structure. are difficult to predict by any method. Definition 1 A secondary structure S on RNA Indeed, in 18 [resp. 11] out of 34 instances, RNAbor- sequence a , .. ., a is defined to be a set of ordered pairs 1 n MEA produced the secondary structure most structurally (i, j), such that 1 ≤ i<j ≤ n and the following are satis- similar to the experimentally determined XPT GENE fied. ON [resp. GENE OFF] structure, as measured by Nes- tedAlign[36]. See Table 4. Since there appears to be 1. Watson-Crick or GU wobble pairs: If (i, j) belongs little to no correlation between the structures MFE(k) to S, then pair (a , a ) must be one of the following i j output by RNAbor[20] and the structures MEA(k)out- canonical base pairs: (A, U), (U, A), (G, C), (C, G), put by our current program RNAborMEA, it appears (G, U), (U, G). that RNAborMEA yields a signal that is orthogonal and 2. Threshold requirement: If (i, j) belongs to S, then j complementary to that provided by state-of-the-art ther- - i> θ,where θ, generally taken to be equal to 3,is modynamics software, such as UNAFold, RNAfold, the minimum number of unpaired bases in a hairpin RNAstructure, Sfold, RNAshapes, RNAbor,etc. loop; i.e., there must be at least θ unpaired bases in For thesereasons,wefeelthat RNAborMEA has a cer- a hairpin loop. tain value, along with the programs UNAFold, RNA- 3. Nonexistence of pseudoknots: If (i, j) and (k, ℓ) fold, RNAstructure, Sfold, RNAshapes, belong to S, then it is not the case that i < k < j <ℓ. RNAbor, etc. when producing suboptimal structures. 4. No base triples: If (i, j) and (i, k) belong to S, then RNAborMEA is written in C and available at http:// j = k; if (i, j) and (k, j) belong to S, then i = k. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 9 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 7 Figure depicting the increasing divergence between RNAborand RNAborMEA. For each RNA sequence in the seed alignment from Rfam family RF00066 of U7 small nuclear RNAs, both RNAbor and RNAborMEA were run, in order to compute the MFE(k) structure and the MEA(k) structure, which have minimum free energy resp. maximum expected accuracy among all k-neighbors of the intial minimum free energy structures S . We computed the base pair distance between the MFE(k) structure and the MEA(k) structure over all sequences in the seed alignment of RF00066. The figure displays the average ± one standard deviation of base pair distance. The preceding definition provides for an inductive as follows. If 0 ≤ d ≤ θ,then S = {∅};i.e., theonly i,i+d construction of the set of all secondary structures for a secondary structure for a , .. ., a is the empty struc- i i+d given RNA sequence a , .. ., a . For all values of d =0, .. ture containing no base pairs (due to the requirement 1 n ., n and all values of i =1, .. ., n - d, the collection that all hairpins contain at least θ unpaired bases). If d S of all secondary structures for a , .. ., a is defined > θ and has been defined by recursion for all i ≤ j< i,i+d i,j i i+d Figure 8 Sample outputs from RNAborMEAon a 143 nt TPP-riboswitch, AF269819/1811-1669 with sequence CUACUAGGGG AGCCAAAAGG CUGAGAUGAA UGUAUUCAGA CCCUUAUAAC CUGAUUUGGU UAAUACCAAC GUAGGAAAGU AGUUAUUAAC UAUUCGUCAU UGAGAUGUCU UGGUCUAACU ACUUUCUUCG CUGGGAAGUA GUU. We took the TPP riboswitch aptamer from the Rfam database [19], then extracted right-flanking nucleotides from the corresponding EMBL file, in order to include the expression platform. Displayed from left to right are the structures MEA(0) and MEA(61) (the structure MEA(52) is similar to that of MEA(61) and corresponds to a free energy local minimum in the left figure.) The structure MEA(61) had the highest MEA score over all structural neighbors, including the original structure S = MEA(0), and had free energy, -46.0 kcal/mol, that was equal to that of the initial structure S = MEA(0), which is the minimum free energy 0 0 structure for the given sequence. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 10 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 9 (Left) Free energy for all MEA(k) structural neighbors, 0 ≤ k ≤ 99, of the TPP-riboswitch, AF269819/1811-1669, described in the previous figure. Clearly, MEA(0) and MEA(61) have the least energy, - 46.0 kcal/mol, and MEA(61) has the largest MEA score, 134.555, of all secondary structures for the given RNA sequence. It is more common that the free energy of the MEA(k) structure is monotonically increasing as a function of k. (Right) MEA score for all MEA(k) structural neighbors, 0 ≤ k ≤ 99, of the TPP-riboswitch, AF269819/1811-1669, described in the previous figure. Clearly, MEA(61) has the largest MEA score, 134.555, of all secondary structures for the given RNA sequence. Figure 10 Initial portion of pseudocode for RNAborMEAalgorithm, which continues in Figure 11. Given RNA sequence s = s , .. .,s of 1 n length n, initial secondary structure S of s, RNAborMEA computes for all values of 0 ≤ k ≤ n that structure S with base pair distance k from S , 0 0 M(i, j, k)= 2αp + βq i,j i which maximizes the value . The pseudocode actually computes only values M(i, j, k) for all i, j, k; (i,j)∈S iunpaired in s 5 3 the MEA structures are obtained by backtracing. This algorithm clearly runs in O(n ) time with O(n ) space. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 11 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 11 Pseudocode for RNAborMEAalgorithm. Given RNA sequence s = s , .. ., s of length n, initial secondary structure S of s, 1 n 0 RNAborMEA computes for all values of 0 ≤ k ≤ n that structure S with base pair distance k from S , which maximizes the value M(i, j, k)= 2αp + βq i,j i . The pseudocode actually computes only values M(i, j, k) for all i, j, k; the MEA structures are (i,j)∈S iunpaired in s 3 3 obtained by backtracing. This algorithm clearly runs in O(n ) time with O(n ) space. Figure 12 Pseudocode for the O(n ) traceback computed by our RNAborMEAalgorithm. Note that run time could be reduced to O(n ln n) by applying the boustrephedonic method described in [42]. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 12 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Figure 13 (Left) Pseudo-Boltzmann and uniform probabilities of structural neighbors MEA(k) for the 49 nt SECIS sequence fdhA, with nucleotide sequence CGCCACCCUG CGAACCCAAU AAUAAAAUAU ACAAGGGAGC AAGGUGGCG and where S is (((((((.(((... (((.................))).))).))))))). Here, the (formal) parameter RT taken to be 49 (length of sequence), in order to uniformize MEA scores to range (k) (k) between 0 and 1. The pseudo-Boltzmann probability is defined by , where (i) Z = Σexp(MEA(S)/RT), the sum being taken over all P (k)= Z (k) (k) (k) S such that d (S , S)= k, and (ii) Z = Σ Z . The uniform probability is defined by , where N is the number of k-neighbors of S BP 0 k P (k)= 0 and N is the total number of secondary structures. (Right) Pseudo-Boltzmann probabilities for MEA(k) structural neighbors of the 27 nt Vienna bistable switch with nucleotide sequence CUUAUGAGGG UACUCAUAAG AGUAUCC and initial (minimum free energy) structure.......((((((((....)))))))). The left curve is when RT = 0.6, the approximate value obtained by multiplying the universal gas constant 0.00198 kcal/mol times 310 Kelvin. In contrast, the right curve is when RT = 27 (length of sequence). Though not shown in this graph, the pseudo-Boltzmann distribution is identical with the uniform distribution, when RT = n, where n is sequence length. i + d, then then for any secondary structure S for a , .. .,a and i r-1 any secondary structure T for a , ..., a , the struc- r+1 j-1 � Any secondary structure of a , .. ., a is a sec- ture S ∪ T ∪ {(r, j)} is a secondary structure for a , .. i i+d-1 i ondary structure for a , .. ., a ,inwhich a is ., a . i i+d i+d i+d unpaired. � If a , a can form a Watson-Crick or wobble base Given two secondary structures S, T,wedefinethe i j pair, then for any secondary structure S for a , .. ., base pair distance between S, T, denoted by d (S, T), i+1 BP a , the structure S ∪ {(i, j)} is a secondary struc- to be the cardinality of the symmetric difference of S, T; i+d-1 ture for a , ..., a . i.e., d (S, T)= |(S - T) ∪ (T - S)|. i i+d BP � For any intermediate value i +1 ≤ r ≤ j - θ -1,if RNAbor-Sample a , a can form a Watson-Crick or wobble base pair, In this section, we describe how sampling from the r j Boltzmann ensemble, using Sfold[30], leads to a sim- ple and fast approximation of RNAbor computations, Table 1 Number of samples needed for high-confidence provided that the input consists of an RNA sequence approximation of Boltzmann probabilities and the minimum free energy (MFE) secondary struc- PK ε zN ture for that sequence. The work of this section is 0.05 1 0.01 1.45 9506 inspired by sampling approximations of the number of 0.05 100 0.01 3.48 30276 structural motifs, such as hairpins, multiloops, etc. of 0.05 1000000 0.01 5.45 74256 Ding and Lawrence [30], as well as the sampling 0.001 100 0.01 3.89 37830 approximation used in RNAshapes[8] for large 0.000001 100 0.01 5.73 82082 sequences. A novelty of our work is that we provide a 0.05 1 0.001 1.45 950600 rigorous justification for the accuracy of the approxima- 0.05 100 0.001 3.48 3027600 tion, depending on sample size. 2 Let RNAbor-Sample denote the protocol, where we −1 The number of samples sufficient to 2K apply Sfold[30] to sample N secondary structures S of N = N(ε, p, K)= 4ε an input RNA sequence a , .. .,a ,thensubsequently guarantee that |f - p |< ε with confidence 1 - p, for all 0 ≤ k<K, in the 1 n k k compute the relative frequencies f for 0 ≤ k<K,where application RNAbor-Sample.Here , the Boltzmann probability, as k p = f = is defined to be the number N of sampled computed exactly by RNAbor, for a k-neighbor of S , and f is the relative k k 0 k frequency of k-neighbors among N sampled structures. structures S, whose base pair distance with S is k, 0 Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 13 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Table 2 Comparison of NestedAlign similarity scores for the GENE ON structure of the XPT guanine riboswitch index EMBL RNAbor RNAborMEA RNAbor-Sample RNAlocopt RNAshapes UNAFold 0 AL591981/205922-205823 -9.0 5.0 -9.0 -8.5 -9.0 -9.0 1 CP000764/271074-271175 -43.5 5.0 -37.5 -44.5 -23.0 -53.0 2 CP000764/308099-308200 -27.0 -18.0 -24.5 -31.5 -25.5 -22.0 3 BA000028/760473-760574 -25.5 -0.5 -36.0 -38.5 -24.5 -31.0 4 CP000557/252200-252301 -9.5 8.5 -9.5 8.5 -10.0 -12.0 5 X83878/168-267 60.0 87.5 57.0 66.0 64.0 59.0 6 BA000004/1593074-1592973 35.0 16.5 -13.5 -21.5 -19.0 -13.5 7 AAOX01000023/19446-19345 -15.0 -2.0 -13.0 -18.5 -13.5 -15.5 8 CP000416/1798040-1798138 5.5 1.5 1.5 12.0 4.5 -4.5 9 CP000721/398929-399026 26.0 24.5 16.5 -20.0 21.5 -32.0 10 BA000028/1103943-1104044 1.0 1.5 2.0 -0.5 0.5 0.5 11 ABDQ01000002/251055-251152 -16.0 -2.5 -16.5 -21.5 -17.5 -22.5 12 AAXV01000026/31334-31233 11.5 6.0 -1.5 -8.5 22.0 -3.0 13 AE016877/298774-298875 -18.5 14.0 -17.5 -34.0 -12.0 -26.5 14 BA000004/676475-676576 -28.5 -31.0 -28.0 -69.0 -21.0 -29.5 15 AE017333/692981-693082 -1.5 2.5 -11.5 -9.5 -5.5 -53.0 16 AM180355/256217-256318 -17.0 -45.0 -45.5 -49.0 -48.0 -49.0 17 AM406671/1321062-1320965 -25.5 -15.0 -22.0 -28.5 -23.5 -23.5 18 CP000612/2598111-2598012 -42.0 -39.5 -42.0 -47.5 -39.0 -38.5 19 CP000002/697032-697134 -8.0 -11.0 -10.5 -10.0 -4.5 -7.5 20 CP000002/2295936-2295837 23.5 47.0 31.5 21.0 30.0 22.5 21 AL596170/223345-223246 -0.5 7.0 0.5 -8.5 -10.0 -10.0 22 ABDQ01000005/131908-131807 -33.0 -15.5 -31.5 -31.5 -19.0 -50.0 23 AAOX01000052/9069-8968 -13.5 1.5 -14.0 -21.0 -15.5 -14.5 24 AE017333/4024324-4024425 -29.5 -26.5 -33.5 -24.0 -23.5 -36.0 25 AP006627/1554717-1554818 -31.5 -1.5 -37.0 -44.5 -28.5 -43.5 26 CP000024/1182948-1183043 -0.5 -18.5 -9.0 4.0 2.0 -19.0 27 BA000028/786767-786867 -18.0 -41.5 -48.0 -46.5 -49.0 -44.5 28 ABDP01000002/29688-29587 -34.5 -42.5 -34.5 -37.0 -35.0 -50.0 29 BA000043/272473-272574 -9.5 4.0 -9.5 -10.0 -3.0 -12.5 30 CP000724/944285-944386 -30.5 -21.5 -30.5 -28.5 -26.5 -31.5 31 CP000764/1409725-1409826 14.0 -3.0 -18.0 -24.0 -11.5 -20.0 32 AAEK01000017/86437-86538 -44.5 -44.0 -41.5 -52.0 -35.0 -49.0 33 CP000764/357645-357544 11.0 -13.5 -33.0 -26.0 -18.5 -36.0 NestedAlign similarity scores between the GENE ON structure of the XPT guanine riboswitch of B. subtilis, experimentally determined using in-line probing (see [35]), and the structurally most similar secondary structure among near-optimal structures generated by each of the following six methods: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. These values are plotted in Figure 5, where more details on the computational experiment are given. divided by N.Since Sfold appears to be only available As usual, let Z denote the partition function, repre- as a web server, where the user can not stipulate the senting the sum of Boltzmann factors of all secondary number of sampled structures, we instead use the structures of a , .. ., a ; i.e., 1 n Vienna RNA Package implementation of Sfold,given Z = exp(−E(S)/RT) in RNAsubopt -p[32]. Let a , .. .,a be an arbitrary RNA sequence having 1 n MFE structure of S . Following [9], let Z denote the 0 k Z and let p = denote the Boltzmann probability of all sum of Boltzmann factors of all k-neighbors of S ; i.e., k-neighbors. Given a desired approximation accuracy of ε, a prob- Z = exp(−E(S)/RT). ability p, and an upper bound K, we wish to compute a d (S ,S)=k BP 0 value N = N(ε, p, K), such that whenever we sample at Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 14 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 Table 3 Comparison of NestedAlign similarity scores for the GENE OFF structure of the XPT guanine riboswitch Index EMBL RNAbor RNAborMEA RNAbor-Sample RNAlocopt RNAshapes UNAFold 0 AL591981/205922-205823 27.5 28.5 28.5 25.5 25.5 25.5 1 CP000764/271074-271175 13.0 12.5 11.0 6.5 12.0 5.5 2 CP000764/308099-308200 24.0 26.0 26.5 23.0 24.5 26.5 3 BA000028/760473-760574 18.5 22.0 13.0 20.5 23.5 23.0 4 CP000557/252200-252301 7.0 8.0 7.0 10.0 6.5 4.5 5 X83878/168-267 143.0 143.5 143.0 141.0 143.0 141.0 6 BA000004/1593074-1592973 41.0 39.0 41.0 36.0 38.0 41.0 7 AAOX01000023/19446-19345 47.5 45.5 46.0 42.5 34.0 43.5 8 CP000416/1798040-1798138 17.5 12.5 12.5 13.0 11.5 12.5 9 CP000721/398929-399026 36.5 20.5 23.0 -38.5 34.5 -52.5 10 BA000028/1103943-1104044 32.0 29.5 32.0 27.5 30.5 30.0 11 ABDQ01000002/251055-251152 27.0 26.0 26.5 24.0 25.5 7.5 12 AAXV01000026/31334-31233 37.5 38.5 38.0 32.5 35.0 36.0 13 AE016877/298774-298875 24.0 25.5 23.0 19.0 23.0 22.5 14 BA000004/676475-676576 9.0 4.5 6.5 -35.5 5.0 9.0 15 AE017333/692981-693082 -30.0 -9.5 -23.5 -25.5 -17.0 -70.5 16 AM180355/256217-256318 -23.5 -24.0 -25.0 -27.0 -23.5 -27.0 17 AM406671/1321062-1320965 -0.5 3.5 1.0 -10.0 1.0 0.5 18 CP000612/2598111-2598012 -12.0 -9.0 -8.0 -8.5 -9.5 -9.0 19 CP000002/697032-697134 16.5 7.0 12.0 14.0 16.5 7.5 20 CP000002/2295936-2295837 75.0 73.0 75.5 71.0 72.0 69.5 21 AL596170/223345-223246 30.5 31.5 30.5 28.5 29.5 29.5 22 ABDQ01000005/131908-131807 12.5 3.0 13.0 10.5 13.5 4.5 23 AAOX01000052/9069-8968 12.5 14.5 13.5 11.0 12.0 12.0 24 AE017333/4024324-4024425 -3.5 2.5 3.5 6.0 -2.5 -1.5 25 AP006627/1554717-1554818 22.5 18.0 22.5 14.5 25.5 12.5 26 CP000024/1182948-1183043 6.0 7.0 6.5 6.0 5.0 6.0 27 BA000028/786767-786867 -23.5 -19.5 -23.0 -24.5 -21.0 -24.0 28 ABDP01000002/29688-29587 3.0 1.0 2.5 1.0 4.5 0.5 29 BA000043/272473-272574 17.5 12.5 12.5 13.5 12.5 11.5 30 CP000724/944285-944386 10.0 11.0 10.5 7.0 12.0 9.5 31 CP000764/1409725-1409826 32.5 36.0 32.0 26.5 35.0 30.5 32 AAEK01000017/86437-86538 11.5 11.5 13.0 8.0 13.0 11.0 33 CP000764/357645-357544 23.5 22.0 24.5 24.0 22.0 22.5 NestedAlign similarity scores between the GENE OFF structure of the XPT guanine riboswitch of B. subtilis, experimentally determined using in-line probing (see [35]), and the structurally most similar secondary structure among near-optimal structures generated by each of the following six methods: RNAbor, RNAborMEA, RNAbor-Sample, RNAlocopt, RNAshapes, UNAFold. These values are plotted in Figure 6, where more details on the computational experiment are given. Table 4 Number of times that the most similar structure was produced Method greatest similarity to gene on greatest similarity to gene off RNAborMEA 18 11 RNAbor 711 RNAbor-Sample 28 RNAlocopt 32 RNAshapes 58 UNAFold 13 Number of times that the most similar structure to the GENE ON resp. GENE OFF structure of the B. subtilis XPT riboswitch was produced by each of the six methods investigated. Although the test was made with 34 sequences from the seed alignment of Rfam family RF00167 [31], the sums of each column may exceed 34; this is because If two or more methods produced the maximum score, then each was counted. Structural similarity was measured using the NestedAlign structural alignment algorithm [36]. While the GENE OFF structure involves a terminator loop, that is often correctly found by thermodynamics- based software, the GENE ON secondary structure, having higher free energy (hence less stable thermodynamically) is less likely to be found using thermodynamics-based approaches. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 15 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 least N secondary structures from the Boltzmann p (1 − p ) k k ε = z · . ensemble using Sfold,the relative frequency f of k- α/2 neighbors sampled is within ε of the probability p of k- neighbors, for all 0 ≤ k< K, with confidence level of (1 It follows that - p). Formally, this means that for each 0 ≤ k<K, 2 2 z z α/2 α/2 P(|f − p | <ε) ≥ 1 − p. N = N(α, ε)= ≥ · p (1 − p ) (4) k k k k 2 2 4ε ε Consider the value k as bin number.For afixed bin k, provides a sufficient lower bound on number of sam- let us denote the exact value of by p . If we sample N ples necessary to guarantee margin of error ε.Let α = structures, each falling in the kth bin with probability p , and define then the number of structures in the kth bin is given by the random variable X having binomial distribution p −1 p 2K 2K (6) with mean N · p and variance N · p (1 - p ). It follows N = N(ε, p, K)= = . k k k 4ε 4ε that the proportion of structures in the kth bin has mean µ = p and standard deviation We have just shown that for N ≥ N(ε, p, K), p (1−p ) −1 P k k 1 √ . To determine minimum sample , hence σ = < P(|z| > | |) < 2K K 2 N size sufficient to ensure a certain approximation accu- ⎛ ⎞ racy with certain confidence interval, we adapt a stan- |f − p | p p ⎜ k k ⎟ −1 dard argument from statistics [37] (see equation (24.35) P > | | < . ⎝ ⎠ p (1−p ) 2K K k k on p. 529), by approximating the binomial distribution by the standard normal distribution using Z-scores. Before starting, we mention that it will suffice for our The following is now a key step. If we have K bins, intended application of RNAbor-Sample to have a pre- and we desire to have a small probability p that we are cise approximation of those p which exceed some mod- off by more than ε in our estimate of the probability of est lower bound, such as δ =0.01or δ = 0.0001. Thus any bin (in our intended application, the kth bin, for 0 ≤ we intend to prove that for all 0 ≤ k<K, if p ≥ δ,then k<K, is the collection of k-neighbors of S , i.e., those Equation (4) holds. structures S, whose base pair distance with S is k)then Temporarily, we fix k.Let X be a Bernouilli random it sufficesthatwehaveaprobability that we are off variable with success probability p , corresponding to by more than ε in any single bin. Indeed, let Y denote the indicator random variable that returns 1 if a single the indicator random variable, with value 1 provided sampled secondary structure is a k-neighbor of S .Pro- that |f - p |> ε,where f is the relative frequency of k k k vided that we sample a number N of structures, which sampling a k-neighbor of S , after sampling N secondary satisfies N · p ≥ 30, then the standard normal distribu- structures, where by Equation (5), N is chosen so that tion can be used to approximate the left and right tail of the distribution of Z-scores of sampled proportions N(|f − p |) p k k P(|z| >ε)= P >ε < i=1 , defined by f = K p (1 − p ) k k k x − μ f − p N(f − p ) k k k k then z = = = . (5) σ p (1−p ) k k p (1 − p ) k k N P(Y ∨ ··· ∨ Y ) < K · p/K = p. 0 K−1 1 2 Let (z)= exp(−x /2)dx denote the cumu- Putting everything together, we have shown that for −∞ 2π lative distribution function (CDF) for the standard nor- given ε, p, K, we can define by defining N mal distribution. Given desired confidence interval of C −1 =1- a, recall that the critical value z is defined by ( ) a/2 2k (7) N = N(ε, p, K)= 4ε −1 −1 z = (1 − α/2) = | (α/2)|. α/2 we have If ε is the margin of error in the left tail (-∞,-z ) and a/2 ⎛ ⎡ ⎤ ⎞ right tail (z ,+∞) for the normal approximation of the a/2 |f − p | p ⎜ ⎢ k k ⎥ ⎟ −1 binomial distribution, then by a well-known argument P [∃0 ≤ k < K] > < p ⎝ ⎣ ⎦ ⎠ p (1−p ) 2K k k (e.g. equation (24.35) on p. 529 of [37]), we have N Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 16 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 We have completed a more rigorous argument using previous section. At the same time, the value M(j, i, k), Chernoff bounds, but prefer the exposition given here stored in the lower triangular portion of matrix M,will for simplicity. Note that the same argument, given ver- consist of a triple r, k , k of numbers, such that the fol- 0 1 batim, can be used for any binning procedure. In parti- lowing approximately holds (in this section, we provide cular, this approach provides information on sufficient the motivating idea; the actual algorithm description, number of samples in order to approximate the result of which deviates slightly from the description here, is RNAshapes [8,38,39] by means of sampling. given inthe next sectionand in Figures10and 11). (i) We can make some basic conclusions from the above If r =0then M(i, j, k) is maximized by a k-neighbor S analysis. The number of samples sufficient to ensure of S [i, j] for the subsequence a , .. ., a in which a is 0 i j j that |f - p |< ε for 0 ≤ k< K with confidence 1 - p is unpaired. In this case, k = k and k =0. (ii) If r = i, k k 0 1 reasonable, and only slightly increases for a higher num- then M(i, j, k) is maximized by a k-neighbor S of S [i, j] ber K of bins or to ensure greater confidence 1 - p. for the subsequence a , ...,a in which base pair (i, j) Î S. i j However, the number of samples increases greatly when In this case, k = 0 and k = k -1. (i) If i<r ≤ j - θ -1 0 1 higher precision estimates (smaller ε values) are needed, then M(i, j, k) is maximized by a k-neighbor S of S [i, j] even for one bin. for the subsequence a , .. .,a in which base pair (r, j) Î i j In the case of one bin, it is important to remember S.The left portionof S,which is S[i, r -1]willbea k that the value N is a conservative estimate, sufficient to neighbor of S[i, r - 1], while the right portion of S, ensure our conclusion. This estimate uses the worst which is S[r, j] must contain the base pair (r, j) and itself case of 50% probability of being in a bin, which maxi- be a k neighbor of S[r, j]. In summary, the values r, k , 1 0 mizes the standard deviation. For bins with small prob- k will be used in computing the traceback, where the ability, one can re-estimate what is needed. However, maximum expected accurate structure that is a k-neigh- for bins with smaller probability, higher precision is bor of S[i, j] will be constructed by one of the following: needed as well, unless all that is required is to verify (i) MEA k-neighbor of S[i, j - 1], in the event that a is that the bin has small probability. Also, clearly if a bin unpaired in [i, j]; (ii) MEA k -1-neighbor of S[i +1, j - -6 has probability of 10 , then at least on the order of one 1], in the event that a , a form abasepair; (iii) MEA i j million samples are needed, just for a reasonable prob- k -neighbor of S[i, r -1] and the MEA k -neighbor of S 0 1 ability of sampling the bin once. [r, j], where k + k = k, in the event that a , a form a 0 1 r j base pair. Algorithm description Pseudocode for the algorithm RNAborMEA is given in GivenanRNA sequence a = a , .. ., a , a secondary Figures 10 and 11. An array M of size n× n×Kmax is 1 n structure S of a, and a maximum desired value Kmax ≤ required to store the MEA scores in M(i, j, k)for all n,the RNAborMEA algorithm computes, for each 1 ≤ i subsequences [i, j] and all base pair distances 0 ≤ k ≤ <j ≤ n and each 0 ≤ k <Kmax ≤ n, the maximum score Kmax between structures S[i, j] and initially given struc- M(i, j, k) ture S [i, j]. For 1 ≤ i ≤ j ≤ n and all 0 ≤ k ≤ Kmax,the pseudocode in Figure 11 stores a value of the form (x, y, 2αp + βq z) in the lower triangular portion, M(j, i, k), of the array. i,j i (i,j)∈S iunpaired Here, x = 0 indicates that the optimal structure on [i, j], i.e., having maximum MEA score over all k-neighbors of where the first sum is taken over all base pairs (i, j) S [i, j], is obtained by not pairing j with any nucleotide belonging to S, the second sum is taken over all in [i, j]; for values x>0, hence x Î [i, j - θ - 1], the opti- unpaired positions in S,and where p [resp. q ]isthe i,j i mal k-neighbor of S [i, j] is obtained by pairing x with j. probability that i, j are paired [resp. i is unpaired] in the The values y, z correspond to the values k , k ,such 0 1 ensemble of low energy structures, and a, b >0are that: (i) if x = 0, then the optimal k-neighbor of S [i, j] weights. Our computational experiments, as in Figure 9, is obtained by first computing the optimal k -neighbor were carried out with default values of 1 for a, b.(See of S [i, j - 1], where k = k - b , then leaving j unpaired; 0 0 0 Equation 1 for the formal definition of Boltzmann base (ii) if x = i, then the optimal k-neighbor of S [i, j]is pairing probability p .) i,j obtained by first computing the optimal k -neighbor of The dynamic programming computation of M(i, j, k) S [i +1, j -1], where k = k - b , then adding the 0 1 1 is performed by recursion on increasing values of j - i enclosing base pair (i, j); (iii) if x = r Î [i +1, j - θ -1], for all values 1 ≤ i ≤ j ≤ n and 0 ≤ k ≤ Kmax. The value then the optimal k-neighbor of S [i, j] is obtained by of M(i, j, k), stored in the upper triangular portion of first computing the optimal k -neighbor of S [i, r -1]as 0 0 matrix M, will involve taking the maximum over three well as the optimal k -neighbor of S [r +1, j - 1], then 1 0 cases, which correspond to the inductive construction of adding the base pair (r, j). This last calculation must be all secondary structures on a , .. ., a , as described in the i j done over all values k , k such that k + k = k.Using 0 1 0 1 Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 17 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 3. Franch T, Gultyaev AP, Gerdes K: Programmed Cell Death by hok/sok of the values M(j, i, k)=(x, y, z), the traceback can be Plasmid R1: Processing at the hok mRNA 3H-end Triggers Structural easily computed by recursion; see Figure 12 for pseudo- Rearrangements that Allow Translation and Antisense RNA Binding. J code of traceback. Mol Biol 1997, 273:38-51. 4. Mandal M, Boese B, Barrick J, Winkler W, Breaker R: Riboswitches control Inamannersimilar to thepseudocodeofFigures 10 fundamental biochemical pathways in Bacillus subtilis and other and 11 (essentially, one replaces the operation of taking bacteria. Cell 2003, 113(5):577-586. the maximum by the a summation, and one replaces the 5. Cheah MT, Wachter A, Sudarsan N, Breaker RR: Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature 2007, MEA score by the pseudo-Boltzmann factor exp(MEA 447(7143):497-500. (S)/RT)), RNAborMEA also computes the pseudo-Boltz- 6. Ray PS, Jia J, Yao P, Majumder M, Hatzoglou M, Fox PL: A stress-responsive mann partition function values RNA switch regulates VEGFA expression. Nature 2009, 457(7231):915-919. 7. Voss B, Meyer C, Giegerich R: Evaluating the predictability of (k) conformational switching in RNA. Bioinformatics 2004, 20(10):1573-1582. Z = exp(MEA(S/RT)). i,j 8. Voss B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of {S∈S :d (S ,S)=k) i,j BP 0 RNA shapes. BMC Biol 2006, 4(5). 9. Freyhult E, Moulton V, Clote P: Boltzmann probability of RNA structural (k) 1,n We have graphed the Boltzmann probabilities as neighbors and riboswitch detection. Bioinformatics 2007, 23(16):2054-2062, (k) 1,n N (k) Doi: 10.1093/bioinformatics/btm314. 1,n well as the uniform probabilities ,where is the 1,n 1,n 10. Barash D: Second eigenvalue of the Laplacian matrix for predicting RNA number of k-neighbors, and N is the total number of 1,n conformational switch by mutation. Bioinformatics 2004, 20(12):1861-1869. secondary structures. When RT = n, which normalizes 11. Mandal M, Breaker RR: Adenine riboswitches and gene activation by disruption of a transcription terminator. Nat Struct Mol Biol 2004, 11:29-35. the MEA score to a maximum of 1, it appears that the 12. Serganov A, Yuan Y, Pikovskaya O, Polonskaia A, Malinina L, Phan A, Boltzmann distribution is the same as the uniform dis- Hobartner C, Micura R, Breaker R, Patel D: Structural Basis for tribution, as shown in Figure 13. Discriminative Regulation of Gene Expression by Adenine- and Guanine- Sensing mRNAs. Chem Biol 2004, 11(12):1729-1741. 13. Serganov A, Polonskaia A, Phan AT, Breaker RR, Patel DJ: Structural basis for gene regulation by a thiamine pyrophosphate-sensing riboswitch. Acknowledgements Nature 2006, 441(7097):1167-1171. Funding for the research of P. Clote and F. Lou was provided by the Digiteo 14. Abreu-Goodger C, Merino E: RibEx: A web server for locating riboswitches Foundation for the project RNAomics. Additional funding was provided to P. and other conserved bacterial regulatory elements. Nucleic Acids Res Clote by National Science Foundation grants DMS-1016618 and DMS- 2005, 33:W690-W692. 0817971. Any opinions, findings, and conclusions or recommendations 15. Bengert P, Dandekar T: Riboswitch finder - A tool for identification of expressed in this material are those of the authors and do not necessarily riboswitch RNAs. Nucleic Acids Res 2004, 32:W154-W159. reflect the views of the National Science Foundation. 16. Chang TH, Huang HD, Wu LC, Yeh CT, Liu BJ, Horng JT: Computational This article has been published as part of BMC Bioinformatics Volume 13 identification of riboswitches based on RNA conserved functional Supplement 5, 2012: Selected articles from the First IEEE International sequences and conformations. RNA 2009, 15(7). Conference on Computational Advances in Bio and medical Sciences 17. Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA (ICCABS 2011): Bioinformatics. The full contents of the supplement are alignments. Bioinformatics 2009, 25(10):1335-1337. available online at http://www.biomedcentral.com/bmcbioinformatics/ 18. Weinberg Z, Barrick JE, Yao Z, Roth A, Kim JN, Gore J, Wang JX, Lee ER, supplements/13/S5. Block KF, Sudarsan N, Neph S, Tompa M, Ruzzo WL, Breaker RR: Identification of 22 candidate structured RNAs in bacteria using the Author details 1 CMfinder comparative genomics pipeline. Nucleic Acids Res 2007, Department of Biology, Boston College, Chestnut Hill, MA 02467, USA. 2 35(14):4809-4819. Laboratoire de Recherche en Informatique (LRI), Université Paris-Sud XI, 3 19. Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, 91405 Orsay cedex, France. Laboratoire d’Informatique (LIX), Ecole 4 Wilkinson AC, Finn RD, Griffiths-Jones S, Eddy SR, Bateman A: Rfam: Polytechnique, 91128 Palaiseau, France. Department of Mathematics and updates to the RNA families database. Nucleic Acids Res 2009, Computer Science, Denison University, Granville, OH 43023-0810, USA. 37(Database):D136-D140. 20. Freyhult E, Moulton V, Clote P: Boltzmann probability of RNA structural Authors’ contributions neighbors and riboswitch detection. Bioinformatics 2007, 23(16):2054-2062. RNAbor-Sample was conceived by W.A.L., who provided a proof for 21. Freyhult E, Moulton V, Clote P: RNAbor: a web server for RNA structural sample size sufficient to ensure a desired degree of accuracy with a desired neighbors. Nucleic Acids Res 2007, 35(Web):W305-W309. level of confidence. RNAborMEA was conceived by P.C., then designed and 22. Matthews D, Sabina J, Zuker M, Turner D: Expanded sequence programmed by P.C. (prototype implementation in Python) and F.L. dependence of thermodynamic parameters improves prediction of RNA (implementation in C). Computational experiments were carried out by F.L., secondary structure. J Mol Biol 1999, 288:911-940. figures were created by F.L. and P.C. and the manuscript was written by P.C. 23. Xia T, SantaLucia JJ, Burkard M, Kierzek R, Schroeder S, Jiao X, Cox C, Turner D: Thermodynamic parameters for an expanded nearest-neighbor Competing interests model for formation of RNA duplexes with Watson-Crick base pairs. The authors declare that they have no competing interests. Biochemistry 1999, 37:14719-35. 24. Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic Published: 12 April 2012 consistency-based multiple sequence alignment. Genome Res 2005, 15(2):330-340. References 25. Nussinov R, Jacobson AB: Fast Algorithm for Predicting the Secondary 1. Olsthoorn R, Mertens S, Brederode F, Bol J: A conformational switch at the Structure of Single Stranded RNA. Proceedings of the National Academy of 3’ end of a plant virus RNA regulates viral replication. EMBO J 1999, Sciences, USA 1980, 77(11):6309-6313. 18:4856-4864. 26. Kiryu H, Kin T, Asai K: Robust prediction of consensus secondary 2. Repsilber D, Wiese S, Rachen M, Schroder A, Riesner D, Steger G: Formation structures using averaged base pairing probability matrices. of metastable RNA structures by sequential folding during transcription: Bioinformatics 2007, 23(4):434-441. time-resolved structural analysis of potato spindle tuber viroid 27. McCaskill J: The equilibrium partition function and base pair binding (-)-stranded RNA by temperature-gradient gel. RNA 1999, 5:574-584. probabilities for RNA secondary structure. Biopolymers 1990, 29:1105-1119. Clote et al. BMC Bioinformatics 2012, 13(Suppl 5):S6 Page 18 of 18 http://www.biomedcentral.com/1471-2105/13/S5/S6 28. Lu ZJ, Gloor JW, Mathews DH: Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA 2009, 15(10):1805-1813. 29. Zuker M: On finding all suboptimal foldings of an RNA molecule. Science 1989, 244(7):48-52. 30. Ding Y, Lawrence CE: A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 2003, 31:7280-7301. 31. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: Wikipedia, clans and the “decimal” release. Nucleic Acids Res 2011, 39(Database):D141-D145. 32. Gruber A, Lorenz R, Bernhart S, Neubock R, Hofacker I: The Vienna RNA websuite. Nucleic Acids Research 2008, 36:70-74. 33. Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 2004, 101(19):7287-7292. 34. Lorenz W, Clote P: Computing the partition function for kinetically trapped RNA secondary structures. Public Library of Science One (PLoS ONE) 2011, 6:316178, Doi:10.1371/journal.pone.0016178. 35. Wakeman CA, Winkler WC, Dann C: Structural features of metabolite- sensing riboswitches. Trends Biochem Sci 2007, 32(9):415-424. 36. Blin G, Denise A, Dulucq S, Herrbach C, Touz H: Alignments of RNA structures. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010. 37. Zar J: Biostatistical Analysis Prentice-Hall, Inc; 1999. 38. Giegerich R, Voss B, Rehmsmeier M: Abstract shapes of RNA. Nucleic Acids Res 2004, 32(16):4843-4851. 39. Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 2006, 22(4):500-503. 40. Hofacker I: Vienna RNA secondary structure server. Nucleic Acids Res 2003, 31:3429-3431. 41. Markham NR, Zuker M: UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 2008, 453:3-31. 42. Ponty Y: Efficient sampling of RNA secondary structures from the Boltzmann ensemble of low-energy: The boustrophedon method. J Math Biol 2008, 56(1-2):107-127. doi:10.1186/1471-2105-13-S5-S6 Cite this article as: Clote et al.: Maximum expected accuracy structural neighbors of an RNA secondary structure. BMC Bioinformatics 2012 13 (Suppl 5):S6. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color ﬁgure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit

Journal

BMC Bioinformatics – Springer Journals

Published: Apr 12, 2012

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Maximum expected accuracy structural neighbors of an RNA secondary structure

Maximum expected accuracy structural neighbors of an RNA secondary structure

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Maximum expected accuracy structural neighbors of an RNA secondary structure

Maximum expected accuracy structural neighbors of an RNA secondary structure

References (53)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies