Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments

miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments W68–W76 Nucleic Acids Research, 2009, Vol. 37, Web Server issue Published online 11 May 2009 doi:10.1093/nar/gkp347 miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments 1 2 3,4 Michael Hackenberg , Martin Sturm , David Langenberger , 5 1, ´ ´ Juan Manuel Falcon-Perez and Ana M. Aransay * Functional Genomics Unit, CIC bioGUNE, CIBERehd, Technology Park of Bizkaia, 48160 Derio, Bizkaia, Spain, Institute for Bioinformatics and Systems Biology, German Research Center for Environmental Health, Ingolsta¨ dter Landstrasse 1, D-85764 Neuherberg, Department of Genome-Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universitat Mu¨ nchen, 85350 Freising, Bioinformatics Group, Department of Computer Science, University of Leipzig, Haertelstr. 16-18, D-04107 Leipzig, Germany and Metabolomics Unit, CIC bioGUNE, CIBERehd, Technology Park of Bizkaia, 48160 Derio, Bizkaia, Spain Received February 28, 2009; Revised April 13, 2009; Accepted April 22, 2009 INTRODUCTION ABSTRACT The recent years witnessed a profound change in our Next-generation sequencing allows now the understanding of the regulation of gene expression. sequencing of small RNA molecules and the estima- Small non-coding RNA especially came into focus as it tion of their expression levels. Consequently, there became clear that they are key players in many cellular will be a high demand of bioinformatics tools processes by post-transcriptionally regulating gene expres- to cope with the several gigabytes of sequence sion via either degradation, translational repression, or data generated in each single deep-sequencing both (1,2). MicroRNAs, belonging to the family of small experiment. Given this scene, we developed non-coding RNAs, are endogenous in many animal and miRanalyzer, a web server tool for the analysis of plant genomes and are now recognized to be one of deep-sequencing experiments for small RNAs. The the major regulatory gene families in eukaryotic cells. web server tool requires a simple input file contain- They are believed to regulate the expression of around ing a list of unique reads and its copy numbers one third of all genes in the human genome, involved in (expression levels). Using these data, miRanalyzer many fundamental processes like metabolism, develop- (i) detects all known microRNA sequences anno- ment and regulation of the nervous and immune systems tated in miRBase, (ii) finds all perfect matches (3,4). Furthermore, it has been reported that some microRNAs are actively involved in the development of against other libraries of transcribed sequences pathologies like cancer (5). and (iii) predicts new microRNAs. The prediction of The traditional experimental approach to measure new microRNAs is an especially important point the expression levels of microRNAs involves cloning as there are many species with very few known and Sanger sequencing. This is an expensive and time- microRNAs. Therefore, we implemented a highly consuming procedure, and as a consequence, relatively accurate machine learning algorithm for the predic- little expression data is currently available [see (6) for a tion of new microRNAs that reaches AUC values of microRNA expression atlas]. Moreover, the huge range 97.9% and recall values of up to 75% on unseen of microRNA expression from tens of thousands to just data. The web tool summarizes all the described few molecules per cell complicates the detection steps in a single output page, which provides a of microRNAs expressed at low copy numbers. Hence comprehensive overview of the analysis, adding many undetected microRNA may exist even in well- links to more detailed output pages for each analy- explored species. Recently, microRNA expression profil- sis module. miRanalyzer is available at http://web. ing panels became available for measuring expression bioinformatics.cicbiogune.es/microRNA/. levels by means of hybridization. These panels allow a *To whom correspondence should be addressed. Tel: +34 944 061 325; Fax: +34 944 061 324; Email: [email protected] The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. 2009 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2009, Vol. 37, Web Server issue W69 high-throughput detection of microRNA expression. miRANALYZER However, they do not allow the detection of new Input file description microRNAs. A usual next-generation sequencing experiment produces Next generation sequencing platforms like Genome TM up to several gigabytes of output corresponding to several Analyzer (Illumina Inc.) or Genome Sequencer FLX TM hundred million base pairs. That is by far too many data (454 Life Science and Roche Applied Science) became to send over the web to analyze it using a web server tool. recently available for the sequencing of small RNA mole- However, some reads (tags) obtained in microRNA cules, which allows both the detection of expression levels sequencing experiments can be found multiple times in and new microRNA sequences at high speed and sensitiv- the output. The number of copies detected for a unique ity and low cost. However, each sequencing experiment read is proportional to its expression level. Given this produces up to 3 Gbp of sequence data, whose analysis redundancy, the only information needed for the analysis represents an important bioinformatics challenge. of microRNAs are the sequences of the reads and the Given the importance of microRNAs in the regulation number of times each unique read was encountered in of gene expression, in the coming years many deep- the experiment. This reduces the size of the input file dras- sequencing experiments will be carried out to detect and tically to a few megabytes, which is an acceptable size for a measure their expression. Therefore, user friendly tools web server tool. are required for the processing of the enormous amount The tool accepts two different input formats (see http:// of data that will be generated. To our knowledge, so web.bioinformatics.cicbiogune.es/microRNA/ far there is only one standalone tool available for the manual.html): complete analysis of deep sequencing microRNA data: miRDeep published by Friedla¨ nder et al. (7). Specific (i) a tab separated file with the read sequences and its software for SOLiD data does exist that allows the detec- counts (number of times each read has been tion of known microRNAs but not the prediction of obtained in the experiment) and new microRNAs (http://solidsoftwaretools.com/gf/pro (ii) a multifasta file with the copy number of the unique ject/rna2map). reads (read count) as the description in the header On the other hand, the prediction of microRNA genes (e.g. >ID ‘count’). has been extensively employed over the past years and several distinct approaches have been developed. Some Along with this web-tool, we supply a Perl script, which of the methods used in the purely computational detection counts the reads of a Genome Analyzer (Illumina Inc.) approaches were, for example, conservation of certain experiment, producing the tab separated input format. regions—phylogenetic shadowing (8), different machine The script allows averaging of several lines, filtering for learning methods like support vector machines using low quality reads and a simple analysis of differential structure-sequence features (9), random forest models expression (log ratios between different lines). A more (10) or probabilistic co-learning models (11). Bentwich detailed description of the Perl script can be found on et al. (12) used further features like the stability of the hair- the tutorial page (http://web.bioinformatics.cicbiogune. pin together with an experimental validation. The main es/microRNA/manual.html). drawbacks of these approaches are that they are either limited to conserved microRNAs or that they tend to Input parameters have a high rate of false positive predictions. However, Apart from the file with the read sequences, several other new sequencing experiments open new possibilities in the input options are available as summarized in Table 1. The prediction of microRNAs, allowing the generation of pre- parameters are explained in more detail in the correspond- viously unavailable characteristics like, for example, the ing sections of the manuscript and on the tutorial page traces left by dicer processing. (http://web.bioinformatics.cicbiogune.es/microRNA/ Consequently, we have developed miRanalyzer, a web manual.html). server tool which implements all necessary methods for a comprehensive analysis of deep-sequencing experiments of small RNA molecules. It detects known microRNAs WORKFLOW annotated in miRBase and matches in other transcribed miRanalyzer follows three internal analysis steps sequences (RNA, RFam and RepBase). Furthermore, (Supplementary Figure S1): (i) detection of known miRanalyzer implements a highly accurate machine learn- microRNAs, (ii) mapping against libraries of transcribed ing algorithm to predict new microRNAs (area under sequences (mRNA, ncRNA, etc.) and (iii) prediction of the curve—AUC—value of 97.9%). The algorithm is new microRNAs. After each of these three steps, the based on the random forest classifier and was trained detected reads are removed from the input data following on experimental data. This high accuracy is important the options set by the user (Table 1). for the identification of novel microRNAs, a process which usually results in high false positive rates. The Detection of known microRNAs tool also includes a Perl script for the proper generation of the input file using the Genome Analyzer (Illumina In many of the microRNA experiments, the main purpose Inc.) pipeline results. Currently, miRanalyzer works for will be the detection of the expression levels of known seven frequently used model species (human, mouse, rat, microRNAs (or frequently the differential expression of fruit-fly, round-worm, zebrafish and dog). microRNAs between two samples). Therefore, as the W70 Nucleic Acids Research, 2009, Vol. 37, Web Server issue Table 1. Summary of miRanalyzer input options Input option Description Species The species from which the input reads have been obtained Number of mismatches For the detection of known microRNAs the user can allow matches with up to two mismatches Target gene method Selection of the microRNA target gene prediction method for the ontological analysis. Posterior probability threshold The threshold for the posterior probability calculated by the classification model. Considering adapter sequences The read sequences frequently contain adapter sequences at its 3 end. In this case, the user can take it into account by aligning also sub-sequences of a given minimum length (Data and methods section). Detect just new microRNAs This option skips the detection of known microRNAs. Remove all mRNA matches This option removes all reads which have been perfectly aligned with mRNA sequences. If this option is not set, the program will remove all reads which match in more than five mRNAs as we observed that these reads are frequently poly-A like sequences. Remove RFam/RepBase. These options remove all reads which have mapped to RFam or RepBase. Just predict conserved microRNAs This option limits the prediction of new microRNAs to regions which overlap with a Phylogenetically Conserved Element (PhastCons). first analysis step, miRanalyzer detects the reads which and Methods’ section) and pre-calculated ontological anal- correspond to known microRNAs. To carry out the detec- yses are made available. In the case of ambiguous matches tion of known microRNAs, we used the miRBase repos- where the set of target genes is made up of a combination itory (13) which offers mature (the mature sequences of of various microRNAs, a link to Annotation-Module (14) known microRNAs), mature-star (the sequence which is offered to launch the ontological analysis with the pairs with the mature microRNA in the pre-microRNA obtained gene list. secondary structure) and precursor microRNA sequences Mapping against transcribed sequences (sequence of the hairpin). For some of the microRNA precursors, it is unclear which of the two sequences After detecting reads that correspond to known (mature or mature-star) is biologically functional. In the microRNAs, miRanalyzer maps the remaining reads to case where both sequences are found to be expressed and databases of transcribed sequences as mRNA, non- the predominant product can be clearly detected, the coding RNA (RFam) and (retro)-transposons. Only minor product is labeled with a (mature-star). Apart perfect matches are considered in this analysis. These from the known mature-star sequences we generated a alignments are performed to achieve several aims: library with all other theoretically possible mature-star sequences. This also allows the detection of functional  First, the mapping against the transcriptome should mature-star microRNAs whose expression has not been not yield any matches except for exonic microRNAs observed previously. (1). Therefore, the number of matches can be viewed Many microRNA sequences, especially those belonging as a sample quality parameter (i.e. contamination of to the same microRNA family, exhibit a high degree of the RNA sample with degradation products and poly- sequence similarity. Given that sometimes the read might A tails). be rather short (16 bp), non-unique matches might occur.  Second, the mapping to RFam (and other libraries of A non-unique match exists if a read maps with the same ncRNA) and RepBase has two goals: (i) it might be quality (same number of mismatches) at different posi- interesting to see which other known small ncRNAs tions or to more than one sequence in the library. Often, are in the sample and (ii) the removal of these reads alignment programs such as ELAND (included in will lower the number of false positives in the predic- Illumina Inc. pipeline) do not report these ambiguous tion of new microRNAs (small ncRNA might be con- matches. However, this might result in a loss of important fused with microRNAs). The removal of those information. Therefore, miRanalyzer reports these ambig- sequences is optional (Table 1). uous matches, stating all microRNAs where matches have  Third, we also used the genomic annotation of repeats been found. Note that the groups of microRNAs that and transposons derived by RepeatMasker (http:// have been detected by the same read will normally www.repeatmasker.org). After aligning all reads with belong to the same family. the genome, miRanalyzer checks if the read coordi- The exact order of mapping against known microRNAs nates overlap with those of the RepeatMasker annota- is: mature, mature-star, unknown mature-star and precur- tion. In this way we can detect reads that overlap with sors/hairpin. Both unique matches (a read matches just to ‘degraded’ transposons whose expression might indi- one known microRNA) and ambiguous matches (a read cate ‘domestication’ (acquired function). matches several microRNAs with the same quality) are detected and removed from the input at each step. The Predicting new microRNAs removal is important as otherwise the reads would be detected again in the precursor sequences (hairpins). The detection of new, previously unreported microRNAs After known microRNAs detection, the corresponding is a very important analysis step in miRanalyzer tool as (i) target genes (those genes which are predicted to be regu- a controversy exists over the real number of microRNAs lated by the detected microRNA) are extracted (see ‘Data (15) and therefore it is important to mine sequencing Nucleic Acids Research, 2009, Vol. 37, Web Server issue W71 experiments for new previously undetected microRNAs while the cross-validated results are high, the recall is and (ii) for many species there are none or just a few moderate predicting on unseen data. We highlighted microRNAs known. Consequently, the analysis of (yellow) the worst prediction values on the different test sequencing experiments in these species relies almost com- sets, which are 0.66 (cel/rno), 0.48 (rno/cel) and 0.64 (rno/ pletely on the prediction of new microRNAs. Therefore, hsa). To check whether we can improve prediction power we set up a machine learning approach based on the for those in particular, we merged two datasets and eval- random forest method (16) with a broad range of features. uated against the third set (values highlighted in green). It To train only on the most relevant features, we also can be seen that the prediction improved significantly, employed a feature selection approach (see ‘Data and especially for C. elegans. While trained solely on rat or Methods’ section for detail). human and evaluated on worm a recall of only 0.48 and We used three different data sets from human (hsa), rat 0.67, respectively, could be reached. The merged training (rno) and Caenorhabditis elegans (cel, see ‘Data and set, however, achieves a recall of 0.71, suggesting syner- Methods’ section) for building the final prediction getic effects when integrating instances from different spe- model. The results shown in Table 2 suggest that the clas- cies into the training set. To benefit most from this effect, sifier is highly sensitive and specific not only according to we trained the final classifier on all three data sets. Thus a standard 10-fold cross-validation, but also in a cross- we obtain an area under the curve (AUC) value of 97.9% species test on completely unseen test data. The results with a true positive rate of 0.79 and a false positive rate of shown in the upper part of Table 2 depict the outcome 0.007 for the fixed threshold at 0.9. To test for robustness, when learning with one of the species (training set) and we repeated the cross validation on 10 different negative predicting the remaining ones (test data). For evaluation sets, which resulted in a mean AUC value, true positive of prediction power in the same species, we applied a rate and false positive rate of 97.9%, 0.79 and 0.0077 with 10-fold cross validation approach. It can be seen that the standard deviations of 0.001, 0.01 and 0.003, Table 2. The true positive rates (top part) and false positive rates (bottom part) for different classifiers at a posterior probability threshold of 0.9 Training set Test set rno cel hsa rno-cel rno-hsa cel-hsa rno-cel-hsa True positive rate (threshold: 0.9) CV rno 0.74 0.48 0.64 0.66 0.73 0.57 0.65 CV cel 0.66 0.77 0.69 0.80 0.68 0.79 0.76 CV hsa 0.74 0.67 0.77 0.70 0.84 0.81 0.79 CV rno-cel 0.89 0.91 0.75 0.79 0.80 0.82 0.84 CV rno-hsa 0.91 0.71 0.93 0.80 0.78 0.84 0.86 CV cel-hsa 0.74 0.91 0.91 0.83 0.84 0.81 0.86 CV rno-cel-hsa 0.89 0.91 0.90 0.91 0.91 0.92 0.79 False negative rate (threshold: 0.9) CV rno 0.01 0.008 0.009 0.004 0.008 0.001 0.005 CV cel 0.005 0.004 0.003 0.002 0.01 0 0.005 CV hsa 0.005 0.004 0.01 0.01 0.01 0.005 0.005 CV rno-cel 0.02 0.008 0.01 0.009 0.01 0.007 0.01 CV rno-hsa 0.02 0.01 0.01 0.01 0.01 0.01 0.01 CV cel-hsa 0.005 0.004 0.009 0.004 0.01 0.003 0.01 CV rno-cel-hsa 0.01 0.004 0.003 0.01 0.01 0.009 0.007 The superscripted ‘CV’ denotes that this value was achieved in a standard 10-fold cross-validation approach. The highlighted false positive rates correspond to the true positive rates discussed in the text. Figure 1. Histogram of miRanalyzer scores. Known microRNAs are colored in red, all other data are colored in blue. The insert is a close-up for candidates with scores better than 0.65. W72 Nucleic Acids Research, 2009, Vol. 37, Web Server issue respectively. Note that, the human and C. elegans sets before in the experiment (for example if matched against where also used by Friedlander et al. who reported a RepBase, etc.). recall of 89% on C. elegans and 72% on human. Finally, the last box gives a summary of the filtered and unmapped reads. Table 2 shows that our approach reaches a recall of 75% on human when trained on rat/C. elegans (predicting on unseen data) and 91% on C. elegans using the final prediction model (predicting on previously seen data). CONCLUSIONS Figure 1 shows a cross-species evaluation of miRanalyzer is a web server tool for the integral analysis miRanalyzer trained on human and C. elegans and eval- of next generation sequencing data of small RNA mole- uated on rat. Obviously, most of the data have very low cules. It allows both the detection of known microRNAs scores (the posterior probability assigned by the classifica- and the prediction of new microRNAs. For the prediction tion model to each instance) assigned. We build a close-up of new microRNAs a new sensitive machine learning algo- for the range between 0.65 and 1 to better visualize the rithm was developed which reaches an AUC of 97.9% in high scoring predictions. It can be seen that the known rat our tests. Furthermore, the tool detects matches of the microRNAs are strongly accumulated towards scores of 1, reads against other libraries of transcribed sequences demonstrating the high predictive power of our approach such as mRNA, RFam (RNA) and RepBase and the good ability to generalize. Note that the classifier (Transposons). Currently, the tool works for seven spe- has never seen data from rat before. See also cies, but can easily be extended upon request. Supplementary Figures S3 and S4 for graphical represen- tation of other quality parameters and Receiver Operating Characteristics of different classifiers discussed in this DATA AND METHODS section. Sequence data miRanalyzer uses the newest genome assembly of each A WORKING EXAMPLE available species which were downloaded from the As a working example we used data derived from an UCSC Genome Browser (http://hgdownload.cse.ucsc. experiment carried out in our laboratory with rat hepato- edu/downloads.html): Homo sapiens (hg18, NCBI 36.1), cytes following standard protocols for smallRNA sample Mus musculus (mm8, NCBI 36), Rattus norvegicus (rn4, version 3.4), Drosophila melanogaster (dm3, BDGP preparation and deep-sequencing (http://www.illumina. Release 5), Caenorhabditis elegans (ce6, WUSTL School com/). Figure 2 shows the summary output page of of Medicine GSC and Sanger Institute version WS190), miRanalyzer run on these data. The page is made up of five boxes that reveal the intrinsic workflow of Canis familiaris (canFam2, v2.0) and Danio rerio miRanalyzer. (danRer5). The first box shows the current state of the process The mRNA sequence data were derived from different (executing, pending, etc.) on the left side and depicts a databases: H. sapiens, M. musculus, R. norvegicus and short summary of the process (input data and options) D. rerio from NCBI RefSeq (ftp://ftp.ncbi.nih.gov/ on the right side. refseq/), D. melanogaster from FlyBase (http://flybase. The second box shows the summary of the analysis of org/) and C. elegans from WormBase (http://www.worm known microRNAs. Each column corresponds to the base.org/). The mRNA sequences for C. familiaris were mapping against a different set of sequences (mature, extracted from the genomic sequence using the Galaxy mature-star, etc.). The last row provides a link to detailed platform (17). output for each of the columns. For example, the analysis In addition, mature microRNA sequences were derived of unknown mature-star sequences shows that miR- from mirBase version 12.0 (http://microrna.sanger.ac.uk/ 423-star is moderately expressed (744 copies) while the sequences/); RNA sequences included in RFam version sequence which is annotated in miRBase (mature miR- 9.0 (18) were downloaded from http://rfam.sanger.ac.uk/ 423) has less than 10 copies (Supplementary Figure S2). ; and RepBase version 10.10 (19) were obtained from The third box summarizes the matching of reads to sev- http://www.girinst.org/. Annotations and genomic coordi- eral sets of transcribed sequences. For example the frac- nates of RepeatMasker and PhastCons elements where tion of reads mapped to the transcriptome may give a downloaded from the UCSC table browser (http:// good estimate on the sample quality. It can be seen that genome.ucsc.edu/cgi-bin/hgTables?command=start). We used deep–sequencing data from three different around 8.3% of all reads in this sample originate from experiments: (i) the combined C. elegans data (accession mRNA but this corresponds just to 3% of transcription amount (number of mRNA reads/total number of reads). no. GSE6282 and GSE5990 from GEO database at The fourth box shows the summary of the detection of NCBI), which have been used also in (7) with a total of new microRNAs. In addition, a link is given for further 205 575 unique reads, (ii) data from human HeLa cells (7) information on each read cluster that has been predicted with accession no. GSE10829 and 319 939 unique reads to be a novel microRNA (Supplementary Figure S5). A and (iii) data from rat hepatocytes generated in our lab, link is also provided to a detailed output page with infor- available on our website (http://web.bioinformatics.cic mation on the chromosomal coordinates, the long hairpin biogune.es/microRNA/defaultReads.txt) with 22 086 structure and a verification if the reads have been detected unique reads. Nucleic Acids Research, 2009, Vol. 37, Web Server issue W73 Figure 2. The summary page of miRanalyzer: five boxes are shown which correspond to summary & state of the process, analysis of known microRNA, matches against transcribed sequences, and detection of new microRNAs and summary of unmatched sequences. W74 Nucleic Acids Research, 2009, Vol. 37, Web Server issue Generating ‘unknown mature-star’ sequences site predictions by miRanda software (21) and TargetScan (22). We generated the unknown star sequences by means of the mirBase precursor and mature sequences. First, we Secondary structure prediction calculate the secondary structures for all hairpins using RNAfold (20) with parameters ‘-noLP’. Then, we detect For predicting the secondary structure and its mini- the coordinates of the mature microRNAs within the mum free energy (MFE) we utilized the Vienna RNA pre-microRNA hairpin. By means of these coordinates, package (20). the information of the secondary structure and the char- acteristic ‘2-nt 3 overhang’ caused by Dicer, we extracted The machine learning approach the corresponding sequence pairing with the mature To detect new microRNAs, we set up a machine learning microRNA. approach based on the WEKA (23) implementation of the random forest learning scheme (16) with the number of Read alignment trees set to 100. Note that, the random forest algorithm was applied by Jiang et al. (10) using basically the triplet Read sequences often contain adapter sequences (see stan- structure features introduced by Xue et al. (9). However, dard protocol of small RNA sample preparation at http:// the difference of our approach consists of using a negative www.illumina.com/) at its 3 ends. Therefore, miRanalyzer set derived directly from the experimental data which has two alignment options depending on whether the (i) assures that the sequences are transcribed and reads have adapter sequences or not. In general, the tool (ii) allows the generation of new and previously unused generate a prefix tree of all input reads and subsequently features that seem to be more discriminative than the trip- walk in a single run over the genome to detect the reads. let structure features (see below). By default, miRanalyzer assumes the existence of adapter sequences and therefore, first detects matches of Training and test sets a subsequence of 16 bp starting at the 5 end of the read. When miRanalyzer detects an initial match, it expands the For the machine learning approach we created three data subsequence as long as a perfect match is given. Finally, sets, one from each of the three species: human, C. elegans only matches of the longest subsequence are retained. and rat. First, we extracted all pre-microRNA candidates Note that, in this approach the adapter sequences from the experimental dataset that could be mapped to a are detected implicitly (the sequence at the 3 end of the known microRNA and labeled them as positive instances. read that does not match to the genome is defined as the Second, we selected an equal amount of pre-microRNA adapter) and therefore, the adapter sequences need not to candidates from the same dataset by random selection be known or supplied by the user. with the known microRNAs removed and labeled them as negative. In total we obtained a dataset of 612 instances in human, 468 instances in worm and 376 instances in rat. Ontological analysis We used a recently published tool, Annotation-Modules Features (14), to pre-calculate the significant annotations of all target gene lists for all microRNAs in the miRBase We created a broad variety of features associated with (12.0). Currently, the user can choose between two dif- nucleotide sequence, structure and energy. Table 3 lists ferent target site prediction methods: miRBase target all the features used in this work. Table 3. Features calculated for the generation of the classifier Feature name Description of the feature Read count Number of reads mapping to the pre-microRNA Length The length of the longest hairpin structure Stem length The length of the longest hairpin structure stem Mfe The mean free energy of the hairpin Loop length The number of bases in the loop of the hairpin Loop GC The GC-content of the loop GC The GC-content of the small hairpin Asymmetric bulges The number of asymmetric bulges and mismatches regarding the stem Symmetric bulges The number of symmetric bulges and mismatches regarding the stem Bulges The number of bulges in the stem Longest bulge The number of non-pairing nucleotides of the longest bulge Mismatches pre-microRNA The number of single mismatches in the hairpin Mismatches microRNAs The number of single mismatches in the mature microRNA region of the hairpin Stability The smallest hairpin harbouring the read is extended 10 times 10bp at both ends. The stability is the frequency the original structure is found in the elongated structures Alternating stability Reports whether a structure disappears when extending the sequence, but reappears again. Triplet-SVM features All features that were proposed by Xue et al. (9) Bindings The number of bindings in the stem divided by the hairpin length Nucleic Acids Research, 2009, Vol. 37, Web Server issue W75 The selection of the features with highest prediction ACKNOWLEDGEMENTS power was performed by means of calculating their infor- The authors thank Dmitrij Frishman for careful reading mation gain. Subsequently, we ranked the features accord- of the manuscript and helpful comments, Philipp Pagel for ing to their discrimination power. The top 10 features used helpful suggestions, Ewa Gubb for revising the English for building the final classifier are: stability, mfe, bindings, style and the Genome Analysis Platform staff at CIC stem length, read count, longest bulge, mismatches bioGUNE for their technical support. microRNA, mismatches pre-microRNA, alternating sta- bility and the Triple-SVM feature ‘A ...’. Supplementary Table S6 shows the 10 best features selected for each FUNDING model used for data included in Table 2. It can be seen The Department of Industry, Tourism and Trade of the that nine features are always the same and just their rank- Government of the Autonomous Community of the ing and the Triplet-SVM feature vary. Basque Country [Etortek Research Annual Programs]; the Innovation Technology Department of the Bizkaia County [2008–2009 institutional support to technological Pre-processing platforms]; Junta de Andalucia [P07FQM3163 to MH]. In order to check the reads for putative new microRNAs Funding for open access charge: [Etortek IE08-228]. we perform a pre-processing of the data which contains the following steps: (i) all reads which overlap in the Conflict of interest statement. None declared. genome are clustered together. (ii) Due to erroneous reads, dicer products (mature, mature-star and loop) REFERENCES could be grouped together such that they appear as non- microRNA products (for example producing a long clus- 1. Kim,V.N. and Nam,J.W. (2006) Genomics of microRNA. Trends ter which overlaps the loop of the precursor). To avoid Genet., 22, 165–173. such a situation, we walk along the cluster sequences and 2. Lagos-Quintana,M., Rauhut,R., Lendeckel,W. and Tuschl,T. (2001) Identification of novel genes coding for small expressed RNAs. test whether the start of the current read overlaps less than Science, 294, 853–858. 3 nt with the end positions of previous reads. In that case 3. Ouellet,D.L., Perron,M.P., Gobeil,L.A., Plante,P. and Provost,P. the cluster is split at the current read start position. (2006) MicroRNAs in gene regulation: when the smallest governs it Clusters now contain a non-dicer product, the mature or all. J. Biomed. Biotech., 2006, 69616. 4. Bagasra,O. and Prilliman,K.R. (2004) RNA interference: the the mature-star, but not more than one theoretical pro- molecular immune system. J. Mol. Histol., 35, 545–553. duct. (iii) Clusters of more than 25 bp length are discarded. 5. Lu,J., Getz,G., Miska,E.A., Alvarez-Saavedra,E., Lamb,J., Peck,D., (iv) Since the microRNA can be located either on the 5 Sweet-Cordero,A., Ebert,B.L., Mak,R.H., Ferrando,A.A. et al. arm or the 3 arm of the hairpin, we extract the cluster (2005) MicroRNA expression profiles classify human cancers. Nature, 435, 834–838. sequence twice, with 60 bp upstream and 10 bp down- 6. Landgraf,P., Rusu,M., Sheridan,R., Sewer,A., Iovino,N., stream flanking areas and vice versa. For both sequences Aravin,A., Pfeffer,S., Rice,A., Kamphorst,A.O., Landthaler,M. the secondary structure is predicted via RNAfold, but et al. (2007) A mammalian microRNA expression atlas based on only the energetically favourable is retained. (v) Non-hair- small RNA library sequencing. Cell, 129, 1401–1414. 7. Friedla¨ nder,M.R., Chen,W., Adamidi,C., Maaskola,J., pin structures are discarded. (vi) Structures where the clus- Einspanier,R., Knespel,S. and Rajewsky,N. (2008) Discovering ter sequence is not fully included or spans the loop and a microRNAs from deep sequencing data using miRDeep. Nat. part of the stem cannot be dicer products are consequently Biotech., 26, 407–415. discarded. Finally, since our analysis showed that virtually 8. Berezikov,E., Guryev,V., van de Belt,J., Wienholds,E., Plasterk,R.H. and Cuppen,E. (2005) Phylogenetic shadowing and all known microRNAs show more than 14 bindings in the computational identification of human microRNA genes. Cell, 120, microRNA:microRNA-star duplex, we considered this as 21–24. a mandatory requirement. Having applied the pre-proces- 9. Xue,C., Li,F., He,T., Liu,G.P., Li,Y. and Zhang,X. (2005) sing step to the three experimental data sets, we receive Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC 6967 candidate precursors for rat, 12 233 for worm and Bioinformatics, 6, 310. 43 905 for human. 10. Jiang,P., Wu,H., Wang,W., Ma,W., Sun,X. and Lu,Z. (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Post-processing Nucleic Acids Res., 35, W339–W344. 11. Nam,J.W., Kim,J., Kim,S.K. and Zhang,B.T. (2006) ProMiR II: a After classification of the deep-sequencing data in form of web server for the probabilistic prediction of clustered, nonclus- the clusters created in the pre-processing step, clusters tered, conserved and nonconserved microRNAs. Nucleic Acids Res., 34, W455–W458. containing the mature and mature-star microRNA are 12. Bentwich,I., Avniel,A., Karov,Y., Aharonov,R., Gilad,S., Barad,O., merged such that one cluster represents one microRNA Barzilai,A., Einat,P., Einav,U., Meiri,E. et al. (2005) Identification precursor. of hundreds of conserved and nonconserved human microRNAs. Nat. Genetics, 37, 766–770. 13. Griffiths-Jones,S. (2006) miRBase: the microRNA sequence data- base. Methods Mol. Biol., 342, 129–138. 14. Hackenberg,M. and Matthiesen,R. (2008) Annotation-Modules: a SUPPLEMENTARY DATA tool for finding significant combinations of multisource annotations Supplementary Data are available at NAR Online. for gene lists. Bioinformatics, 24, 1386–1393. W76 Nucleic Acids Research, 2009, Vol. 37, Web Server issue 15. Berezikov,E., van Tetering,G., Verheul,M., van de Belt,J., van 19. Jurka,J., Kapitonov,V.V., Pavlicek,A., Klonowski,P., Kohany,O. Laake,L., Vos,J., Verloop,R., van de Wetering,M., Guryev,V., and Walichiewicz,J. (2005) Repbase update, a database of Takada,S. et al. (2006) Many novel mammalian microRNA candi- eukaryotic repetitive elements. Cytogenetic Genome Res., 110, dates identified by extensive cloning and RAKE analysis. Genome 462–467. Res., 16, 1289–1298. 20. Hofacker,I.L. (2003) Vienna RNA secondary structure server. 16. Breiman,L. (2001) Random forests. Machine Learning, 45, 28. Nucleic Acids Res., 31, 3429–3431. 17. Giardine,B., Riemer,C., Hardison,R.C., Burhans,R., Elnitski,L., 21. Enright,A.J., John,B., Gaul,U., Tuschl,T., Sander,C. and Shah,P., Zhang,Y., Blankenberg,D., Albert,I., Taylor,J. et al. (2005) Marks,D.S. (2003) MicroRNA targets in Drosophila. Genome Biol., Galaxy: a platform for interactive large-scale genome analysis. 5, R1. Genome Res., 15, 1451–1455. 22. Lewis,B.P., Burge,C.B. and Bartel,D.P. (2005) Conserved seed 18. Gardner,P.P., Daub,J., Tate,J.G., Nawrocki,E.P., Kolbe,D.L., pairing, often flanked by adenosines, indicates that thousands of Lindgreen,S., Wilkinson,A.C., Finn,R.D., Griffiths-Jones,S., human genes are microRNA targets. Cell, 120, 15–20. Eddy,S.R. et al. (2009) Rfam: updates to the RNA families data- 23. Witten,I. and E,F. (2005) Data Mining: practical machine learning base. Nucleic Acids Res., 37, D136–D140. tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press

miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments

Loading next page...
 
/lp/oxford-university-press/miranalyzer-a-microrna-detection-and-analysis-tool-for-next-generation-Vb72CFPeKJ

References (29)

Publisher
Oxford University Press
Copyright
© 2009 The Author(s)
ISSN
0305-1048
eISSN
1362-4962
DOI
10.1093/nar/gkp347
pmid
19433510
Publisher site
See Article on Publisher Site

Abstract

W68–W76 Nucleic Acids Research, 2009, Vol. 37, Web Server issue Published online 11 May 2009 doi:10.1093/nar/gkp347 miRanalyzer: a microRNA detection and analysis tool for next-generation sequencing experiments 1 2 3,4 Michael Hackenberg , Martin Sturm , David Langenberger , 5 1, ´ ´ Juan Manuel Falcon-Perez and Ana M. Aransay * Functional Genomics Unit, CIC bioGUNE, CIBERehd, Technology Park of Bizkaia, 48160 Derio, Bizkaia, Spain, Institute for Bioinformatics and Systems Biology, German Research Center for Environmental Health, Ingolsta¨ dter Landstrasse 1, D-85764 Neuherberg, Department of Genome-Oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universitat Mu¨ nchen, 85350 Freising, Bioinformatics Group, Department of Computer Science, University of Leipzig, Haertelstr. 16-18, D-04107 Leipzig, Germany and Metabolomics Unit, CIC bioGUNE, CIBERehd, Technology Park of Bizkaia, 48160 Derio, Bizkaia, Spain Received February 28, 2009; Revised April 13, 2009; Accepted April 22, 2009 INTRODUCTION ABSTRACT The recent years witnessed a profound change in our Next-generation sequencing allows now the understanding of the regulation of gene expression. sequencing of small RNA molecules and the estima- Small non-coding RNA especially came into focus as it tion of their expression levels. Consequently, there became clear that they are key players in many cellular will be a high demand of bioinformatics tools processes by post-transcriptionally regulating gene expres- to cope with the several gigabytes of sequence sion via either degradation, translational repression, or data generated in each single deep-sequencing both (1,2). MicroRNAs, belonging to the family of small experiment. Given this scene, we developed non-coding RNAs, are endogenous in many animal and miRanalyzer, a web server tool for the analysis of plant genomes and are now recognized to be one of deep-sequencing experiments for small RNAs. The the major regulatory gene families in eukaryotic cells. web server tool requires a simple input file contain- They are believed to regulate the expression of around ing a list of unique reads and its copy numbers one third of all genes in the human genome, involved in (expression levels). Using these data, miRanalyzer many fundamental processes like metabolism, develop- (i) detects all known microRNA sequences anno- ment and regulation of the nervous and immune systems tated in miRBase, (ii) finds all perfect matches (3,4). Furthermore, it has been reported that some microRNAs are actively involved in the development of against other libraries of transcribed sequences pathologies like cancer (5). and (iii) predicts new microRNAs. The prediction of The traditional experimental approach to measure new microRNAs is an especially important point the expression levels of microRNAs involves cloning as there are many species with very few known and Sanger sequencing. This is an expensive and time- microRNAs. Therefore, we implemented a highly consuming procedure, and as a consequence, relatively accurate machine learning algorithm for the predic- little expression data is currently available [see (6) for a tion of new microRNAs that reaches AUC values of microRNA expression atlas]. Moreover, the huge range 97.9% and recall values of up to 75% on unseen of microRNA expression from tens of thousands to just data. The web tool summarizes all the described few molecules per cell complicates the detection steps in a single output page, which provides a of microRNAs expressed at low copy numbers. Hence comprehensive overview of the analysis, adding many undetected microRNA may exist even in well- links to more detailed output pages for each analy- explored species. Recently, microRNA expression profil- sis module. miRanalyzer is available at http://web. ing panels became available for measuring expression bioinformatics.cicbiogune.es/microRNA/. levels by means of hybridization. These panels allow a *To whom correspondence should be addressed. Tel: +34 944 061 325; Fax: +34 944 061 324; Email: [email protected] The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. 2009 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2009, Vol. 37, Web Server issue W69 high-throughput detection of microRNA expression. miRANALYZER However, they do not allow the detection of new Input file description microRNAs. A usual next-generation sequencing experiment produces Next generation sequencing platforms like Genome TM up to several gigabytes of output corresponding to several Analyzer (Illumina Inc.) or Genome Sequencer FLX TM hundred million base pairs. That is by far too many data (454 Life Science and Roche Applied Science) became to send over the web to analyze it using a web server tool. recently available for the sequencing of small RNA mole- However, some reads (tags) obtained in microRNA cules, which allows both the detection of expression levels sequencing experiments can be found multiple times in and new microRNA sequences at high speed and sensitiv- the output. The number of copies detected for a unique ity and low cost. However, each sequencing experiment read is proportional to its expression level. Given this produces up to 3 Gbp of sequence data, whose analysis redundancy, the only information needed for the analysis represents an important bioinformatics challenge. of microRNAs are the sequences of the reads and the Given the importance of microRNAs in the regulation number of times each unique read was encountered in of gene expression, in the coming years many deep- the experiment. This reduces the size of the input file dras- sequencing experiments will be carried out to detect and tically to a few megabytes, which is an acceptable size for a measure their expression. Therefore, user friendly tools web server tool. are required for the processing of the enormous amount The tool accepts two different input formats (see http:// of data that will be generated. To our knowledge, so web.bioinformatics.cicbiogune.es/microRNA/ far there is only one standalone tool available for the manual.html): complete analysis of deep sequencing microRNA data: miRDeep published by Friedla¨ nder et al. (7). Specific (i) a tab separated file with the read sequences and its software for SOLiD data does exist that allows the detec- counts (number of times each read has been tion of known microRNAs but not the prediction of obtained in the experiment) and new microRNAs (http://solidsoftwaretools.com/gf/pro (ii) a multifasta file with the copy number of the unique ject/rna2map). reads (read count) as the description in the header On the other hand, the prediction of microRNA genes (e.g. >ID ‘count’). has been extensively employed over the past years and several distinct approaches have been developed. Some Along with this web-tool, we supply a Perl script, which of the methods used in the purely computational detection counts the reads of a Genome Analyzer (Illumina Inc.) approaches were, for example, conservation of certain experiment, producing the tab separated input format. regions—phylogenetic shadowing (8), different machine The script allows averaging of several lines, filtering for learning methods like support vector machines using low quality reads and a simple analysis of differential structure-sequence features (9), random forest models expression (log ratios between different lines). A more (10) or probabilistic co-learning models (11). Bentwich detailed description of the Perl script can be found on et al. (12) used further features like the stability of the hair- the tutorial page (http://web.bioinformatics.cicbiogune. pin together with an experimental validation. The main es/microRNA/manual.html). drawbacks of these approaches are that they are either limited to conserved microRNAs or that they tend to Input parameters have a high rate of false positive predictions. However, Apart from the file with the read sequences, several other new sequencing experiments open new possibilities in the input options are available as summarized in Table 1. The prediction of microRNAs, allowing the generation of pre- parameters are explained in more detail in the correspond- viously unavailable characteristics like, for example, the ing sections of the manuscript and on the tutorial page traces left by dicer processing. (http://web.bioinformatics.cicbiogune.es/microRNA/ Consequently, we have developed miRanalyzer, a web manual.html). server tool which implements all necessary methods for a comprehensive analysis of deep-sequencing experiments of small RNA molecules. It detects known microRNAs WORKFLOW annotated in miRBase and matches in other transcribed miRanalyzer follows three internal analysis steps sequences (RNA, RFam and RepBase). Furthermore, (Supplementary Figure S1): (i) detection of known miRanalyzer implements a highly accurate machine learn- microRNAs, (ii) mapping against libraries of transcribed ing algorithm to predict new microRNAs (area under sequences (mRNA, ncRNA, etc.) and (iii) prediction of the curve—AUC—value of 97.9%). The algorithm is new microRNAs. After each of these three steps, the based on the random forest classifier and was trained detected reads are removed from the input data following on experimental data. This high accuracy is important the options set by the user (Table 1). for the identification of novel microRNAs, a process which usually results in high false positive rates. The Detection of known microRNAs tool also includes a Perl script for the proper generation of the input file using the Genome Analyzer (Illumina In many of the microRNA experiments, the main purpose Inc.) pipeline results. Currently, miRanalyzer works for will be the detection of the expression levels of known seven frequently used model species (human, mouse, rat, microRNAs (or frequently the differential expression of fruit-fly, round-worm, zebrafish and dog). microRNAs between two samples). Therefore, as the W70 Nucleic Acids Research, 2009, Vol. 37, Web Server issue Table 1. Summary of miRanalyzer input options Input option Description Species The species from which the input reads have been obtained Number of mismatches For the detection of known microRNAs the user can allow matches with up to two mismatches Target gene method Selection of the microRNA target gene prediction method for the ontological analysis. Posterior probability threshold The threshold for the posterior probability calculated by the classification model. Considering adapter sequences The read sequences frequently contain adapter sequences at its 3 end. In this case, the user can take it into account by aligning also sub-sequences of a given minimum length (Data and methods section). Detect just new microRNAs This option skips the detection of known microRNAs. Remove all mRNA matches This option removes all reads which have been perfectly aligned with mRNA sequences. If this option is not set, the program will remove all reads which match in more than five mRNAs as we observed that these reads are frequently poly-A like sequences. Remove RFam/RepBase. These options remove all reads which have mapped to RFam or RepBase. Just predict conserved microRNAs This option limits the prediction of new microRNAs to regions which overlap with a Phylogenetically Conserved Element (PhastCons). first analysis step, miRanalyzer detects the reads which and Methods’ section) and pre-calculated ontological anal- correspond to known microRNAs. To carry out the detec- yses are made available. In the case of ambiguous matches tion of known microRNAs, we used the miRBase repos- where the set of target genes is made up of a combination itory (13) which offers mature (the mature sequences of of various microRNAs, a link to Annotation-Module (14) known microRNAs), mature-star (the sequence which is offered to launch the ontological analysis with the pairs with the mature microRNA in the pre-microRNA obtained gene list. secondary structure) and precursor microRNA sequences Mapping against transcribed sequences (sequence of the hairpin). For some of the microRNA precursors, it is unclear which of the two sequences After detecting reads that correspond to known (mature or mature-star) is biologically functional. In the microRNAs, miRanalyzer maps the remaining reads to case where both sequences are found to be expressed and databases of transcribed sequences as mRNA, non- the predominant product can be clearly detected, the coding RNA (RFam) and (retro)-transposons. Only minor product is labeled with a (mature-star). Apart perfect matches are considered in this analysis. These from the known mature-star sequences we generated a alignments are performed to achieve several aims: library with all other theoretically possible mature-star sequences. This also allows the detection of functional  First, the mapping against the transcriptome should mature-star microRNAs whose expression has not been not yield any matches except for exonic microRNAs observed previously. (1). Therefore, the number of matches can be viewed Many microRNA sequences, especially those belonging as a sample quality parameter (i.e. contamination of to the same microRNA family, exhibit a high degree of the RNA sample with degradation products and poly- sequence similarity. Given that sometimes the read might A tails). be rather short (16 bp), non-unique matches might occur.  Second, the mapping to RFam (and other libraries of A non-unique match exists if a read maps with the same ncRNA) and RepBase has two goals: (i) it might be quality (same number of mismatches) at different posi- interesting to see which other known small ncRNAs tions or to more than one sequence in the library. Often, are in the sample and (ii) the removal of these reads alignment programs such as ELAND (included in will lower the number of false positives in the predic- Illumina Inc. pipeline) do not report these ambiguous tion of new microRNAs (small ncRNA might be con- matches. However, this might result in a loss of important fused with microRNAs). The removal of those information. Therefore, miRanalyzer reports these ambig- sequences is optional (Table 1). uous matches, stating all microRNAs where matches have  Third, we also used the genomic annotation of repeats been found. Note that the groups of microRNAs that and transposons derived by RepeatMasker (http:// have been detected by the same read will normally www.repeatmasker.org). After aligning all reads with belong to the same family. the genome, miRanalyzer checks if the read coordi- The exact order of mapping against known microRNAs nates overlap with those of the RepeatMasker annota- is: mature, mature-star, unknown mature-star and precur- tion. In this way we can detect reads that overlap with sors/hairpin. Both unique matches (a read matches just to ‘degraded’ transposons whose expression might indi- one known microRNA) and ambiguous matches (a read cate ‘domestication’ (acquired function). matches several microRNAs with the same quality) are detected and removed from the input at each step. The Predicting new microRNAs removal is important as otherwise the reads would be detected again in the precursor sequences (hairpins). The detection of new, previously unreported microRNAs After known microRNAs detection, the corresponding is a very important analysis step in miRanalyzer tool as (i) target genes (those genes which are predicted to be regu- a controversy exists over the real number of microRNAs lated by the detected microRNA) are extracted (see ‘Data (15) and therefore it is important to mine sequencing Nucleic Acids Research, 2009, Vol. 37, Web Server issue W71 experiments for new previously undetected microRNAs while the cross-validated results are high, the recall is and (ii) for many species there are none or just a few moderate predicting on unseen data. We highlighted microRNAs known. Consequently, the analysis of (yellow) the worst prediction values on the different test sequencing experiments in these species relies almost com- sets, which are 0.66 (cel/rno), 0.48 (rno/cel) and 0.64 (rno/ pletely on the prediction of new microRNAs. Therefore, hsa). To check whether we can improve prediction power we set up a machine learning approach based on the for those in particular, we merged two datasets and eval- random forest method (16) with a broad range of features. uated against the third set (values highlighted in green). It To train only on the most relevant features, we also can be seen that the prediction improved significantly, employed a feature selection approach (see ‘Data and especially for C. elegans. While trained solely on rat or Methods’ section for detail). human and evaluated on worm a recall of only 0.48 and We used three different data sets from human (hsa), rat 0.67, respectively, could be reached. The merged training (rno) and Caenorhabditis elegans (cel, see ‘Data and set, however, achieves a recall of 0.71, suggesting syner- Methods’ section) for building the final prediction getic effects when integrating instances from different spe- model. The results shown in Table 2 suggest that the clas- cies into the training set. To benefit most from this effect, sifier is highly sensitive and specific not only according to we trained the final classifier on all three data sets. Thus a standard 10-fold cross-validation, but also in a cross- we obtain an area under the curve (AUC) value of 97.9% species test on completely unseen test data. The results with a true positive rate of 0.79 and a false positive rate of shown in the upper part of Table 2 depict the outcome 0.007 for the fixed threshold at 0.9. To test for robustness, when learning with one of the species (training set) and we repeated the cross validation on 10 different negative predicting the remaining ones (test data). For evaluation sets, which resulted in a mean AUC value, true positive of prediction power in the same species, we applied a rate and false positive rate of 97.9%, 0.79 and 0.0077 with 10-fold cross validation approach. It can be seen that the standard deviations of 0.001, 0.01 and 0.003, Table 2. The true positive rates (top part) and false positive rates (bottom part) for different classifiers at a posterior probability threshold of 0.9 Training set Test set rno cel hsa rno-cel rno-hsa cel-hsa rno-cel-hsa True positive rate (threshold: 0.9) CV rno 0.74 0.48 0.64 0.66 0.73 0.57 0.65 CV cel 0.66 0.77 0.69 0.80 0.68 0.79 0.76 CV hsa 0.74 0.67 0.77 0.70 0.84 0.81 0.79 CV rno-cel 0.89 0.91 0.75 0.79 0.80 0.82 0.84 CV rno-hsa 0.91 0.71 0.93 0.80 0.78 0.84 0.86 CV cel-hsa 0.74 0.91 0.91 0.83 0.84 0.81 0.86 CV rno-cel-hsa 0.89 0.91 0.90 0.91 0.91 0.92 0.79 False negative rate (threshold: 0.9) CV rno 0.01 0.008 0.009 0.004 0.008 0.001 0.005 CV cel 0.005 0.004 0.003 0.002 0.01 0 0.005 CV hsa 0.005 0.004 0.01 0.01 0.01 0.005 0.005 CV rno-cel 0.02 0.008 0.01 0.009 0.01 0.007 0.01 CV rno-hsa 0.02 0.01 0.01 0.01 0.01 0.01 0.01 CV cel-hsa 0.005 0.004 0.009 0.004 0.01 0.003 0.01 CV rno-cel-hsa 0.01 0.004 0.003 0.01 0.01 0.009 0.007 The superscripted ‘CV’ denotes that this value was achieved in a standard 10-fold cross-validation approach. The highlighted false positive rates correspond to the true positive rates discussed in the text. Figure 1. Histogram of miRanalyzer scores. Known microRNAs are colored in red, all other data are colored in blue. The insert is a close-up for candidates with scores better than 0.65. W72 Nucleic Acids Research, 2009, Vol. 37, Web Server issue respectively. Note that, the human and C. elegans sets before in the experiment (for example if matched against where also used by Friedlander et al. who reported a RepBase, etc.). recall of 89% on C. elegans and 72% on human. Finally, the last box gives a summary of the filtered and unmapped reads. Table 2 shows that our approach reaches a recall of 75% on human when trained on rat/C. elegans (predicting on unseen data) and 91% on C. elegans using the final prediction model (predicting on previously seen data). CONCLUSIONS Figure 1 shows a cross-species evaluation of miRanalyzer is a web server tool for the integral analysis miRanalyzer trained on human and C. elegans and eval- of next generation sequencing data of small RNA mole- uated on rat. Obviously, most of the data have very low cules. It allows both the detection of known microRNAs scores (the posterior probability assigned by the classifica- and the prediction of new microRNAs. For the prediction tion model to each instance) assigned. We build a close-up of new microRNAs a new sensitive machine learning algo- for the range between 0.65 and 1 to better visualize the rithm was developed which reaches an AUC of 97.9% in high scoring predictions. It can be seen that the known rat our tests. Furthermore, the tool detects matches of the microRNAs are strongly accumulated towards scores of 1, reads against other libraries of transcribed sequences demonstrating the high predictive power of our approach such as mRNA, RFam (RNA) and RepBase and the good ability to generalize. Note that the classifier (Transposons). Currently, the tool works for seven spe- has never seen data from rat before. See also cies, but can easily be extended upon request. Supplementary Figures S3 and S4 for graphical represen- tation of other quality parameters and Receiver Operating Characteristics of different classifiers discussed in this DATA AND METHODS section. Sequence data miRanalyzer uses the newest genome assembly of each A WORKING EXAMPLE available species which were downloaded from the As a working example we used data derived from an UCSC Genome Browser (http://hgdownload.cse.ucsc. experiment carried out in our laboratory with rat hepato- edu/downloads.html): Homo sapiens (hg18, NCBI 36.1), cytes following standard protocols for smallRNA sample Mus musculus (mm8, NCBI 36), Rattus norvegicus (rn4, version 3.4), Drosophila melanogaster (dm3, BDGP preparation and deep-sequencing (http://www.illumina. Release 5), Caenorhabditis elegans (ce6, WUSTL School com/). Figure 2 shows the summary output page of of Medicine GSC and Sanger Institute version WS190), miRanalyzer run on these data. The page is made up of five boxes that reveal the intrinsic workflow of Canis familiaris (canFam2, v2.0) and Danio rerio miRanalyzer. (danRer5). The first box shows the current state of the process The mRNA sequence data were derived from different (executing, pending, etc.) on the left side and depicts a databases: H. sapiens, M. musculus, R. norvegicus and short summary of the process (input data and options) D. rerio from NCBI RefSeq (ftp://ftp.ncbi.nih.gov/ on the right side. refseq/), D. melanogaster from FlyBase (http://flybase. The second box shows the summary of the analysis of org/) and C. elegans from WormBase (http://www.worm known microRNAs. Each column corresponds to the base.org/). The mRNA sequences for C. familiaris were mapping against a different set of sequences (mature, extracted from the genomic sequence using the Galaxy mature-star, etc.). The last row provides a link to detailed platform (17). output for each of the columns. For example, the analysis In addition, mature microRNA sequences were derived of unknown mature-star sequences shows that miR- from mirBase version 12.0 (http://microrna.sanger.ac.uk/ 423-star is moderately expressed (744 copies) while the sequences/); RNA sequences included in RFam version sequence which is annotated in miRBase (mature miR- 9.0 (18) were downloaded from http://rfam.sanger.ac.uk/ 423) has less than 10 copies (Supplementary Figure S2). ; and RepBase version 10.10 (19) were obtained from The third box summarizes the matching of reads to sev- http://www.girinst.org/. Annotations and genomic coordi- eral sets of transcribed sequences. For example the frac- nates of RepeatMasker and PhastCons elements where tion of reads mapped to the transcriptome may give a downloaded from the UCSC table browser (http:// good estimate on the sample quality. It can be seen that genome.ucsc.edu/cgi-bin/hgTables?command=start). We used deep–sequencing data from three different around 8.3% of all reads in this sample originate from experiments: (i) the combined C. elegans data (accession mRNA but this corresponds just to 3% of transcription amount (number of mRNA reads/total number of reads). no. GSE6282 and GSE5990 from GEO database at The fourth box shows the summary of the detection of NCBI), which have been used also in (7) with a total of new microRNAs. In addition, a link is given for further 205 575 unique reads, (ii) data from human HeLa cells (7) information on each read cluster that has been predicted with accession no. GSE10829 and 319 939 unique reads to be a novel microRNA (Supplementary Figure S5). A and (iii) data from rat hepatocytes generated in our lab, link is also provided to a detailed output page with infor- available on our website (http://web.bioinformatics.cic mation on the chromosomal coordinates, the long hairpin biogune.es/microRNA/defaultReads.txt) with 22 086 structure and a verification if the reads have been detected unique reads. Nucleic Acids Research, 2009, Vol. 37, Web Server issue W73 Figure 2. The summary page of miRanalyzer: five boxes are shown which correspond to summary & state of the process, analysis of known microRNA, matches against transcribed sequences, and detection of new microRNAs and summary of unmatched sequences. W74 Nucleic Acids Research, 2009, Vol. 37, Web Server issue Generating ‘unknown mature-star’ sequences site predictions by miRanda software (21) and TargetScan (22). We generated the unknown star sequences by means of the mirBase precursor and mature sequences. First, we Secondary structure prediction calculate the secondary structures for all hairpins using RNAfold (20) with parameters ‘-noLP’. Then, we detect For predicting the secondary structure and its mini- the coordinates of the mature microRNAs within the mum free energy (MFE) we utilized the Vienna RNA pre-microRNA hairpin. By means of these coordinates, package (20). the information of the secondary structure and the char- acteristic ‘2-nt 3 overhang’ caused by Dicer, we extracted The machine learning approach the corresponding sequence pairing with the mature To detect new microRNAs, we set up a machine learning microRNA. approach based on the WEKA (23) implementation of the random forest learning scheme (16) with the number of Read alignment trees set to 100. Note that, the random forest algorithm was applied by Jiang et al. (10) using basically the triplet Read sequences often contain adapter sequences (see stan- structure features introduced by Xue et al. (9). However, dard protocol of small RNA sample preparation at http:// the difference of our approach consists of using a negative www.illumina.com/) at its 3 ends. Therefore, miRanalyzer set derived directly from the experimental data which has two alignment options depending on whether the (i) assures that the sequences are transcribed and reads have adapter sequences or not. In general, the tool (ii) allows the generation of new and previously unused generate a prefix tree of all input reads and subsequently features that seem to be more discriminative than the trip- walk in a single run over the genome to detect the reads. let structure features (see below). By default, miRanalyzer assumes the existence of adapter sequences and therefore, first detects matches of Training and test sets a subsequence of 16 bp starting at the 5 end of the read. When miRanalyzer detects an initial match, it expands the For the machine learning approach we created three data subsequence as long as a perfect match is given. Finally, sets, one from each of the three species: human, C. elegans only matches of the longest subsequence are retained. and rat. First, we extracted all pre-microRNA candidates Note that, in this approach the adapter sequences from the experimental dataset that could be mapped to a are detected implicitly (the sequence at the 3 end of the known microRNA and labeled them as positive instances. read that does not match to the genome is defined as the Second, we selected an equal amount of pre-microRNA adapter) and therefore, the adapter sequences need not to candidates from the same dataset by random selection be known or supplied by the user. with the known microRNAs removed and labeled them as negative. In total we obtained a dataset of 612 instances in human, 468 instances in worm and 376 instances in rat. Ontological analysis We used a recently published tool, Annotation-Modules Features (14), to pre-calculate the significant annotations of all target gene lists for all microRNAs in the miRBase We created a broad variety of features associated with (12.0). Currently, the user can choose between two dif- nucleotide sequence, structure and energy. Table 3 lists ferent target site prediction methods: miRBase target all the features used in this work. Table 3. Features calculated for the generation of the classifier Feature name Description of the feature Read count Number of reads mapping to the pre-microRNA Length The length of the longest hairpin structure Stem length The length of the longest hairpin structure stem Mfe The mean free energy of the hairpin Loop length The number of bases in the loop of the hairpin Loop GC The GC-content of the loop GC The GC-content of the small hairpin Asymmetric bulges The number of asymmetric bulges and mismatches regarding the stem Symmetric bulges The number of symmetric bulges and mismatches regarding the stem Bulges The number of bulges in the stem Longest bulge The number of non-pairing nucleotides of the longest bulge Mismatches pre-microRNA The number of single mismatches in the hairpin Mismatches microRNAs The number of single mismatches in the mature microRNA region of the hairpin Stability The smallest hairpin harbouring the read is extended 10 times 10bp at both ends. The stability is the frequency the original structure is found in the elongated structures Alternating stability Reports whether a structure disappears when extending the sequence, but reappears again. Triplet-SVM features All features that were proposed by Xue et al. (9) Bindings The number of bindings in the stem divided by the hairpin length Nucleic Acids Research, 2009, Vol. 37, Web Server issue W75 The selection of the features with highest prediction ACKNOWLEDGEMENTS power was performed by means of calculating their infor- The authors thank Dmitrij Frishman for careful reading mation gain. Subsequently, we ranked the features accord- of the manuscript and helpful comments, Philipp Pagel for ing to their discrimination power. The top 10 features used helpful suggestions, Ewa Gubb for revising the English for building the final classifier are: stability, mfe, bindings, style and the Genome Analysis Platform staff at CIC stem length, read count, longest bulge, mismatches bioGUNE for their technical support. microRNA, mismatches pre-microRNA, alternating sta- bility and the Triple-SVM feature ‘A ...’. Supplementary Table S6 shows the 10 best features selected for each FUNDING model used for data included in Table 2. It can be seen The Department of Industry, Tourism and Trade of the that nine features are always the same and just their rank- Government of the Autonomous Community of the ing and the Triplet-SVM feature vary. Basque Country [Etortek Research Annual Programs]; the Innovation Technology Department of the Bizkaia County [2008–2009 institutional support to technological Pre-processing platforms]; Junta de Andalucia [P07FQM3163 to MH]. In order to check the reads for putative new microRNAs Funding for open access charge: [Etortek IE08-228]. we perform a pre-processing of the data which contains the following steps: (i) all reads which overlap in the Conflict of interest statement. None declared. genome are clustered together. (ii) Due to erroneous reads, dicer products (mature, mature-star and loop) REFERENCES could be grouped together such that they appear as non- microRNA products (for example producing a long clus- 1. Kim,V.N. and Nam,J.W. (2006) Genomics of microRNA. Trends ter which overlaps the loop of the precursor). To avoid Genet., 22, 165–173. such a situation, we walk along the cluster sequences and 2. Lagos-Quintana,M., Rauhut,R., Lendeckel,W. and Tuschl,T. (2001) Identification of novel genes coding for small expressed RNAs. test whether the start of the current read overlaps less than Science, 294, 853–858. 3 nt with the end positions of previous reads. In that case 3. Ouellet,D.L., Perron,M.P., Gobeil,L.A., Plante,P. and Provost,P. the cluster is split at the current read start position. (2006) MicroRNAs in gene regulation: when the smallest governs it Clusters now contain a non-dicer product, the mature or all. J. Biomed. Biotech., 2006, 69616. 4. Bagasra,O. and Prilliman,K.R. (2004) RNA interference: the the mature-star, but not more than one theoretical pro- molecular immune system. J. Mol. Histol., 35, 545–553. duct. (iii) Clusters of more than 25 bp length are discarded. 5. Lu,J., Getz,G., Miska,E.A., Alvarez-Saavedra,E., Lamb,J., Peck,D., (iv) Since the microRNA can be located either on the 5 Sweet-Cordero,A., Ebert,B.L., Mak,R.H., Ferrando,A.A. et al. arm or the 3 arm of the hairpin, we extract the cluster (2005) MicroRNA expression profiles classify human cancers. Nature, 435, 834–838. sequence twice, with 60 bp upstream and 10 bp down- 6. Landgraf,P., Rusu,M., Sheridan,R., Sewer,A., Iovino,N., stream flanking areas and vice versa. For both sequences Aravin,A., Pfeffer,S., Rice,A., Kamphorst,A.O., Landthaler,M. the secondary structure is predicted via RNAfold, but et al. (2007) A mammalian microRNA expression atlas based on only the energetically favourable is retained. (v) Non-hair- small RNA library sequencing. Cell, 129, 1401–1414. 7. Friedla¨ nder,M.R., Chen,W., Adamidi,C., Maaskola,J., pin structures are discarded. (vi) Structures where the clus- Einspanier,R., Knespel,S. and Rajewsky,N. (2008) Discovering ter sequence is not fully included or spans the loop and a microRNAs from deep sequencing data using miRDeep. Nat. part of the stem cannot be dicer products are consequently Biotech., 26, 407–415. discarded. Finally, since our analysis showed that virtually 8. Berezikov,E., Guryev,V., van de Belt,J., Wienholds,E., Plasterk,R.H. and Cuppen,E. (2005) Phylogenetic shadowing and all known microRNAs show more than 14 bindings in the computational identification of human microRNA genes. Cell, 120, microRNA:microRNA-star duplex, we considered this as 21–24. a mandatory requirement. Having applied the pre-proces- 9. Xue,C., Li,F., He,T., Liu,G.P., Li,Y. and Zhang,X. (2005) sing step to the three experimental data sets, we receive Classification of real and pseudo microRNA precursors using local structure-sequence features and support vector machine. BMC 6967 candidate precursors for rat, 12 233 for worm and Bioinformatics, 6, 310. 43 905 for human. 10. Jiang,P., Wu,H., Wang,W., Ma,W., Sun,X. and Lu,Z. (2007) MiPred: classification of real and pseudo microRNA precursors using random forest prediction model with combined features. Post-processing Nucleic Acids Res., 35, W339–W344. 11. Nam,J.W., Kim,J., Kim,S.K. and Zhang,B.T. (2006) ProMiR II: a After classification of the deep-sequencing data in form of web server for the probabilistic prediction of clustered, nonclus- the clusters created in the pre-processing step, clusters tered, conserved and nonconserved microRNAs. Nucleic Acids Res., 34, W455–W458. containing the mature and mature-star microRNA are 12. Bentwich,I., Avniel,A., Karov,Y., Aharonov,R., Gilad,S., Barad,O., merged such that one cluster represents one microRNA Barzilai,A., Einat,P., Einav,U., Meiri,E. et al. (2005) Identification precursor. of hundreds of conserved and nonconserved human microRNAs. Nat. Genetics, 37, 766–770. 13. Griffiths-Jones,S. (2006) miRBase: the microRNA sequence data- base. Methods Mol. Biol., 342, 129–138. 14. Hackenberg,M. and Matthiesen,R. (2008) Annotation-Modules: a SUPPLEMENTARY DATA tool for finding significant combinations of multisource annotations Supplementary Data are available at NAR Online. for gene lists. Bioinformatics, 24, 1386–1393. W76 Nucleic Acids Research, 2009, Vol. 37, Web Server issue 15. Berezikov,E., van Tetering,G., Verheul,M., van de Belt,J., van 19. Jurka,J., Kapitonov,V.V., Pavlicek,A., Klonowski,P., Kohany,O. Laake,L., Vos,J., Verloop,R., van de Wetering,M., Guryev,V., and Walichiewicz,J. (2005) Repbase update, a database of Takada,S. et al. (2006) Many novel mammalian microRNA candi- eukaryotic repetitive elements. Cytogenetic Genome Res., 110, dates identified by extensive cloning and RAKE analysis. Genome 462–467. Res., 16, 1289–1298. 20. Hofacker,I.L. (2003) Vienna RNA secondary structure server. 16. Breiman,L. (2001) Random forests. Machine Learning, 45, 28. Nucleic Acids Res., 31, 3429–3431. 17. Giardine,B., Riemer,C., Hardison,R.C., Burhans,R., Elnitski,L., 21. Enright,A.J., John,B., Gaul,U., Tuschl,T., Sander,C. and Shah,P., Zhang,Y., Blankenberg,D., Albert,I., Taylor,J. et al. (2005) Marks,D.S. (2003) MicroRNA targets in Drosophila. Genome Biol., Galaxy: a platform for interactive large-scale genome analysis. 5, R1. Genome Res., 15, 1451–1455. 22. Lewis,B.P., Burge,C.B. and Bartel,D.P. (2005) Conserved seed 18. Gardner,P.P., Daub,J., Tate,J.G., Nawrocki,E.P., Kolbe,D.L., pairing, often flanked by adenosines, indicates that thousands of Lindgreen,S., Wilkinson,A.C., Finn,R.D., Griffiths-Jones,S., human genes are microRNA targets. Cell, 120, 15–20. Eddy,S.R. et al. (2009) Rfam: updates to the RNA families data- 23. Witten,I. and E,F. (2005) Data Mining: practical machine learning base. Nucleic Acids Res., 37, D136–D140. tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco.

Journal

Nucleic Acids ResearchOxford University Press

Published: Jul 1, 2009

There are no references for this article.