STAR: ultrafast universal RNA-seq aligner

Alexander Dobin; Carrie A. Davis; Felix Schlesinger; Jorg Drenkow; Chris Zaleski; Sonali Jha; Philippe Batut; Mark Chaisson; Thomas R. Gingeras

doi:10.1093/bioinformatics/bts635

STAR: ultrafast universal RNA-seq aligner

Dobin, Alexander; Davis, Carrie A.; Schlesinger, Felix; Drenkow, Jorg; Zaleski, Chris; Jha, Sonali; Batut, Philippe; Chaisson, Mark; Gingeras, Thomas R. 2012-10-25 00:00:00 Vol. 29 no. 1 2013, pages 15–21 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bts635 Sequence analysis Advance Access publication October 25, 2012 1, 1 1 1 1 Alexander Dobin , Carrie A. Davis , Felix Schlesinger ,Jorg Drenkow , Chris Zaleski , 1 1 2 1 Sonali Jha ,Philippe Batut ,Mark Chaisson and Thomas R. Gingeras 1 2 Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA and Pacific Biosciences, Menlo Park, CA, USA Associate Editor: Inanc Birol unique challenges to detection and characterization of spliced ABSTRACT transcripts. Two key tasks make these analyses computationally Motivation: Accurate alignment of high-throughput RNA-seq data is a intensive. The first task is an accurate alignment of reads that challenging and yet unsolved problem because of the non-contiguous contain mismatches, insertions and deletions caused by genomic transcript structure, relatively short read lengths and constantly variations and sequencing errors. The second task involves map- increasing throughput of the sequencing technologies. Currently avail- ping sequences derived from non-contiguous genomic regions able RNA-seq aligners suffer from high mapping error rates, low map- comprising spliced sequence modules that are joined together to ping speed, read length limitation and mapping biases. form spliced RNAs. Although the first task is shared with DNA Results: To align our large (480 billon reads) ENCODE Transcriptome resequencing efforts, the second task is specific and crucial to the RNA-seq dataset, we developed the Spliced Transcripts Alignment to RNA-seq, as it provides the connectivity information needed to a Reference (STAR) software based on a previously undescribed reconstruct the full extent of spliced RNA molecules. These align- RNA-seq alignment algorithm that uses sequential maximum map- ment challenges are further compounded by the presence of mul- pable seed search in uncompressed suffix arrays followed by seed tiple copies of identical or related genomic sequences that are clustering and stitching procedure. STAR outperforms other aligners themselves transcribed, making precise mapping difficult. by a factor of450 in mapping speed, aligning to the human genome Various sequence alignment algorithms have been recently 550 million 2 76 bp paired-end reads per hour on a modest 12-core developed to tackle these challenges (Au et al., 2010; De Bona, server, while at the same time improving alignment sensitivity and et al., 2008; Grant et al., 2011; Han et al., 2011; Trapnell et al., precision. In addition to unbiased de novo detection of canonical junc- 2009; Wang et al., 2010; Wu and Nacu, 2010; Zhang et al., 2012). tions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA se- However, application of these algorithms invokes compromises quences. Using Roche 454 sequencing of reverse transcription poly- in the areas of mapping accuracy (sensitivity and precision) and merase chain reaction amplicons, we experimentally validated 1960 computational resources (run time and disk space) (Grant et al., novel intergenic splice junctions with an 80–90% success rate, corro- 2011). With current advances in sequencing technologies, the borating the high precision of the STAR mapping strategy. computational component is increasingly becoming a through- Availability and implementation: STAR is implemented as a standa- put bottleneck. High mapping speed is especially important for lone Cþþ code. STAR is free open source software distributed under large consortia efforts, such as ENCODE (http://www.genome. GPLv3 license and can be downloaded from http://code.google.com/ gov/encode/), which continuously generate large amounts of p/rna-star/. sequencing data. Contact: [email protected]. Furthermore, most of the cited algorithms were designed to deal with relatively short reads (typically 200 bases), and are Received on May 29, 2012; revised on October 17, 2012; accepted on ill-suited for aligning long read sequences generated by the emer- October 19, 2012 ging third-generation sequencing technologies (Flusberg et al., 2010; Rothberg et al., 2011). The longer read sequences, ideally reaching full lengths of RNA molecules, have a great potential 1 INTRODUCTION for enhancing transcriptome studies by providing more complete Although genomes are composed of linearly ordered sequences of RNA connectivity information. nucleic acids, eukaryotic cells generally reorganize the informa- This report describes an alignment algorithm entitled ‘Spliced tion in the transcriptome by splicing together non-contiguous Transcripts Alignment to a Reference (STAR)’, which was de- exons to create mature transcripts (Hastings and Krainer, 2001). signed to specifically address many of the challenges of RNA-seq The detection and characterization of these spliced RNAs have data mapping, and uses a novel strategy for spliced alignments. been a critical focus of functional analyses of genomes in both the We performed high-throughput validation experiments that cor- normal and disease cell states. Recent advances in sequencing roborated STAR’s precision for detection of novel splice junc- technologies have made transcriptome analyses at the single nu- tions. STAR’s high mapping speed and accuracy were crucial for cleotide level almost routine. However, hundreds of millions of analyzing the large ENCODE transcriptome (Djebali et al., short (36 nt) to medium (200 nt) length sequences (reads) gener- 2012) dataset (480 billion Illumina reads). We also demonstrated ated by such high-throughput sequencing experiments present that STAR has a potential for accurately aligning long (several kilobases) reads that are emerging from the third-generation *To whom correspondence should be addressed. sequencing technologies. The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 15 A.Dobin et al. The MMP in STAR search is implemented through 2ALGORITHM uncompressed suffix arrays (SAs) (Manber and Myers, 1993). Many previously described RNA-seq aligners were developed as Notably, finding MMP is an inherent outcome of the standard extensions of contiguous (DNA) short read mappers, which were binary string search in uncompressed SAs, and does not require used to either align short reads to a database of splice junctions any additional computational effort compared with the full- or align split-read portions contiguously to a reference genome, length exact match searches. The binary nature of the SA or a combination thereof. In contrast to these approaches, STAR search results in a favorable logarithmic scaling of the search was designed to align the non-contiguous sequences directly to time with the reference genome length, allowing fast searching the reference genome. STAR algorithm consists of two major even against large genomes. Advantageously, for each MMP the steps: seed searching step and clustering/stitching/scoring step. SA search can find all distinct exact genomic matches with little computational overhead, which facilitates an accurate alignment 2.1 Seed search of the reads that map to multiple genomic loci (‘‘multimapping’’ reads). The central idea of the STAR seed finding phase is the sequential In addition to detecting splice junctions, the MMP search, search for a Maximal Mappable Prefix (MMP). MMP is similar implemented in STAR, enables finding multiple mismatches to the Maximal Exact (Unique) Match concept used by the and indels, as illustrated in Figure 1b. If the MMP search does large-scale genome alignment tools Mummer (Delcher et al., not reach the end of a read because of the presence of one or 1999, 2002; Kurtz et al.) and MAUVE (Darling et al., 2004, more mismatches, the MMPs will serve as anchors in the genome 2010). Given a read sequence R,read location i and a reference that can be extended, allowing for alignments with mismatches. genome sequence G,the MMP(R,i,G) is defined as the longest In some cases, the extension procedure does not yield a good substring (R , R , .. . , R )thatmatches exactly one or i i þ 1 i þ MML 1 genomic alignment, which allows identification of poly-A tails, more substrings of G,where MML is the maximum mappable library adapter sequences or poor sequencing quality tails length. We will explain this concept using a simple example of a (Fig. 1c). The MMP search is performed in both forward and read that contains a single splice junction and no mismatches reverse directions of the read sequence and can be started from (Fig. 1a). In the first step, the algorithm finds the MMP starting user-defined search start points throughout the read sequence, from the first base of the read. Because the read in this example which facilitates finding anchors for reads with errors near the comprises a splice junction, it cannot be mapped contiguously to ends and improves mapping sensitivity for high sequencing error the genome, and thus the first seed will be mapped to a donor rate conditions. splice site. Next, the MMP search is repeated for the unmapped Besides the efficient MMP search algorithm, uncompressed portion of the read, which, in this case, will be mapped to an SAs also demonstrate a significant speed advantage over the acceptor splice site. Note that this sequential application of compressed SAs implemented in many popular short read MMP search only to the unmapped portions of the read aligners (Supplementary Section 1.8). This speed advantage is makes the STAR algorithm extremely fast and distinguishes it traded off against the increased memory usage by uncompressed from Mummer and MAUVE, which find all possible Maximal arrays, which is assessed further in Section 3.3. Exact Matches. This approach represents a natural way of find- ing precise locations of splice junctions in a read sequence and is advantageous over an arbitrary splitting of read sequences used 2.2 Clustering, stitching and scoring in the split-read methods. The splice junctions are detected in a In the second phase of the algorithm, STAR builds alignments of single alignment pass without any apriori knowledge of splice the entire read sequence by stitching together all the seeds that junctions’ loci or properties, and without a preliminary contigu- were aligned to the genome in the first phase. First, the seeds are ous alignment pass needed by the junction database approaches. clustered together by proximity to a selected set of ‘anchor’ seeds. We found that an optimal procedure for anchor selection is through limiting the number of genomic loci the anchors align (a) Map Map again to. All the seeds that map within user-defined genomic windows MMP 1 MMP 2 around the anchors are stitched together assuming a local linear RNA-seq read transcription model. The size of the genomic windows deter- mines the maximum intron size for the spliced alignments. A frugal dynamic programming algorithm (see Supplementary Section 1.5 for details) is used to stitch each pair of seeds, allow- exons in the genome ing for any number of mismatches but only one insertion or (b)(c) deletion (gap). Map Map Importantly, the seeds from the mates of paired-end RNA-seq MMP 1 Extend MMP 1 Trim reads are clustered and stitched concurrently, with each paired-end read represented as a single sequence, allowing for mismatches A-tail, or adapter, a possible genomic gap or overlap between the inner ends of or poor quality tail the mates. This is a principled way to use the paired-end infor- mation, as it reflects better the nature of the paired-end reads, Fig. 1. Schematic representation of the Maximum Mappable Prefix namely, the fact that the mates are pieces (ends) of the same search in the STAR algorithm for detecting (a)splice junctions, (b)mis- matches and (c) tails sequence. This approach increases the sensitivity of the 16 STAR algorithm, as only one correct anchor from one of the mates is transcript annotations. The maximum number of mismatches sufficient to accurately align the entire read. was set at 10 per paired-end read, and the minimum/maximum If an alignment within one genomic window does not cover the intron sizes were set at 20 b/500 kb (Supplementary Section 2 for entire read sequence, STAR will try to find two or more windows additional information). Note that running comparison between that cover the entire read, resulting in a chimeric alignment, with mappers with their default parameters is a reasonable and com- different parts of the read mapping to distal genomic loci, or monly accepted practice, as all considered aligners were, by de- different chromosomes, or different strands (Supplementary fault, optimized for mammalian genomes and recent RNA-seq Fig. S1). STAR can find chimeric alignments in which the data. mates are chimeric to each other, with a chimeric junction The resulting alignments were compared with the true genomic located in the unsequenced portion of the RNA molecule be- origin of the simulated reads, and true-/false-positive rates of tween two mates. STAR can also find chimeric alignments in splice junction detection were calculated using procedures and which one or both mates are internally chimerically aligned, scripts developed by Grant et al. (2011). ROC curves (Fig. 2) thus pinpointing the precise location of the chimeric junction were computed with the detection (discrimination) threshold in the genome. An example of the BCR-ABL fusion transcript given by the number of reads mapped across each junction, i.e. detection from the K562 erythroleukemia cell line is given in the for each aligner, only junctions supported by at least N reads were selected for each point along the ROC curves, with N varied Supplementary Section 1.7 (Supplementary Fig. S2). The stitching is guided by a local alignment scoring scheme, from 1 (lowest threshold) to 100 (high threshold). All aligners with user-defined scores (penalties) for matches, mismatches, in- exhibit desirable steep ROC curves at high values of detection threshold. At the lowest detection threshold of 1 read per junc- sertions, deletions and splice junction gaps, allowing for a quan- tion, STAR exhibits the lowest false-positive rate while achieving titative assessment of the alignment qualities and ranks (see high sensitivity. Supplementary Figure S5 shows the same ana- Supplementary Section 1.4 for details). The stitched combination lysis for a low error rate-simulated dataset, which yields similar with the highest score is chosen as the best alignment of a read. conclusions. For multimapping reads, all alignments with scores within a cer- tain user-defined range below the highest score are reported. Although the sequential MMP search only finds the seeds 3.2 Performance on experimental RNA-seq data exactly matching the genome, the subsequent stitching procedure For evaluation of the RNA-seq mappers’ performance on experi- is capable of aligning reads with a large number of mismatches, mental RNA-seq data STAR, TopHat2, GSNAP, RUM and indels and splice junctions, scalable with the read length. This MapSplice were run (see Supplementary Section 2 for additional characteristic has become ever more important with the information) on an ENCODE long RNA-seq dataset (K562 emergence of the third-generation sequencing technologies whole cell Aþ sample, 1 Illumina GAIIx lane of 40 million (such as Pacific Biosciences or Ion Torrent) that produce 2 76 reads). STAR and GSNAP aligned the largest percentage longer reads with elevated error rates. of reads (94% both), followed by RUM (86%), MapSplice (85%) and TopHat2 (71%). Different accuracy metrics for splice junction detection with 3 RESULTS respect to the Gencode 7 (Harrow et al., 2012) annotations are 3.1 Performance on simulated RNA-seq data plotted in Figure 3a–c as a function of the detection threshold, defined as the minimum number of RNA-seq reads per junction. First, we used simulated data to evaluate performance of STAR and compare it with other RNA-seq mappers. Simulations allow for a precise calculation of false-positive and -negative rates, al- though artificial error models, used to generate simulated reads, may not adequately represent experimental errors. We used a simulated dataset from a recent study (Grant et al., 2011), in 80 which 10 million of 2 100 nt Illumina-like read sequences with a reasonably high error rate were generated from the mouse transcriptome, including annotated transcripts and artifi- cial ones. Various types of genomic variations and sequencing errors were introduced to mimic real RNA-seq data. The latest available versions of STAR 2.1.3, TopHat2 2.0.0 (Trapnell et al., 2009), GSNAP 2012-07-03 (Wu and Nacu, 2010), RUM 1.11 (Grant et al., 2011) and MapSplice 1.15.2 STAR 20 TopHat2 (Wang et al., 2010) were run on the simulated dataset labeled GSNAP as ‘SIM1-TEST2’ in (Grant et al., 2011). Because the TopHat2 RUM 2.0.0 release represents a major new development of the TopHat MapSplice aligner, which has not been peer reviewed yet, we also made the 0 1 2 3 4 False Positive Rate % comparisons with the previous TopHat version 1.4. We found that the new version yields a slightly better accuracy and faster Fig. 2. True-positive rate versus false-positive rate (ROC-curve) for simu- mapping speed (Supplementary Section 2.1 and Fig. S3). All lated RNA-seq data for STAR, TopHat2, GSNAP, RUM and aligners were run in the de novo mode, i.e. without using gene/ MapSplice True Positive Rate % A.Dobin et al. Fig. 3. Various accuracy metrics for splice junction detection in the experimental RNA-seq data. The color-coding scheme for mappers is the same in all plots. X-axis in plots (a), (b), (d) and (e) is the detection threshold defined as the number of reads mapped across each junction, i.e. each point with the X- value of N represents all junctions that are supported by at least N reads mapped by a given aligner. (a) Total number of detected junctions, annotated (solid lines) and unannotated (dashed lines); (b) percentage of detected junctions that are annotated; (c) pseudo-ROC curve: percentage of all annotated junctions that are detected versus percentage of detected junctions that are unannotated; (d) number of unannotated junctions detected by at least two mappers (solid lines) and number of unannotated junctions detected exclusively by only one mapper (dashed lines); (e) percentage of detected unan- notated junctions that are detected exclusively by only one mapper and (f) pseudo-ROC curve: percentage of unannotated junctions that are detected by at least two mappers versus percentage of detected unannotated junctions that are detected exclusively by only one mapper Although all aligners detect a similar number of annotated junc- (Zhang et al., 2012) and plotted (Fig. 3d) the number of junctions tions (Fig. 3a, solid lines), there are noticeable differences between detected by at least two mappers (pseudo-true positive) and the mappers in the number of detected unannotated junctions (Fig. number of junctions detected exclusively by each mapper 3a, dashed lines). The percentage of the unannotated among all (pseudo-false positive). STAR alignments yield the lowest detected junctions is plotted in Figure 3b as a function of detec- pseudo-false-positive rate, i.e. the lowest proportion of exclu- tion threshold. Because all aligners show similar sensitivities to sively detected junctions (Fig. 3e), while at the same time achiev- annotated junctions, the proportion of annotated among all de- ing the second in class pseudo-sensitivity (Fig. 3f). GSNAP tected junctions may serve as a surrogate of precision. STAR, exhibits the highest pseudo-sensitivity at the cost of a high RUM and TopHat2 perform similarly, while GSNAP exhibits pseudo-false-positive rate. These results qualitatively agree with lower precision at a lower detection threshold, and MapSplice the aligners’ performance on the simulated data, whereas the shows unusual non-monotonic and non-saturating behavior, quantitative differences may be attributed to disparities between which was also noted in Zhang et al. (2012). Pseudo-ROC real and simulated errors. Supplementary Figure S6 shows the curve, i.e. the proportion of annotated junctions that are detected same analysis for a shorter RNA-seq dataset (2 50 b), which (pseudo-sensitivity) versus the proportion of detected junctions indicates that STAR retains high sensitivity and precision even that are unannotated (pseudo-false-positive rate), is plotted in for short reads. Figure 3c. All aligners (except MapSplice) perform similarly at Note that the pseudo-true-/false-positive definitions are based high values of the detection threshold. on the assumption that junctions detected by only one aligner are Because many unannotated junctions represent true novel spli- more likely to be false positive than the junctions detected by two cing events and are not false positives, the percentage of unan- or more aligners; however, these definitions are not rigorous be- notated among all detected junctions is not an accurate proxy for cause the true/false assessments cannot be made for experimental the false-positive rate. To obtain a more accurate estimate of the data. We would also like to stress that these comparisons were false-positive rate, we followed another frequently used approach done for current versions of each tool, with the default 18 STAR parameters and for the present state of Illumina sequencing tech- 3.4 Experimental validation nology. As both sequencing technologies and tools improve, As part of the characterization of human transcriptome by the these rankings may change and have to be reevaluated. ENCODE (Djebali et al., 2012), STAR was used to map poly- Similarly to other RNA-seq aligners, STAR’s default param- adenylated (poly Aþ)long(4200 nt) transcripts isolated from eters are optimized for mammalian genomes. Other species may whole cell extracts of primary human H1ES (embryonic stem require significant modifications of some alignment parameters; cells) and HUVEC (umbilical vein endothelial cells) cell lines. in particular, the maximum and minimum intron sizes have to be These RNAs were sequenced using a duplex-specific nuclease reduced for organisms with smaller introns. protocol (Parkhomchuk et al., 2009) that generated 2 76 bp strand-specific reads. Not surprisingly, unannotated (novel) splice sites show lower 3.3 Speed benchmarks abundance levels than the annotated junctions, as indicated by the significant drop in the number of unannotated junctions with Speed benchmarks were performed on a server equipped with the number of supporting reads (Supplementary Fig. S7). two 6-core Intel Xeon CPUs X5680@ 3.33 GHz and 148 GB Because each of the cell lines was sequenced in biological dupli- of RAM (random-access memory). Six or 12 threads were re- cates, a collection of high confidence splice sites could be identi- quested for each run, using half or full capacity of the server. fied based on their reproducibility between replicas. To assess the All mappers were run with their default parameters on the 40 reproducibility of the detected splice junctions, we developed a million 2 76 Illumina human RNA-seq dataset described in the non-parametric irreproducible discovery rate (npIDR) approach, previous section. specifically suitable for the discrete nature of the RNA-seq data The ‘wall’ time (i.e. the total run time required to complete the (see Supplementary Materials for the detailed description). This mapping) and RAM usage are presented in Table 1. STAR approach is similar to the npIDR concept extensively used in the achieves a speed of 550 million 2 76 Illumina paired-end analysis of the ENCODE ChIP-seq experiments (Landt et al., reads per hour using 12 threads (full capacity of the server), 2012). Supplementary Figure S8 shows the dependence of i.e. 45 million paired reads per hour per processor, outperform- npIDR¼ 0.1 on the read count per junction, providing a prin- ing the second fastest mapper (TopHat2) by a factor450. STAR cipled method for selecting the read count threshold with a desired exhibits close to linear scaling of the throughput rate with the level of reproducibility. For example, five staggered reads per junc- number of threads, losing 10% of per thread mapping speed tion are required to achieve an npIDR of 0.1, i.e. the 90% likeli- when the number of threads is increased from 6 to 12. hood that these junctions will be observed again in another STAR’s high mapping speed is traded off against RAM usage: experiment on the same cell line with the same sequencing depth. STAR requires 27 GB of RAM for aligning to the human Experimental validation was carried out on 1920 novel splice genome. Like all other aligners, with the exception of RUM, junctions in a wide range of RNA-seq reads support, both below the amount of RAM used by STAR does not increase signifi- and above the npIDR threshold. Only splice junctions mapped to cantly with the number of threads, as the SA is shared among all intergenic or antisense loci to Gencode 7 genes (Harrow et al., threads. Although STAR’s RAM requirements would have been 2012) were chosen for validation, as these junctions are more prohibitively expensive several years ago, at the time when the likely to be false positive than the junctions that map within the first short read aligners were developed, recent progress in semi- annotated genes. The high-throughput validation pipeline conductor technologies resulted in a substantial drop of RAM involved reverse transcription polymerase chain reaction amplifi- prices, and modern high performance servers are commonly cation of targeted regions followed by Roche 454 sequencing of equipped with RAM 432 GB. STAR has an option to use the pooled products. The reverse transcription polymerase chain sparse SAs, reducing the RAM consumption to 516 GB for reaction primer design took advantage of the250 nt insert length the human genome at the cost of 25% decrease in the mapping of the paired-end reads supporting targeted junctions, and en- speed, while maintaining the alignment accuracy. tailed the production of long 300–600 nt amplicons. These ampli- cons were pooled and sequenced by a Roche 454 sequencer to provide long and more confidently mappable reads that were Table 1. Mapping speed and RAM benchmarks on the experimental aligned to the genome with BLAT. Detailed description of the RNA-seq dataset experimental protocols can be found in Djebali et al.(2012). We selected 1920 intergenic and antisense splice junctions from Aligner Mapping speed: million Peak physical H1ES and HUVEC cell lines, including both highly read pairs/hour RAM, GB (npIDR50.1) and poorly (npIDR40.1) reproducible junctions. Of all the tested novel intergenic/antisense junctions supported 6 threads 12 threads 6 threads 12 threads by at least five RNA-seq reads (corresponding to npIDR50.1), 82–89% (H1ES) and 84–95% (HUVEC) were corroborated by STAR 309.2 549.9 27.0 28.4 at least two amplicons sequenced by 454 (Table 2). Notably, the STAR sparse 227.6 423.1 15.6 16.0 validation rate remains at a high level of 72% (H1ES) and 74% TopHat2 8.0 10.1 4.1 11.3 (HUVEC) even for the candidate junctions that were supported RUM 5.1 7.6 26.9 53.8 by as few as two RNA-seq reads. These results confirm high MapSplice 3.0 3.1 3.3 3.3 precision of the STAR’s splicing detection algorithm even for GSNAP 1.8 2.8 25.9 27.0 rare novel junctions. 19 A.Dobin et al. Table 2. Number of selected junctions and percentage of selected junctions that were validated by at least two 454 reads, as a function of the RNA-seq read count per junction H1ES HUVEC Read count per Number of Proportion of Read count Number of Proportion of junction from tested junctions validated per junction tested junctions junctions validated two replicates junctions by at least two from two replicates by at least two 454 reads (%) 454 reads (%) 2 192 72.4 2 192 74.0 3 192 77.6 3 192 75.0 4 96 74.0 4 96 76.0 5 96 82.3 5–6 96 84.4 6–7 96 79.2 7–8 96 84.4 8–11 96 81.3 9–12 96 86.5 12–24 96 87.5 13–23 96 94.8 25 96 88.5 24 96 90.6 The upper bound of the false discovery rate (FDR) can be One of the main inherent problems of all de novo RNA-seq estimated from the validation rate (VR) as FDR 1 VR. aligners is the inability to accurately detect splicing events that For low abundance junctions, the experimental FDR is lower involve short (55–10 nt) sequence overhangs on the donor or than the npIDR predicted from the dissimilarity between the acceptor sides of a junction. This causes a significant replicates: for example, although 45% of junctions, supported underdetection of splicing events, and also increases significantly the misalignment rate, as such reads are likely to be mapped with by just two reads, are not reproducible (Supplementary Fig. a few mismatches to a similar contiguous genomic region. In S8),470% of them are successfully validated (Table 2). Hence, addition, this effect also biases the alignments toward processed npIDR can serve as a conservative upper bound FDR estimate pseudogenes, which are abundant in the human genome. in cases where validation experiments are impractical. Similarly to other RNA-seq aligners, to mitigate this problem, STAR has an option to obtain information about possible splice junction loci from annotation databases (Supplementary Section 4DISCUSSION 4). It is also possible to run a second mapping pass, supplying it with splice junction loci found in the first mapping pass. In this Despite several years of ongoing improvements, alignment of the case, STAR will not discover any new junctions but will align non-contiguous RNA-seq reads to a reference genome is not a spliced reads with short overhangs across the previously detected solved problem yet, owing both to its intrinsic complexity and junctions. rapid transformations of the sequencing technologies. Several To demonstrate STAR’s ability to align long reads, we have critical problems have been found to afflict previously published mapped the long (0.5–5 kb) human mRNA sequences from approaches, such as high mapping error rate, alignment biases, GenBank (see Supplementary Section 5 for details). The accur- low sensitivity for unannotated transcripts, poor scalability with acy of STAR alignments is similar or higher than that of BLAT the read length, restrictions in the number of junctions/mis- (Kent, 2002) a popular EST/mRNA aligner. At the same time, matches/indels per read, inability to detect non-linear transcripts STAR outperforms BLAT by more than two orders of magni- (such as chimeric RNAs), and, crucially, low mapping tude in the alignment speed, which is important for throughput. high-throughput sequencing applications. In this work, we described STAR, a novel algorithm for align- The algorithm extensibility to long reads shows that STAR ing high-throughput long and short RNA-seq data to a reference has a potential to serve as a universal alignment tool across a genome, developed to overcome the aforementioned issues. broad spectrum of emerging sequencing platforms. STAR can Unlike many other RNA-seq mappers, STAR is not an extension align reads in a continuous streaming mode, which makes it of a short-read DNA mapper, but was developed as a compatible with novel sequencing technologies such as the one stand-alone Cþþ code. STAR is capable of running parallel recently announced by Oxford Nanopore Technologies. As the threads on multicore systems with close to linear scaling of prod- sequencing technologies and protocols evolve, new mapping stra- uctivity with the number of cores. STAR is fast: on a modern, tegies will have to be developed, and STAR core algorithm can but not overly expensive, 12-core server, it can align 550 million provide a flexible framework to address arising alignment 2 76 nt reads per hour to the human genome, surpassing all challenges. other existing RNA-seq aligners by a factor of 50. At the same time, STAR exhibits better alignment precision and sensitivity Data access than other RNA-seq aligners for both experimental and simu- GEO: GSE38886 (Roche 454 sequencing) lated data. GEO: GSE30567 (Illumina long RNA-Seq) 20 STAR Han,J. et al. (2011) Pre-mRNA splicing: where and when in the nucleus. Trends Funding: This work was funded by NHGRI (NIH) grant Cell. Biol., 21, 336–343. U54HG004557. Harrow,J. et al. (2012) GENCODE: The reference human genome annotation for the ENCODE project. Genome Res., 22, 1760–1774. Conflict of Interest: none declared. Hastings,M.L. and Krainer,A.R. (2001) Pre-mRNA splicing in the new millennium. Curr. Opin. Cell. Biol., 13, 302–309. Kurtz,S. et al. (2004) Versatile and open software for comparing large genomes. REFERENCES Genome Biol., 5,R12. Kent,W.S. (2002) BLAT–the BLAST-like alignment tool. Genome Res., 12, Au,K.F. et al. (2010) Detection of splice junctions from paired-end RNA-seq data 656–664. by SpliceMap. Nucleic Acids Res., 38, 4570–4578. Landt,S.G. et al. (2012) ChIP-seq guidelines and practices of the ENCODE and Darling,A.C. et al. (2004) Mauve: multiple alignment of conserved genomic se- modENCODE consortia. Genome Res., 22, 1813–1831. quence with rearrangements. Genome Res., 14, 1394–1403. Manber,U. and Myers,G. (1993) Suffix arrays—a new method for online string Darling,A.E. et al. (2010) progressiveMauve: multiple genome alignment with gene searches. SIAM J. Comput., 22, 935–948. gain, loss and rearrangement. PLoS One, 5, e11147. Parkhomchuk,D. et al. (2009) Transcriptome analysis by strand-specific sequencing De Bona,F. et al. (2008) Optimal spliced alignments of short sequence reads. of complementary DNA. Nucleic Acids Res., 37,e123. Bioinformatics, 24, i174–180. Rothberg,J.M. et al. (2011) An integrated semiconductor device enabling Delcher,A.L. et al. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, non-optical genome sequencing. Nature, 475, 348–352. 2369–2376. Trapnell,C. et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Delcher,A.L. et al. (2002) Fast algorithms for large-scale genome alignment and Bioinformatics, 25, 1105–1111. comparison. Nucleic Acids Res., 30, 2478–2483. Wang,K. et al. (2010) MapSplice: accurate mapping of RNA-seq reads for splice Djebali,S. et al. (2012) Landscape of transcription in human cells. Nature, 489, junction discovery. Nucleic Acids Res., 38,e178. 101–108. Wu,T.D. and Nacu,S. (2010) Fast and SNP-tolerant detection of complex variants Flusberg,B.A. et al. (2010) Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods, 7, 461–465. and splicing in short reads. Bioinformatics, 26, 873–881. Grant,G.R. et al. (2011) Comparative analysis of RNA-Seq alignment algo- Zhang,Y. et al. (2012) PASSion: a pattern growth algorithm-based pipeline for rithms and the RNA-Seq unified mapper (RUM). Bioinformatics, 27, splice junction detection in paired-end RNA-Seq data. Bioinformatics, 28, 2518–2528. 479–486. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/star-ultrafast-universal-rna-seq-aligner-wiv68O0FAx

Loading next page...

References (23)

K. Au, Hui Jiang, Lan Lin, Yi Xing, W. Wong (2010)
Detection of splice junctions from paired-end RNA-seq data by SpliceMap
Nucleic Acids Research, 38
D. Parkhomchuk, T. Borodina, V. Amstislavskiy, Maria Banaru, Linda Hallén, S. Krobitsch, H. Lehrach, A. Soldatov (2009)
Transcriptome analysis by strand-specific sequencing of complementary DNA
Nucleic Acids Research, 37
A. Delcher, A. Phillippy, J. Carlton, S. Salzberg (2002)
Fast algorithms for large-scale genome alignment and comparison.
Nucleic acids research, 30 11
S. Kurtz, A. Phillippy, A. Delcher, M. Smoot, Martin Shumway, C. Antonescu, S. Salzberg (2004)
Versatile and open software for comparing large genomes
Genome Biology, 5
J. Harrow, A. Frankish, J. Gonzalez, Electra Tapanari, M. Diekhans, F. Kokocinski, Bronwen Aken, D. Barrell, A. Zadissa, S. Searle, If Barnes, A. Bignell, Veronika Boychenko, T. Hunt, M. Kay, Gaurab Mukherjee, J. Rajan, Gloria Despacio-Reyes, Gary Saunders, C. Steward, R. Harte, Michael Lin, C. Howald, Andrea Tanzer, T. Derrien, Jacqueline Chrast, Nathalie Walters, S. Balasubramanian, Baikang Pei, M. Tress, J. Rodríguez, Iakes Ezkurdia, Jeltje Baren, M. Brent, D. Haussler, Manolis Kellis, A. Valencia, A. Reymond, M. Gerstein, R. Guigó, T. Hubbard (2012)
GENCODE: The reference human genome annotation for The ENCODE Project
Genome Research, 22
U. Manber, E. Myers (1993)
Suffix arrays: a new method for on-line string searches
SIAM J. Comput., 22
A. Darling, B. Mau, N. Perna (2010)
progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement
PLoS ONE, 5
B. Flusberg, Dale Webster, Jessa Lee, K. Travers, Eric Olivares, Tyson Clark, J. Korlach, S. Turner (2010)
Direct detection of DNA methylation during single-molecule, real-time sequencing
Nature methods, 7
G. Grant, M. Farkas, A. Pizarro, N. Lahens, J. Schug, B. Brunk, C. Stoeckert, J. Hogenesch, E. Pierce (2011)
Comparative analysis of RNA-Seq alignment algorithms and the RNA-Seq unified mapper (RUM)
Bioinformatics, 27 18
Kai Wang, Darshan Singh, Z. Zeng, S. Coleman, Yan Huang, G. Savich, Xiaping He, P. Mieczkowski, Sara Grimm, C. Perou, J. MacLeod, Derek Chiang, J. Prins, Jinze Liu (2010)
MapSplice: Accurate mapping of RNA-seq reads for splice junction discovery
Nucleic Acids Research, 38
Joonhee Han, Jikang Xiong, Dong Wang, Xiang-Dong Fu (2011)
Pre-mRNA splicing: where and when in the nucleus.
Trends in cell biology, 21 6
A. Delcher, A. Delcher, S. Kasif, R. Fleischmann, J. Peterson, O. White, S. Salzberg (1999)
Alignment of whole genomes.
Nucleic acids research, 27 11
C. Trapnell, L. Pachter, Steven Salzberg (2009)
TopHat: discovering splice junctions with RNA-Seq
Bioinformatics, 25
De Bona (2008)
Optimal spliced alignments of short sequence reads
Bioinformatics, 24
A. Darling, B. Mau, F. Blattner, N. Perna (2004)
Mauve: multiple alignment of conserved genomic sequence with rearrangements.
Genome research, 14 7
M. Hastings, A. Krainer (2001)
Pre-mRNA splicing in the new millennium.
Current opinion in cell biology, 13 3
Yanju Zhang, Eric-Wubbo Lameijer, P. Hoen, Z. Ning, P. Slagboom, K. Ye (2012)
PASSion: a pattern growth algorithm-based pipeline for splice junction detection in paired-end RNA-Seq data
Bioinformatics, 28
S. Landt, G. Marinov, A. Kundaje, P. Kheradpour, Florencia Pauli, S. Batzoglou, B. Bernstein, P. Bickel, James Brown, Philip Cayting, Yiwen Chen, Gilberto Desalvo, C. Epstein, Katherine Fisher-Aylor, G. Euskirchen, M. Gerstein, Jason Gertz, A. Hartemink, M. Hoffman, V. Iyer, Y. Jung, S. Karmakar, Manolis Kellis, P. Kharchenko, Qunhua Li, Tao Liu, X. Liu, Lijia Ma, A. Milosavljevic, R. Myers, P. Park, M. Pazin, M. Perry, D. Raha, Timothy Reddy, J. Rozowsky, N. Shoresh, A. Sidow, Matthew Slattery, J. Stamatoyannopoulos, M. Tolstorukov, K. White, S. Xi, P. Farnham, J. Lieb, B. Wold, M. Snyder (2012)
ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
Genome Research, 22
S. Djebali, Carrie Davis, A. Merkel, A. Dobin, T. Lassmann, A. Mortazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, F. Schlesinger, Chenghai Xue, G. Marinov, Jainab Khatun, B. Williams, C. Zaleski, J. Rozowsky, M. Röder, F. Kokocinski, Rehab Abdelhamid, T. Alioto, I. Antoshechkin, Michael Baer, Nadav Bar, P. Batut, K. Bell, I. Bell, S. Chakrabortty, Xian Chen, Jacqueline Chrast, João Curado, T. Derrien, J. Drenkow, E. Dumais, Jacqueline Dumais, R. Duttagupta, E. Falconnet, Meagan Fastuca, Katalin Fejes-Toth, Pedro Ferreira, S. Foissac, M. Fullwood, Hui Gao, David Gonzalez, Assaf Gordon, H. Gunawardena, C. Howald, Sonali Jha, Rory Johnson, P. Kapranov, Brandon King, C. Kingswood, O. Luo, Eddie Park, K. Persaud, Jonathan Preall, Paolo Ribeca, B. Risk, D. Robyr, M. Sammeth, Lorian Schaffer, L. See, A. Shahab, Jørgen Skancke, A. Suzuki, Hazuki Takahashi, Hagen Tilgner, Diane Trout, Nathalie Walters, Huaien Wang, J. Wrobel, Yanbao Yu, Xiaoan Ruan, Y. Hayashizaki, J. Harrow, M. Gerstein, T. Hubbard, A. Reymond, S. Antonarakis, G. Hannon, Morgan Giddings, Y. Ruan, B. Wold, Piero Carninci, R. Guigó, T. Gingeras (2012)
Landscape of transcription in human cells
Nature, 489
J. Rothberg, W. Hinz, T. Rearick, Jonathan Schultz, W. Mileski, M. Davey, J. Leamon, Kim Johnson, M. Milgrew, M. Edwards, Jeremy Hoon, J. Simons, D. Marran, J. Myers, J. Davidson, Annika Branting, J. Nobile, Bernard Puc, David Light, T. Clark, M. Huber, Jeffrey Branciforte, Isaac Stoner, S. Cawley, Michael Lyons, Yutao Fu, Nils Homer, Marina Sedova, Xin Miao, Brian Reed, Jeffrey Sabina, E. Feierstein, M. Schorn, Mohammad Alanjary, E. Dimalanta, D. Dressman, Rachel Kasinskas, T. Sokolsky, J. Fidanza, Eugeni Namsaraev, K. McKernan, Alan Williams, G. Roth, J. Bustillo (2011)
An integrated semiconductor device enabling non-optical genome sequencing
Nature, 475
Thomas Wu, Serban Nacu (2010)
Fast and SNP-tolerant detection of complex variants and splicing in short reads
Bioinformatics, 26
F. Bona, S. Ossowski, Korbinian Schneeberger, G. Rätsch (2008)
Optimal spliced alignments of short sequence reads
BMC Bioinformatics, 9
W. Kent (2002)
BLAT--the BLAST-like alignment tool.
Genome research, 12 4

Publisher: Oxford University Press
Copyright: © The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected]
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/bts635
pmid: 23104886
Publisher site: See Article on Publisher Site

Abstract

Vol. 29 no. 1 2013, pages 15–21 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bts635 Sequence analysis Advance Access publication October 25, 2012 1, 1 1 1 1 Alexander Dobin , Carrie A. Davis , Felix Schlesinger ,Jorg Drenkow , Chris Zaleski , 1 1 2 1 Sonali Jha ,Philippe Batut ,Mark Chaisson and Thomas R. Gingeras 1 2 Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA and Pacific Biosciences, Menlo Park, CA, USA Associate Editor: Inanc Birol unique challenges to detection and characterization of spliced ABSTRACT transcripts. Two key tasks make these analyses computationally Motivation: Accurate alignment of high-throughput RNA-seq data is a intensive. The first task is an accurate alignment of reads that challenging and yet unsolved problem because of the non-contiguous contain mismatches, insertions and deletions caused by genomic transcript structure, relatively short read lengths and constantly variations and sequencing errors. The second task involves map- increasing throughput of the sequencing technologies. Currently avail- ping sequences derived from non-contiguous genomic regions able RNA-seq aligners suffer from high mapping error rates, low map- comprising spliced sequence modules that are joined together to ping speed, read length limitation and mapping biases. form spliced RNAs. Although the first task is shared with DNA Results: To align our large (480 billon reads) ENCODE Transcriptome resequencing efforts, the second task is specific and crucial to the RNA-seq dataset, we developed the Spliced Transcripts Alignment to RNA-seq, as it provides the connectivity information needed to a Reference (STAR) software based on a previously undescribed reconstruct the full extent of spliced RNA molecules. These align- RNA-seq alignment algorithm that uses sequential maximum map- ment challenges are further compounded by the presence of mul- pable seed search in uncompressed suffix arrays followed by seed tiple copies of identical or related genomic sequences that are clustering and stitching procedure. STAR outperforms other aligners themselves transcribed, making precise mapping difficult. by a factor of450 in mapping speed, aligning to the human genome Various sequence alignment algorithms have been recently 550 million 2 76 bp paired-end reads per hour on a modest 12-core developed to tackle these challenges (Au et al., 2010; De Bona, server, while at the same time improving alignment sensitivity and et al., 2008; Grant et al., 2011; Han et al., 2011; Trapnell et al., precision. In addition to unbiased de novo detection of canonical junc- 2009; Wang et al., 2010; Wu and Nacu, 2010; Zhang et al., 2012). tions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA se- However, application of these algorithms invokes compromises quences. Using Roche 454 sequencing of reverse transcription poly- in the areas of mapping accuracy (sensitivity and precision) and merase chain reaction amplicons, we experimentally validated 1960 computational resources (run time and disk space) (Grant et al., novel intergenic splice junctions with an 80–90% success rate, corro- 2011). With current advances in sequencing technologies, the borating the high precision of the STAR mapping strategy. computational component is increasingly becoming a through- Availability and implementation: STAR is implemented as a standa- put bottleneck. High mapping speed is especially important for lone Cþþ code. STAR is free open source software distributed under large consortia efforts, such as ENCODE (http://www.genome. GPLv3 license and can be downloaded from http://code.google.com/ gov/encode/), which continuously generate large amounts of p/rna-star/. sequencing data. Contact: [email protected]. Furthermore, most of the cited algorithms were designed to deal with relatively short reads (typically 200 bases), and are Received on May 29, 2012; revised on October 17, 2012; accepted on ill-suited for aligning long read sequences generated by the emer- October 19, 2012 ging third-generation sequencing technologies (Flusberg et al., 2010; Rothberg et al., 2011). The longer read sequences, ideally reaching full lengths of RNA molecules, have a great potential 1 INTRODUCTION for enhancing transcriptome studies by providing more complete Although genomes are composed of linearly ordered sequences of RNA connectivity information. nucleic acids, eukaryotic cells generally reorganize the informa- This report describes an alignment algorithm entitled ‘Spliced tion in the transcriptome by splicing together non-contiguous Transcripts Alignment to a Reference (STAR)’, which was de- exons to create mature transcripts (Hastings and Krainer, 2001). signed to specifically address many of the challenges of RNA-seq The detection and characterization of these spliced RNAs have data mapping, and uses a novel strategy for spliced alignments. been a critical focus of functional analyses of genomes in both the We performed high-throughput validation experiments that cor- normal and disease cell states. Recent advances in sequencing roborated STAR’s precision for detection of novel splice junc- technologies have made transcriptome analyses at the single nu- tions. STAR’s high mapping speed and accuracy were crucial for cleotide level almost routine. However, hundreds of millions of analyzing the large ENCODE transcriptome (Djebali et al., short (36 nt) to medium (200 nt) length sequences (reads) gener- 2012) dataset (480 billion Illumina reads). We also demonstrated ated by such high-throughput sequencing experiments present that STAR has a potential for accurately aligning long (several kilobases) reads that are emerging from the third-generation *To whom correspondence should be addressed. sequencing technologies. The Author 2012. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: [email protected] 15 A.Dobin et al. The MMP in STAR search is implemented through 2ALGORITHM uncompressed suffix arrays (SAs) (Manber and Myers, 1993). Many previously described RNA-seq aligners were developed as Notably, finding MMP is an inherent outcome of the standard extensions of contiguous (DNA) short read mappers, which were binary string search in uncompressed SAs, and does not require used to either align short reads to a database of splice junctions any additional computational effort compared with the full- or align split-read portions contiguously to a reference genome, length exact match searches. The binary nature of the SA or a combination thereof. In contrast to these approaches, STAR search results in a favorable logarithmic scaling of the search was designed to align the non-contiguous sequences directly to time with the reference genome length, allowing fast searching the reference genome. STAR algorithm consists of two major even against large genomes. Advantageously, for each MMP the steps: seed searching step and clustering/stitching/scoring step. SA search can find all distinct exact genomic matches with little computational overhead, which facilitates an accurate alignment 2.1 Seed search of the reads that map to multiple genomic loci (‘‘multimapping’’ reads). The central idea of the STAR seed finding phase is the sequential In addition to detecting splice junctions, the MMP search, search for a Maximal Mappable Prefix (MMP). MMP is similar implemented in STAR, enables finding multiple mismatches to the Maximal Exact (Unique) Match concept used by the and indels, as illustrated in Figure 1b. If the MMP search does large-scale genome alignment tools Mummer (Delcher et al., not reach the end of a read because of the presence of one or 1999, 2002; Kurtz et al.) and MAUVE (Darling et al., 2004, more mismatches, the MMPs will serve as anchors in the genome 2010). Given a read sequence R,read location i and a reference that can be extended, allowing for alignments with mismatches. genome sequence G,the MMP(R,i,G) is defined as the longest In some cases, the extension procedure does not yield a good substring (R , R , .. . , R )thatmatches exactly one or i i þ 1 i þ MML 1 genomic alignment, which allows identification of poly-A tails, more substrings of G,where MML is the maximum mappable library adapter sequences or poor sequencing quality tails length. We will explain this concept using a simple example of a (Fig. 1c). The MMP search is performed in both forward and read that contains a single splice junction and no mismatches reverse directions of the read sequence and can be started from (Fig. 1a). In the first step, the algorithm finds the MMP starting user-defined search start points throughout the read sequence, from the first base of the read. Because the read in this example which facilitates finding anchors for reads with errors near the comprises a splice junction, it cannot be mapped contiguously to ends and improves mapping sensitivity for high sequencing error the genome, and thus the first seed will be mapped to a donor rate conditions. splice site. Next, the MMP search is repeated for the unmapped Besides the efficient MMP search algorithm, uncompressed portion of the read, which, in this case, will be mapped to an SAs also demonstrate a significant speed advantage over the acceptor splice site. Note that this sequential application of compressed SAs implemented in many popular short read MMP search only to the unmapped portions of the read aligners (Supplementary Section 1.8). This speed advantage is makes the STAR algorithm extremely fast and distinguishes it traded off against the increased memory usage by uncompressed from Mummer and MAUVE, which find all possible Maximal arrays, which is assessed further in Section 3.3. Exact Matches. This approach represents a natural way of find- ing precise locations of splice junctions in a read sequence and is advantageous over an arbitrary splitting of read sequences used 2.2 Clustering, stitching and scoring in the split-read methods. The splice junctions are detected in a In the second phase of the algorithm, STAR builds alignments of single alignment pass without any apriori knowledge of splice the entire read sequence by stitching together all the seeds that junctions’ loci or properties, and without a preliminary contigu- were aligned to the genome in the first phase. First, the seeds are ous alignment pass needed by the junction database approaches. clustered together by proximity to a selected set of ‘anchor’ seeds. We found that an optimal procedure for anchor selection is through limiting the number of genomic loci the anchors align (a) Map Map again to. All the seeds that map within user-defined genomic windows MMP 1 MMP 2 around the anchors are stitched together assuming a local linear RNA-seq read transcription model. The size of the genomic windows deter- mines the maximum intron size for the spliced alignments. A frugal dynamic programming algorithm (see Supplementary Section 1.5 for details) is used to stitch each pair of seeds, allow- exons in the genome ing for any number of mismatches but only one insertion or (b)(c) deletion (gap). Map Map Importantly, the seeds from the mates of paired-end RNA-seq MMP 1 Extend MMP 1 Trim reads are clustered and stitched concurrently, with each paired-end read represented as a single sequence, allowing for mismatches A-tail, or adapter, a possible genomic gap or overlap between the inner ends of or poor quality tail the mates. This is a principled way to use the paired-end infor- mation, as it reflects better the nature of the paired-end reads, Fig. 1. Schematic representation of the Maximum Mappable Prefix namely, the fact that the mates are pieces (ends) of the same search in the STAR algorithm for detecting (a)splice junctions, (b)mis- matches and (c) tails sequence. This approach increases the sensitivity of the 16 STAR algorithm, as only one correct anchor from one of the mates is transcript annotations. The maximum number of mismatches sufficient to accurately align the entire read. was set at 10 per paired-end read, and the minimum/maximum If an alignment within one genomic window does not cover the intron sizes were set at 20 b/500 kb (Supplementary Section 2 for entire read sequence, STAR will try to find two or more windows additional information). Note that running comparison between that cover the entire read, resulting in a chimeric alignment, with mappers with their default parameters is a reasonable and com- different parts of the read mapping to distal genomic loci, or monly accepted practice, as all considered aligners were, by de- different chromosomes, or different strands (Supplementary fault, optimized for mammalian genomes and recent RNA-seq Fig. S1). STAR can find chimeric alignments in which the data. mates are chimeric to each other, with a chimeric junction The resulting alignments were compared with the true genomic located in the unsequenced portion of the RNA molecule be- origin of the simulated reads, and true-/false-positive rates of tween two mates. STAR can also find chimeric alignments in splice junction detection were calculated using procedures and which one or both mates are internally chimerically aligned, scripts developed by Grant et al. (2011). ROC curves (Fig. 2) thus pinpointing the precise location of the chimeric junction were computed with the detection (discrimination) threshold in the genome. An example of the BCR-ABL fusion transcript given by the number of reads mapped across each junction, i.e. detection from the K562 erythroleukemia cell line is given in the for each aligner, only junctions supported by at least N reads were selected for each point along the ROC curves, with N varied Supplementary Section 1.7 (Supplementary Fig. S2). The stitching is guided by a local alignment scoring scheme, from 1 (lowest threshold) to 100 (high threshold). All aligners with user-defined scores (penalties) for matches, mismatches, in- exhibit desirable steep ROC curves at high values of detection threshold. At the lowest detection threshold of 1 read per junc- sertions, deletions and splice junction gaps, allowing for a quan- tion, STAR exhibits the lowest false-positive rate while achieving titative assessment of the alignment qualities and ranks (see high sensitivity. Supplementary Figure S5 shows the same ana- Supplementary Section 1.4 for details). The stitched combination lysis for a low error rate-simulated dataset, which yields similar with the highest score is chosen as the best alignment of a read. conclusions. For multimapping reads, all alignments with scores within a cer- tain user-defined range below the highest score are reported. Although the sequential MMP search only finds the seeds 3.2 Performance on experimental RNA-seq data exactly matching the genome, the subsequent stitching procedure For evaluation of the RNA-seq mappers’ performance on experi- is capable of aligning reads with a large number of mismatches, mental RNA-seq data STAR, TopHat2, GSNAP, RUM and indels and splice junctions, scalable with the read length. This MapSplice were run (see Supplementary Section 2 for additional characteristic has become ever more important with the information) on an ENCODE long RNA-seq dataset (K562 emergence of the third-generation sequencing technologies whole cell Aþ sample, 1 Illumina GAIIx lane of 40 million (such as Pacific Biosciences or Ion Torrent) that produce 2 76 reads). STAR and GSNAP aligned the largest percentage longer reads with elevated error rates. of reads (94% both), followed by RUM (86%), MapSplice (85%) and TopHat2 (71%). Different accuracy metrics for splice junction detection with 3 RESULTS respect to the Gencode 7 (Harrow et al., 2012) annotations are 3.1 Performance on simulated RNA-seq data plotted in Figure 3a–c as a function of the detection threshold, defined as the minimum number of RNA-seq reads per junction. First, we used simulated data to evaluate performance of STAR and compare it with other RNA-seq mappers. Simulations allow for a precise calculation of false-positive and -negative rates, al- though artificial error models, used to generate simulated reads, may not adequately represent experimental errors. We used a simulated dataset from a recent study (Grant et al., 2011), in 80 which 10 million of 2 100 nt Illumina-like read sequences with a reasonably high error rate were generated from the mouse transcriptome, including annotated transcripts and artifi- cial ones. Various types of genomic variations and sequencing errors were introduced to mimic real RNA-seq data. The latest available versions of STAR 2.1.3, TopHat2 2.0.0 (Trapnell et al., 2009), GSNAP 2012-07-03 (Wu and Nacu, 2010), RUM 1.11 (Grant et al., 2011) and MapSplice 1.15.2 STAR 20 TopHat2 (Wang et al., 2010) were run on the simulated dataset labeled GSNAP as ‘SIM1-TEST2’ in (Grant et al., 2011). Because the TopHat2 RUM 2.0.0 release represents a major new development of the TopHat MapSplice aligner, which has not been peer reviewed yet, we also made the 0 1 2 3 4 False Positive Rate % comparisons with the previous TopHat version 1.4. We found that the new version yields a slightly better accuracy and faster Fig. 2. True-positive rate versus false-positive rate (ROC-curve) for simu- mapping speed (Supplementary Section 2.1 and Fig. S3). All lated RNA-seq data for STAR, TopHat2, GSNAP, RUM and aligners were run in the de novo mode, i.e. without using gene/ MapSplice True Positive Rate % A.Dobin et al. Fig. 3. Various accuracy metrics for splice junction detection in the experimental RNA-seq data. The color-coding scheme for mappers is the same in all plots. X-axis in plots (a), (b), (d) and (e) is the detection threshold defined as the number of reads mapped across each junction, i.e. each point with the X- value of N represents all junctions that are supported by at least N reads mapped by a given aligner. (a) Total number of detected junctions, annotated (solid lines) and unannotated (dashed lines); (b) percentage of detected junctions that are annotated; (c) pseudo-ROC curve: percentage of all annotated junctions that are detected versus percentage of detected junctions that are unannotated; (d) number of unannotated junctions detected by at least two mappers (solid lines) and number of unannotated junctions detected exclusively by only one mapper (dashed lines); (e) percentage of detected unan- notated junctions that are detected exclusively by only one mapper and (f) pseudo-ROC curve: percentage of unannotated junctions that are detected by at least two mappers versus percentage of detected unannotated junctions that are detected exclusively by only one mapper Although all aligners detect a similar number of annotated junc- (Zhang et al., 2012) and plotted (Fig. 3d) the number of junctions tions (Fig. 3a, solid lines), there are noticeable differences between detected by at least two mappers (pseudo-true positive) and the mappers in the number of detected unannotated junctions (Fig. number of junctions detected exclusively by each mapper 3a, dashed lines). The percentage of the unannotated among all (pseudo-false positive). STAR alignments yield the lowest detected junctions is plotted in Figure 3b as a function of detec- pseudo-false-positive rate, i.e. the lowest proportion of exclu- tion threshold. Because all aligners show similar sensitivities to sively detected junctions (Fig. 3e), while at the same time achiev- annotated junctions, the proportion of annotated among all de- ing the second in class pseudo-sensitivity (Fig. 3f). GSNAP tected junctions may serve as a surrogate of precision. STAR, exhibits the highest pseudo-sensitivity at the cost of a high RUM and TopHat2 perform similarly, while GSNAP exhibits pseudo-false-positive rate. These results qualitatively agree with lower precision at a lower detection threshold, and MapSplice the aligners’ performance on the simulated data, whereas the shows unusual non-monotonic and non-saturating behavior, quantitative differences may be attributed to disparities between which was also noted in Zhang et al. (2012). Pseudo-ROC real and simulated errors. Supplementary Figure S6 shows the curve, i.e. the proportion of annotated junctions that are detected same analysis for a shorter RNA-seq dataset (2 50 b), which (pseudo-sensitivity) versus the proportion of detected junctions indicates that STAR retains high sensitivity and precision even that are unannotated (pseudo-false-positive rate), is plotted in for short reads. Figure 3c. All aligners (except MapSplice) perform similarly at Note that the pseudo-true-/false-positive definitions are based high values of the detection threshold. on the assumption that junctions detected by only one aligner are Because many unannotated junctions represent true novel spli- more likely to be false positive than the junctions detected by two cing events and are not false positives, the percentage of unan- or more aligners; however, these definitions are not rigorous be- notated among all detected junctions is not an accurate proxy for cause the true/false assessments cannot be made for experimental the false-positive rate. To obtain a more accurate estimate of the data. We would also like to stress that these comparisons were false-positive rate, we followed another frequently used approach done for current versions of each tool, with the default 18 STAR parameters and for the present state of Illumina sequencing tech- 3.4 Experimental validation nology. As both sequencing technologies and tools improve, As part of the characterization of human transcriptome by the these rankings may change and have to be reevaluated. ENCODE (Djebali et al., 2012), STAR was used to map poly- Similarly to other RNA-seq aligners, STAR’s default param- adenylated (poly Aþ)long(4200 nt) transcripts isolated from eters are optimized for mammalian genomes. Other species may whole cell extracts of primary human H1ES (embryonic stem require significant modifications of some alignment parameters; cells) and HUVEC (umbilical vein endothelial cells) cell lines. in particular, the maximum and minimum intron sizes have to be These RNAs were sequenced using a duplex-specific nuclease reduced for organisms with smaller introns. protocol (Parkhomchuk et al., 2009) that generated 2 76 bp strand-specific reads. Not surprisingly, unannotated (novel) splice sites show lower 3.3 Speed benchmarks abundance levels than the annotated junctions, as indicated by the significant drop in the number of unannotated junctions with Speed benchmarks were performed on a server equipped with the number of supporting reads (Supplementary Fig. S7). two 6-core Intel Xeon CPUs X5680@ 3.33 GHz and 148 GB Because each of the cell lines was sequenced in biological dupli- of RAM (random-access memory). Six or 12 threads were re- cates, a collection of high confidence splice sites could be identi- quested for each run, using half or full capacity of the server. fied based on their reproducibility between replicas. To assess the All mappers were run with their default parameters on the 40 reproducibility of the detected splice junctions, we developed a million 2 76 Illumina human RNA-seq dataset described in the non-parametric irreproducible discovery rate (npIDR) approach, previous section. specifically suitable for the discrete nature of the RNA-seq data The ‘wall’ time (i.e. the total run time required to complete the (see Supplementary Materials for the detailed description). This mapping) and RAM usage are presented in Table 1. STAR approach is similar to the npIDR concept extensively used in the achieves a speed of 550 million 2 76 Illumina paired-end analysis of the ENCODE ChIP-seq experiments (Landt et al., reads per hour using 12 threads (full capacity of the server), 2012). Supplementary Figure S8 shows the dependence of i.e. 45 million paired reads per hour per processor, outperform- npIDR¼ 0.1 on the read count per junction, providing a prin- ing the second fastest mapper (TopHat2) by a factor450. STAR cipled method for selecting the read count threshold with a desired exhibits close to linear scaling of the throughput rate with the level of reproducibility. For example, five staggered reads per junc- number of threads, losing 10% of per thread mapping speed tion are required to achieve an npIDR of 0.1, i.e. the 90% likeli- when the number of threads is increased from 6 to 12. hood that these junctions will be observed again in another STAR’s high mapping speed is traded off against RAM usage: experiment on the same cell line with the same sequencing depth. STAR requires 27 GB of RAM for aligning to the human Experimental validation was carried out on 1920 novel splice genome. Like all other aligners, with the exception of RUM, junctions in a wide range of RNA-seq reads support, both below the amount of RAM used by STAR does not increase signifi- and above the npIDR threshold. Only splice junctions mapped to cantly with the number of threads, as the SA is shared among all intergenic or antisense loci to Gencode 7 genes (Harrow et al., threads. Although STAR’s RAM requirements would have been 2012) were chosen for validation, as these junctions are more prohibitively expensive several years ago, at the time when the likely to be false positive than the junctions that map within the first short read aligners were developed, recent progress in semi- annotated genes. The high-throughput validation pipeline conductor technologies resulted in a substantial drop of RAM involved reverse transcription polymerase chain reaction amplifi- prices, and modern high performance servers are commonly cation of targeted regions followed by Roche 454 sequencing of equipped with RAM 432 GB. STAR has an option to use the pooled products. The reverse transcription polymerase chain sparse SAs, reducing the RAM consumption to 516 GB for reaction primer design took advantage of the250 nt insert length the human genome at the cost of 25% decrease in the mapping of the paired-end reads supporting targeted junctions, and en- speed, while maintaining the alignment accuracy. tailed the production of long 300–600 nt amplicons. These ampli- cons were pooled and sequenced by a Roche 454 sequencer to provide long and more confidently mappable reads that were Table 1. Mapping speed and RAM benchmarks on the experimental aligned to the genome with BLAT. Detailed description of the RNA-seq dataset experimental protocols can be found in Djebali et al.(2012). We selected 1920 intergenic and antisense splice junctions from Aligner Mapping speed: million Peak physical H1ES and HUVEC cell lines, including both highly read pairs/hour RAM, GB (npIDR50.1) and poorly (npIDR40.1) reproducible junctions. Of all the tested novel intergenic/antisense junctions supported 6 threads 12 threads 6 threads 12 threads by at least five RNA-seq reads (corresponding to npIDR50.1), 82–89% (H1ES) and 84–95% (HUVEC) were corroborated by STAR 309.2 549.9 27.0 28.4 at least two amplicons sequenced by 454 (Table 2). Notably, the STAR sparse 227.6 423.1 15.6 16.0 validation rate remains at a high level of 72% (H1ES) and 74% TopHat2 8.0 10.1 4.1 11.3 (HUVEC) even for the candidate junctions that were supported RUM 5.1 7.6 26.9 53.8 by as few as two RNA-seq reads. These results confirm high MapSplice 3.0 3.1 3.3 3.3 precision of the STAR’s splicing detection algorithm even for GSNAP 1.8 2.8 25.9 27.0 rare novel junctions. 19 A.Dobin et al. Table 2. Number of selected junctions and percentage of selected junctions that were validated by at least two 454 reads, as a function of the RNA-seq read count per junction H1ES HUVEC Read count per Number of Proportion of Read count Number of Proportion of junction from tested junctions validated per junction tested junctions junctions validated two replicates junctions by at least two from two replicates by at least two 454 reads (%) 454 reads (%) 2 192 72.4 2 192 74.0 3 192 77.6 3 192 75.0 4 96 74.0 4 96 76.0 5 96 82.3 5–6 96 84.4 6–7 96 79.2 7–8 96 84.4 8–11 96 81.3 9–12 96 86.5 12–24 96 87.5 13–23 96 94.8 25 96 88.5 24 96 90.6 The upper bound of the false discovery rate (FDR) can be One of the main inherent problems of all de novo RNA-seq estimated from the validation rate (VR) as FDR 1 VR. aligners is the inability to accurately detect splicing events that For low abundance junctions, the experimental FDR is lower involve short (55–10 nt) sequence overhangs on the donor or than the npIDR predicted from the dissimilarity between the acceptor sides of a junction. This causes a significant replicates: for example, although 45% of junctions, supported underdetection of splicing events, and also increases significantly the misalignment rate, as such reads are likely to be mapped with by just two reads, are not reproducible (Supplementary Fig. a few mismatches to a similar contiguous genomic region. In S8),470% of them are successfully validated (Table 2). Hence, addition, this effect also biases the alignments toward processed npIDR can serve as a conservative upper bound FDR estimate pseudogenes, which are abundant in the human genome. in cases where validation experiments are impractical. Similarly to other RNA-seq aligners, to mitigate this problem, STAR has an option to obtain information about possible splice junction loci from annotation databases (Supplementary Section 4DISCUSSION 4). It is also possible to run a second mapping pass, supplying it with splice junction loci found in the first mapping pass. In this Despite several years of ongoing improvements, alignment of the case, STAR will not discover any new junctions but will align non-contiguous RNA-seq reads to a reference genome is not a spliced reads with short overhangs across the previously detected solved problem yet, owing both to its intrinsic complexity and junctions. rapid transformations of the sequencing technologies. Several To demonstrate STAR’s ability to align long reads, we have critical problems have been found to afflict previously published mapped the long (0.5–5 kb) human mRNA sequences from approaches, such as high mapping error rate, alignment biases, GenBank (see Supplementary Section 5 for details). The accur- low sensitivity for unannotated transcripts, poor scalability with acy of STAR alignments is similar or higher than that of BLAT the read length, restrictions in the number of junctions/mis- (Kent, 2002) a popular EST/mRNA aligner. At the same time, matches/indels per read, inability to detect non-linear transcripts STAR outperforms BLAT by more than two orders of magni- (such as chimeric RNAs), and, crucially, low mapping tude in the alignment speed, which is important for throughput. high-throughput sequencing applications. In this work, we described STAR, a novel algorithm for align- The algorithm extensibility to long reads shows that STAR ing high-throughput long and short RNA-seq data to a reference has a potential to serve as a universal alignment tool across a genome, developed to overcome the aforementioned issues. broad spectrum of emerging sequencing platforms. STAR can Unlike many other RNA-seq mappers, STAR is not an extension align reads in a continuous streaming mode, which makes it of a short-read DNA mapper, but was developed as a compatible with novel sequencing technologies such as the one stand-alone Cþþ code. STAR is capable of running parallel recently announced by Oxford Nanopore Technologies. As the threads on multicore systems with close to linear scaling of prod- sequencing technologies and protocols evolve, new mapping stra- uctivity with the number of cores. STAR is fast: on a modern, tegies will have to be developed, and STAR core algorithm can but not overly expensive, 12-core server, it can align 550 million provide a flexible framework to address arising alignment 2 76 nt reads per hour to the human genome, surpassing all challenges. other existing RNA-seq aligners by a factor of 50. At the same time, STAR exhibits better alignment precision and sensitivity Data access than other RNA-seq aligners for both experimental and simu- GEO: GSE38886 (Roche 454 sequencing) lated data. GEO: GSE30567 (Illumina long RNA-Seq) 20 STAR Han,J. et al. (2011) Pre-mRNA splicing: where and when in the nucleus. Trends Funding: This work was funded by NHGRI (NIH) grant Cell. Biol., 21, 336–343. U54HG004557. Harrow,J. et al. (2012) GENCODE: The reference human genome annotation for the ENCODE project. Genome Res., 22, 1760–1774. Conflict of Interest: none declared. Hastings,M.L. and Krainer,A.R. (2001) Pre-mRNA splicing in the new millennium. Curr. Opin. Cell. Biol., 13, 302–309. Kurtz,S. et al. (2004) Versatile and open software for comparing large genomes. REFERENCES Genome Biol., 5,R12. Kent,W.S. (2002) BLAT–the BLAST-like alignment tool. Genome Res., 12, Au,K.F. et al. (2010) Detection of splice junctions from paired-end RNA-seq data 656–664. by SpliceMap. Nucleic Acids Res., 38, 4570–4578. Landt,S.G. et al. (2012) ChIP-seq guidelines and practices of the ENCODE and Darling,A.C. et al. (2004) Mauve: multiple alignment of conserved genomic se- modENCODE consortia. Genome Res., 22, 1813–1831. quence with rearrangements. Genome Res., 14, 1394–1403. Manber,U. and Myers,G. (1993) Suffix arrays—a new method for online string Darling,A.E. et al. (2010) progressiveMauve: multiple genome alignment with gene searches. SIAM J. Comput., 22, 935–948. gain, loss and rearrangement. PLoS One, 5, e11147. Parkhomchuk,D. et al. (2009) Transcriptome analysis by strand-specific sequencing De Bona,F. et al. (2008) Optimal spliced alignments of short sequence reads. of complementary DNA. Nucleic Acids Res., 37,e123. Bioinformatics, 24, i174–180. Rothberg,J.M. et al. (2011) An integrated semiconductor device enabling Delcher,A.L. et al. (1999) Alignment of whole genomes. Nucleic Acids Res., 27, non-optical genome sequencing. Nature, 475, 348–352. 2369–2376. Trapnell,C. et al. (2009) TopHat: discovering splice junctions with RNA-Seq. Delcher,A.L. et al. (2002) Fast algorithms for large-scale genome alignment and Bioinformatics, 25, 1105–1111. comparison. Nucleic Acids Res., 30, 2478–2483. Wang,K. et al. (2010) MapSplice: accurate mapping of RNA-seq reads for splice Djebali,S. et al. (2012) Landscape of transcription in human cells. Nature, 489, junction discovery. Nucleic Acids Res., 38,e178. 101–108. Wu,T.D. and Nacu,S. (2010) Fast and SNP-tolerant detection of complex variants Flusberg,B.A. et al. (2010) Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods, 7, 461–465. and splicing in short reads. Bioinformatics, 26, 873–881. Grant,G.R. et al. (2011) Comparative analysis of RNA-Seq alignment algo- Zhang,Y. et al. (2012) PASSion: a pattern growth algorithm-based pipeline for rithms and the RNA-Seq unified mapper (RUM). Bioinformatics, 27, splice junction detection in paired-end RNA-Seq data. Bioinformatics, 28, 2518–2528. 479–486.

Journal

Bioinformatics – Oxford University Press

Published: Oct 25, 2012

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

STAR: ultrafast universal RNA-seq aligner

STAR: ultrafast universal RNA-seq aligner

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

STAR: ultrafast universal RNA-seq aligner

STAR: ultrafast universal RNA-seq aligner

References (23)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies