PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text

Yuri Pirola; Raffaella Rizzi; Ernesto Picardi; Graziano Pesole; Gianluca Della Vedova; Paola Bonizzoni

doi:10.1186/1471-2105-13-S5-S2

PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text

Pirola, Yuri; Rizzi, Raffaella; Picardi, Ernesto; Pesole, Graziano; Della Vedova, Gianluca; Bonizzoni, Paola 2012-04-12 00:00:00 Background: A challenging issue in designing computational methods for predicting the gene structure into exons and introns from a cluster of transcript (EST, mRNA) sequences, is guaranteeing accuracy as well as efficiency in time and space, when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been faced by combining different tools, not specifically designed for this task. Results: We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those allowing to infer splice site junctions that are largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, that are sequences obtained from paths of a graph structure, called embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output. The method was implemented into the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides the prediction of the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts. Conclusions: PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/ PIntron under GNU AGPL. PIntron has been shown to outperform state-of-the-art methods, and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked with ENCODE annotations. Background complex regulatory system mediates the splicing process A key step in the post-transcriptional modification pro- which, under different conditions, may produce alterna- cess is called splicing and consists of the excision of the tive mature mRNAs (also called transcript isoforms) intronic regions of the premature mRNA (pre-mRNA) starting from a single pre-mRNA molecule. Alternative while the exonic regions are then reconnected to form a Splicing (AS), i.e. the production of alternative tran- single continuous molecule, the mature mRNA. A scripts from the same gene, is the main mechanism responsible for the expansion of the transcriptome (the set of transcripts generated by the genome of one * Correspondence: [email protected] organism) in eukaryotes and it is also involved in the † Contributed equally Dipartimento di Informatica Sistemistica e Comunicazione, Univ. degli Studi onset of several diseases [1]. di Milano-Bicocca, Milano, 20126, Italy Full list of author information is available at the end of the article © 2012 Pirola et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 2 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 A great extent of work has been performed to solve support a common putative gene structure. In [18] a two basic problems on AS: characterizing the exon- detailed discussion of this issue is provided. intron structure of a gene and finding the set of differ- ent transcript isoforms that are produced from the same Methods gene. Some computational approaches, based on tran- In this paper we show how to efficiently solve the inte- script data, for these crucial problems have been pro- gration of the two steps of finding the (possibly differ- posed; indeed good implementations are available [2-9]. ent) spliced alignments of a cluster of transcripts and Recently, some tools related to the problem, but limited using them to compute a common gene structure. to the specific task of predicting splice junctions from Overall, our new combinatorial method for exon- Next-Generation Sequencing (NGS) data, have been intron structure prediction can be summarized as a designed [10-13]. These tools are computationally inten- four-stage pipeline where we: sive and would require a post-processing step to filter the correct data that can be related to the alternative 1. Compute and implicitly represent all the spliced exon-intron structure of a gene. Moreover, the literature alignments of a transcript sequence (EST or mRNA) provides efficient solutions for computing a specific against a genomic reference sequence by a novel spliced alignment of an EST against the genome (for graph representation, called embedding graph,ofthe example Exonerate [14], GMAP [15] and Spaln [16]). common substrings of the transcripts and the gen- However these tools are designed to compute only ome. In this paper we provide efficient algorithms spliced alignments and not to directly provide the com- for building and, subsequently, visiting the embed- plete exon-intron structure of a gene and its full-length ding graph. isoforms. 2. Filter all biologically meaningful spliced align- In this paper we provide a specifically designed algo- ments. This step is performed with a carefully tai- rithm - efficient from both a theoretical and an empiri- lored visit of the embedding graph. cal point of view - to predict the exon-intron structure 3. Reconcile the spliced alignments of a set of corre- of a gene from general transcript data that is optimal lated transcript sequences into a maximum parsi- with respect to constraints derived by the input data. mony consensus gene structure. To complete this The algorithm is implemented in a tool, called PIntron. task we use the Minimum Factorization Agreement Similarly as recent programs [5,7], PIntron is a method (MFA) approach [17] applied to the data produced for exon-intron structure prediction, but differently by the previous step. Indeed, the MFA approach from these tools is able to efficiently process complex gives an effective method to amalgamate some genes or genes associated with a large cluster of ESTs. spliced alignments into a consensus gene structure Indeed, the accurate prediction of the exon-intron struc- (notice that an EST sequence only provides informa- ture of ageneisacomputationalhardtaskwhenthe tion on a partial region of the whole gene). redundancy of the information given by EST data must 4. Extract, classify, and refine the resulting introns in be taken into account. More precisely, combinatorial order to provide a putative gene structure supported methods for the problem are highly accurate when they by transcript evidences. are able to combine two different steps: (1) producing putative spliced alignments of ESTs against the gene We point out that our implementation also has a fifth region and (2) selecting among the different putative step where it predicts a set of full-length isoforms by spliced alignments of each EST those confirming the employing the graph-based method in [19]. same gene structure under some optimization criteria. Our method computes a consensus gene structure This second step has been proved to be NP-hard [17] minimizing the number of exons, called maximum par- thus requires efficient heuristics. simony consensus gene structure. Such a structure is On the other hand, finding putative spliced alignments strictly associated to a set of spliced alignments for each (first phase) could be a challenging task when more than sequence in the cluster of transcript data that is also one alignment exists for the same transcript. Indeed, for output by our algorithm. Informally, a gene structure instance, there could be different possible splicing junc- (depicted in Figure 1) is the description of the location tions between consecutive exons because of the pre- of coding (exon) and noncoding (intron) regions along sence sequencing errors or repeated genomic regions. the genomic sequence. Due to alternative splicing As a consequence, choosing the correct spliced align- events, such as exon skipping, intron retention and ment of a single EST sequence requires to perform a competing exons, a portion of the genomic sequence multiple comparison between several spliced alignments could be both coding and noncoding with respect to dif- of all the EST sequences in order to find the ones that ferent transcripts. Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 3 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Figure 1 The colored directed graph representing a gene structure. The represented gene structure, induced by compositions, is composed by 6 genomic exons: A, B, C, C’, D, E. Dashed edges represent noncoding regions, bold edges represent regions included into all the gene isoforms, and the remaining normal edges represent regions that are both coding and noncoding (i.e. are included into some gene isoform and are retained as a part of an intron into some other isoform). For clarity, we indicated an exon with a curve above the graph, and an intron with two connected segments below the graph. Observe that C and C’ are competing exons, while exons B and D are cassette exons. In this paper, we will evaluate all steps of the pipeline. compute, given a sequence P (the EST or the mRNA) Accuracy and efficiency of PIntron have been assessed and a reference sequence T (the genomic sequence), two sets F F = {f , ..., f } by an experimental comparison with ASPic [20] and ={f ,..., f }and T of strings P 1 k 1 k T = pf i f i ··· f i f s Exogean [21]. The experimental results show that PIn- such that P = f ... f , , and for 1 2 k−1 1 k 1 2 k−1 k tron is much faster than ASPic and competitive with each i, the edit distance between f and f is small. The Exogean. PIntron scales much better than Exogean (in sequence of pairs (f , f ) is called composition of P on T, terms of execution time) when processing genes with a each factor f is called spliced sequence factor (or EST large number of transcript sequences. The predictions factor), and each f is called genomic factor (or exon). made by PIntron are more accurate than those by ASPic Allowing a small edit distance between the two factors and Exogean. Moreover, PIntron is the only tool that is is justified by the fact that EST data contain mismatches able to successfully complete all genes that have been (deletions and insertions) against the genome because of considered. Finally, our results indicate that PIntron also sequencing errors and polymorphisms. Unfortunately, improves the reconstruction of exact transcripts when this also makes computationally harder the spliced compared with the other two tools. alignment problem, especially when the transcript and In this experimental comparison, we focused on the genomic sequence are large. human genes given their excellent annotation status. In our novel alignment method, we exploit the small (f , f ) However, PIntron has been conceived to facilitate gen- edit distance between each pair of corresponding ome annotation in a variety of organisms in which factors: in fact, in this case, there must exist a sequence expressed sequences as well as the reference genome are of some sufficiently long common substrings of the EST available. Given the experimental results we summarized factor f and the genomic factor .Wecallthe above, our program enables the investigation of the sequence of the occurrences of perfectly matching sub- impact of alternative splicing on large-scale. strings an embedding of the EST sequence P in the The rest of this section is devoted to present each genomic sequence and, clearly, it reveals the basic algorithmic step of our four-stage pipeline. “building blocks” of the spliced alignment. Our align- ment algorithm is based on the construction of a com- Implicit computation of spliced alignments pact and implicit representation of all the embeddings The first stage of our gene structure prediction method by means of a graph called embedding graph.Sucha computes the set of all possible spliced alignments of a graph can be efficiently computed from the EST transcript (EST or mRNA) sequenceagainst thegeno- sequence P and the genomic sequence T in time O(|P|+ mic sequence. |T|+ |V| ), where V is its vertex set, and it can be used A spliced alignment is a particular kind of alignment in the second stage of our pipeline in order to efficiently that takes into account the effects of the excision of the enumerate all the biologically meaningful compositions. intronic regions during the RNA splicing process. The In the following we detail the notion and construction spliced sequence alignment problem requires to of the embedding graph. Let us first recall, that Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 4 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 according to the traditional notation, given a string S = then ε is contained in (in short )ifand only if ε ε ε v v s s ... s ,wedenotewith|S| its length and with S[i, j] for each i in ε there exists a pairing in such that 1 2 q j v v the substring s s ... s . i .Given theset of the embeddings of P in T, i i+1 j E A fundamental notion is that of pairing of two strings. we say that is maximal iff there does not exist ε ∈ E More formally, a pairing (p, t, l) of two sequences P and ε = ε , ε = ε , such that . ε ε T (which generalizes the notion of pair of a sequence Not all embeddings induce a biologically meaningful [22]) represents the positions p on P and t on T of a composition. For example, an embedding made of sev- common substring P[p, p+l -1]= T[t, t+l -1] of P and eral short pairings “scattered” along the genome cannot T. In other words, a pairing (p, t, l) represents a com- be considered a valid spliced alignment. In order to mon substring x of P and T,called factor induced by restrict embeddings to be useful for building a spliced the pairing, such that x is of length l starting in posi- alignment, we fix three parameters ℓ , ℓ and ℓ .Intui- E D I tions p and t on P and T respectively. The positions p tively, the parameter ℓ is the minimum length of a and t are called starting positions, while p + l and t + l pairing, ℓ limits the maximum number of consecutive are called ending positions. mismatches that can appear in a single exon, and ℓ We say that a pairing v = (p , t , l ) is contained in a represents the minimum length of an intron. Then a 1 1 1 1 v p pairing (in short v v )ifthe positions and t of representative embedding is a maximal embedding 2 1 1 1 2 can be extended to the left or to the right on both ε = v , ... , v such that l ≥ ℓ , p -p - l ≤ ℓ ,and 1 m i E i+1 i i D the sequences P and T in order to obtain . Clearly, the either (i)|t -t -(p - p )| ≤ ℓ or (ii) t - t -(p i+1 i i+1 i D i+1 i i+1 factor induced by v is a substring of the factor induced - p ) ≥ ℓ is true. It is easy to see that only representative 1 i I by .Moreover, we say that v is a prefix-pairing (suf- embeddings might induce a biologically plausible v v fix-pairing,resp.)of iff v v and shares the composition. 2 1 2 1 same starting (ending, resp.) positions on and of . Indeed, a careful choice of the three parameters ℓ , ℓ P T 2 E D This fact implies that the factor induced by 1 is a prefix and ℓ allows to recover a spliced alignment of P in T (suffix, resp.) of the factor induced by 2on P and T.A with a fixed (small) error rate from some representative pairing v is maximal if and only if there does not exist a embeddings. Therefore, we propose the problem of find- distinct pairing containing v.Inother words, v is maxi- ing all representative embeddings of P in T, formalized mal if and only if the common factor induced by can- as the REPRESENTATIVE EMBEDDING problem (RE), not be “extended” neither to the left nor to the right on where we are given a pattern P, a text T, and three para- both P and T. meters ℓ , ℓ and ℓ . The goal is to compute the set E E D I r A sequence of non-overlapping pairings (i.e. pairings of the representative embeddings of P in T. that represent non-overlapping occurrences of common In this first stage of the pipeline, we tackle the RE substrings) is called an embedding (see Figure 2). Given problem by using the embedding graph defined as two embeddings ε = v , ... , v and ε = v , ··· , v , follows. 1 n m Figure 2 An embedding and its relationships with the genome and a transcript. The x ,...,x are substrings shared by the genome and the 1 9 transcript corresponding to pairings. Each common substring (pairing) is longer than a fixed threshold ℓ . Intuitively, when the distance (measured on the genome) between two consecutive pairings is smaller than ℓ then we assume that those pairings belong to the same exon. When the same distance is larger than ℓ then those pairings belong to different exons. I Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 5 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Definition (Embedding Graph). Given a pattern P and Such a procedure visits the embedding graph examin- atext T,the embedding graph of P in T is a directed ing and extracting only pairwise-distinct representative graph G =(V, E) such that the vertex set V is the set of embeddings that are biologically meaningful (for exam- maximal pairings of P and T that are longer than ℓ . ple with respect to the length of gaps representing Two pairings v =(p , t , l ) and v =(p , t , l ) are con- errors or introns). More precisely, the visit of a vertex k 1 1 1 1 2 2 2 2 nected by an edge (v , v ) Î E if and only if: (i) p -(p from the extended source s reconstructs the set of 1 2 2 1 E + l ) ≤ ℓ , and (ii) |t - t -(p - p )| ≤ ℓ or t - t - biologically meaningful representative embeddings that 1 D 2 1 2 1 D 2 1 P = s, v , ..., v (p - p ) ≥ ℓ . are induced by the path traversed dur- 2 1 I 1 k Basically the conditions of the definition of Embedding ing the visit of the embedding graph. Graph ensure the following crucial property: Two maxi- We will now explain the main steps of the procedure. mal pairings v and v are connected by an edge in the During the visit of vertex v , we examine each outgoing 1 2 embedding graph if and only if there exists a representa- edge (v , v )and we “extend” each embedding k k+1 tive embedding ε in which there are two consecutive ε = e , ... , e of . How the extension is performed 1 k pairings v and v such that v is contained in and depends on the relative position, on P and T,of e in ε i i+1 i v is contained in . and the new vertex v that are depicted in Figure 3. In k+1 i+1 We will use this property to build representative the exposition of the different possible cases, let e = embeddings from an embedding graph. Observe that (p , t , l ) and v =(p , t , l ). Observe that given k k k k+1 k+1 k+1 k+1 such a property derives from the maximality of the two pairings that are connected by an edge in the representative embeddings and from the uniqueness of embedding graph, the corresponding factors might be the maximal pairing containing a pairing which belongs overlapping in the text or in the pattern. To simplify the to a representative embedding. notation, in the following we identify a pairing with the We designed an algorithm that builds the embedding factor it induces. graph of a pattern P and a text T in time O(|T|+ |P|+| Case (a). Factors e and v overlap on both T and P. k k+1 V| ). The algorithm is composed of two steps. In the Two different sub-cases must be analyzed. The first case first step, the vertex set V is computed by visiting the occurs when the distance between the two initial posi- suffix tree of the text T. This step requires O(|T|) time tions of the factors e and v on P differs from the same k k+1 for the suffix tree construction and O(|P|+ |V|) time distance on T of a value (positive or negative) less than for the computation of maximal pairings. In the second ℓ , while the second case occurs when such a distance step, edges are then computed by checking the condi- differs of a value greater than ℓ . If the first case occurs tions of the definition of embedding graph on each pair when |(t -t )-(p - p )| ≤ ℓ then the two pairings k+1 k k+1 k D of maximal pairings, leading to an O(|V| )procedure. may belong to the same factor of the induced composi- Since the number of maximal pairings is usually very tion. Thus, the algorithm replaces pairing e in ε with the small compared to the length of P and T,the embed- shortest maximal prefix-pairing e of e and the longest ding graph construction procedure is efficient even on maximal suffix-pairing e of v such that they do not k+1 k+1 large patterns P and texts T. overlap and that both e and e are at least ℓ long. k+1 E Thesecond caseoccurswhen(t - t )- (p - p ) ≥ ℓ . k+1 k k+1 k I Extraction of relevant spliced alignments This case deserves a special discussion from the biologi- The next stage of our pipeline is devoted to analyzing cal point of view since it could be related to an intron as and mining the embedding graph to compute the repre- well as to a tandem repeat in T.Thenfactor e could be sentative embeddings that also induce distinct biologi- extended to include the repetition in v to produce a k+1 cally meaningful compositions. Indeed, it must be unique factor (exon) of the embedding ε. pointed out that different representative embeddings Case (b). Factors induced by e and v overlap in T k k+1 can induce the same compositions or spliced align- but not in P. This case is equivalent to the first sub-case ments. Algorithm ComputeCompositions is a two-step of Case (a). procedure. Initially it extracts a subset of representative Case (c). Factors e and v overlap in P but not in T. k k+1 embeddings by performing a visit of the embedding Just as in Case (a) two different sub-cases must be ana- graph. Then the algorithm computes the compositions lyzed, that is either |(t - t )-(p -p )| ≤ ℓ or t k+1 k k+1 k D k+1 by merging consecutive pairings that are separated by -t -(p - p ) ≥ ℓ . The first case is solved as in Case k k+1 k I short gaps. (a). Notice that when the second subcase occurs then Embedding graph visit the splice site placement is ambiguous because a suffix The first step of ComputeCompositions is a recursive of the donor exon is equal to a prefix of the acceptor visit of the embedding graph starting from a subset of exon. Also in this case, basic biological criteria are used vertices that we call extended sources. to reduce the impact of the ambiguity. Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 6 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Figure 3 Possible relative positions of two maximal pairings connected by an embedding graph edge. The figure presents the possible configurations of relative positions of two maximal pairings e =(p , t , l ) and v =(p , t , l ) connected by an embedding graph edge k k k k k+1 k+1 k+1 k+1 (e , v ). Each box represents a common maximal factor on T (top) and P (bottom) of a maximal pairing. Each maximal pairing is represented by k k+1 two boxes connected by lines (boxes representing e are in bold). For each case, t corresponds to the left border of the upper bold box, p is k k k the left border of the lower bold box, t is the left border of the upper normal box, and p is the left border of the lower normal box. k+1 k+1 Distance |(t - t )- (p - p )| has been represented by a double ended arrow, while factor overlaps are highlighted by grey shades. Four k+1 k k+1 k possible cases are presented: (a) e , v overlap on both T and P, (b) e , v overlap on T but not on P , (c) e , v overlap on P but not on T, k k+1 k k+1 k k+1 and (d) e , v do not overlap neither on T nor on P. k k+1 Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 7 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Case (d). Factors e and v do not overlap neither in composition is retained if the edit distance between k k+1 P nor in T.Let G and G be the two substrings which each EST factor and the corresponding genomic factor T P separate e and v in T and P, respectively. Since G is not larger than a fixed acceptable threshold. k k+1 P and G do not form a pairing, they must contain a cer- tain number of mismatches; we must determine if they Building a gene structure support the possibilities that (i) e and v are part of The first two stages of our pipeline are applied sepa- k k+1 the same factor or (ii) there is an intron between e and rately to each transcript sequence P of the input data (a k i v . Similarly to Case (a), two different sub-cases may genomic sequence T and a set S of transcripts) com- k+1 arise. If |(t - t )-(p - p )| ≤ ℓ ,then e and v puting a set C(P ) of biologically meaningful composi- k+1 k k+1 k D k k+1 i might belong to the same factor of the induced compo- tions for each P . The main goal of the third stage is to sition. More precisely, e and v belong to the same extract a composition for each transcript that explains k k+1 factor if the edit distance between G and G is below a the putative gene structure. As stated before, informally T P certain threshold - in which case v is added to a gene structure is the description of the location of k+1 embedding ε, otherwise the edge is discarded from the coding and noncoding regions along the genomic visit. Instead, if t -t - p + p ≥ ℓ , the two pairings sequence, where by a coding region we mean an exon k+1 k k+1 k I are separated by an intron, and we must determine the and by noncoding region we mean an intron. Note that splice sites of such an intron. In this case, the algorithm the boundaries between an exon and an intron is called G G computes a prefix and a suffix of G that mini- splice junction or splice site. T T T mize the edit distance between G and the concatena- We aim to produce a maximum parsimony consensus G G tion of and . Also in this case, if the resulting gene structure for which consists of a minimum set T T edit distance is larger than an acceptable threshold, the of genomic exons or coding regions compatible with a edge (v , v ) is discarded, otherwise v is added to ε. high quality composition C for each transcript data P . k k+1 k+1 i i Notice that computing the edit distance is not too The minimization criteria is used to avoid overpredic- expensive, since all strings involved are no longer than tion of splice junctions. For this task we propose a for- 2ℓ . malization of the problem of finding a putative gene The definition of embedding graph allows the pre- structure, called CONSENSUS GENE STRUCTURE sence of directed cycles, which potentially might be problem (CG) and discuss a solution of this problem. troublesome. However, we claim that the embeddings, The input of the CG problem consists of a set C(P )of computed from a path P containing a cycle C ,would compositions for each transcript P in a set S and a induce compositions with essentially the same set of fac- finite ordered set F = ⟨f , f ,..., f ⟩ of genomic factors 1 2 |F| tors of the compositions induced by the embeddings induced by the compositions in ∪C(P ). Ordering of fac- computed from the visit of the simple path P \C .The tors is assigned by considering their left splice junctions. visit performed in the first step of algorithm. Compute- Then CG asks for the minimum cardinality subset F’ of Compositions guarantees that each possible representa- F such each P has a composition with all genomic fac- tive embedding is analyzed. However, the biological tors in F ’.Inother words F ’ is the minimum set of criteria that we employ allow to consider only pairings exons explaining a spliced alignment of each EST data. belonging to biologically meaningful embeddings. Since Now, the CG problem can be faced by using the the visit computes pairwise-distinct representative approach [17] called Minimum Factorization Agreement embeddings and every case presented above requires O (MFA). More precisely, we use the MFA problem to (1) time, the overall computational complexity of the compute a gene structure minimizing the number of O( |ε|) visit is clearly bounded by , that is the total exons. ε∈E size of the representative embeddings that have been Let us recall the definition of the MFA problem. Let F computed during the visit. = ⟨f , f ,..., f ⟩ be a finite ordered set of sequences over 1 2 |F| Composition reconstruction alphabet Σ, called factors and let S be a set of sequences The set of representative embeddings computed by over alphabet Σ.Given a sequence s Î S,a factor-com- the visit of the embedding graph directly leads to a set position (f-composition in short) of s consists of the C of compositions. In fact, the visit guarantees that two sequence f = f , f , ··· , f such that s = f , f , ··· , f i i i i i i 1 2 n 1 2 n consecutive pairings of a representative embedding are and i <i for 1 ≤ j< n.Thenthe set {f , f , ··· , f } j j+1 i i i 1 2 n either separated by a small gap due to errors or by a is called the factor set of f and is denoted as F(f). While large gap representing an intron of the spliced align- the notion of f-composition depends on the set of fac- ments. Hence, the algorithm simply merges into a factor tors, such set of factors is usually clear from the context a sequence of factors induced by consecutive pairings v and is therefore omitted. Please notice that a sequence s =(p , t , l )and v =(p , t , l ) separated by can admit different f-compositions: thus let F(s)bethe k k k k+1 k+1 k+1 k+1 small gaps, that is |t - t - p + p | ≤ ℓ . Finally, the set of compositions of s. Moreover, by extension, we k+1 k k+1 k D Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 8 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 will denote by F(S)the set ∪ F(s) of all f-compositions which the excision is mediated by the major spliceoso- sÎS of a set S of sequences. Given a subset F’ ⊆ F of factors mal pathway or the minor spliceosomal pathway, respec- and the set F(S), then F’ is a factorization agreement set tively. Notice that RefSeq transcripts are usually full- for F(S) if and only if for each sequence s Î S,there length and error-free, that GT - AG, GC - AG and AT - exists a f-composition f in F(s) whose factor set is a sub- AC are the most frequent rules [23] and those rules are set of F’, i.e. F(f) ⊆ F’. associated to U12/U2 introns [24]. Hence we assume The Minimum Factorization Agreement problem, that only introns that do not follow one of the U12/U2 given a set F of factors and a set S of sequences, asks rulesand arenot supportedbyaRefSeqtranscript for a minimum cardinality subset F’ ⊆ F such that F’ is should be reduced. The input of our intron-reduction a factorization agreement set for F(S). Then the CG pro- procedure is a set X of pairs (i, s) computed by the pre- blem can be reduced to the MFA problem by posing S vious steps. Then, R is the set of pairs in X such that s to be the cluster of transcript sequences P and F is the is a RefSeq, C , C , C and N are the set of pairs in X \ i 1 2 3 set of all genomic factors (exons) used to produce the R following the GT - AG, GC - AG, AT - AC and a compositions C(P ) for each P ,i.e. F(S)consistsofall non-U12/U2 rule respectively. Our procedure basically i i the compositions of each sequence in S. Then the con- tries to reduce elements in N to some intron in R and, sensus gene structure consists of a minimum factoriza- if this is not possible, it tries to reduce to some element tion agreement set for the set of compositions of the in the first set of the sequence C , C , C that allows the 1 2 3 transcripts data. When solving the MFA problem on reduction. such data, the solution F’ provides a minimum set of factors explaining all transcript sequences and a single Results composition of each transcript can be obtained from set We implemented the approach described in the previous F’. section as a set of programs in the software package By applying the algorithm in [17] we can filter effi- PIntron. PIntron receives a genomic sequence and a set ciently a set of spliced alignments agreeing to the same of transcripts - ESTs and/or mRNAs - and computes a gene structure that are successively refined by the intron representation of the exon-intron structure of the gene reduction step. as well as a set of predicted full-length annotated iso- forms. PIntron outputs the list of the predicted introns Intron reduction with information such as relative and absolute start and Although the intron boundaries of the EST spliced com- end positions, intron lengths, the donor and the accep- positions are computed by finding the best transcript- tor splice sites, and intron types (U12, U2 or unclassi- genome alignment over the splice site regions and the fied). The output gives the composition as exons of each most frequent intron pattern (i.e. the first and the last isoform and, for each exon, the start and end positions two nucleotides of an intron) according to [23], the set as relative and absolute coordinates, if a polyA signal is of predicted introns may still contain false positives very present, and the length of 5’UTR and 3’UTR. Moreover close to true predictions. Thus, we designed a procedure several additional information are given for each pre- for comparing the intron set computed by the EST dicted isoform, such as its length, the CDS starting and spliced compositions in order to correct and reduce the ending positions, the RefSeqID (if it exists) and the set of false positives. length of the associated protein. In the following, let the pair (i, s) denotes a genomic PIntron source code and binaries are available under intron (eventually specified by a pair of genomic coordi- the GNU AGPLv3 license at http://www.algolab.eu/ nates) and a spliced composition of an EST s supporting PIntron. the intron i, i.e. the composition has two consecutive In the following, we discuss an experimental in-silico factors f , f inducing intron i when aligned to the gen- analysis on real human data aiming to evaluate our j j+1 ome. Then, given an error bound b, we say that (i, s)is approach. Such an experimental evaluation is organized b-reducible to (i’, s) iff there exists a boundary shift of in two parts. The first part has been designed to assess factors f and f of a new spliced composition of s the prediction accuracy of PIntron, while the aim of the j j+1 inducing intron i’ with at most b additional errors with second part is to show the scalability of our method and respect to the previous alignment of the two factors its effectiveness on genes that are very large or complex against the genome. To improve the accuracy of the and are currently outside the comfort zone of the most step, we also consider if the intron is supported by a used methods. RefSeq transcript and if it can be categorized as an U12/ We have assessed the accuracy achieved by PIntron by U2 intron. A RefSeq sequence is a validated full-length comparing it with ASPic [20] and Exogean [21]. In par- mRNA stored and annotated in the NCBI RefSeq data- ticular, ASPic is a well-established software to predict alternative isoforms by multiple EST/mRNA alignments base. U2 and U12 refers to two intron categories for Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 9 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 against the corresponding genomic regions. For each experimental evaluation along with some of their main input EST, the ASPic algorithm attempts to compute a characteristics. ESTs and mRNAs related to each gene single spliced alignment with the minimum number of were obtained from UniGene database. exons. Instead, PIntron implicitly provides several candi- The results of our first assessment are summarized in date spliced alignments for each EST, among which the Table 2, while the details are presented in Supplemen- best one is selected by using the MFA agreement tary Tables S.2, S.3, S.4 in Additional file 1. The three approach, thus allowing a greater accuracy in predicting tools have been evaluated according to two dimensions: prediction quality and time efficiency. The first impor- the putative gene structure. Moreover, PIntron is much tant observation is that only PIntron was able to predict faster than ASPic because of the more efficient data structure used for performing the EST alignments (i.e. the gene structures for all 112 ENCODE loci, while the embedding graph instead of the hash table of the ASPic and Exogean completed 93 and 104 genes, genomic seeds employed by ASPic). For this reason, respectively. Moreover, PIntron has been the fastest of ASPic requires a genomic sequence trimmed at the bor- the three in the experiment over the whole set of genes, ders of a single gene locus, while PIntron is able to effi- producing its results in about 49 minutes (on average 26 ciently process a large region of the genome (i.e. seconds per gene). On the genes that have been success- spanning tens of gene loci) and a large set of expressed fully processed, instead, Exogean took 57 minutes and sequences. ASPic more than 46 hours. Such results clearly indicate Exogean is a gene prediction tool based on pre-aligned a computational improvement of PIntron over Exogean (by Blat [25]) ESTs/mRNAs or proteins. Exogean and especially ASPic in processing genes that are critical resulted one of the most accurate gene finding system in terms of number of ESTs. Indeed Table 3 shows that in the last EGASP competition [26]. In Exogean, gene PIntron scales much better than Exogean and ASPic structures are reconstructed according to a graph-based when the number of transcripts is over 10,000, thus strategy mimicking the human annotation process. making our new software implementation particularly The accuracy assessment has been performed on 13 amenable to analyze large EST clusters. Notice that the ENCODE human regions [26] used as training set in the running time of Exogean includes the preprocessing EGASP competition. The regions have been chosen time required by Blat to align the transcripts. However, since they present different gene density and different the preprocessing time is almost negligible compared to conservation to the mouse genome. This dataset con- thetimerequiredbyExogean.In fact, Blat required tains 112 well-annotated gene loci, supported by 98, 064 approximately 4 minutes (7% of the total running time) to process all the genes. UniGene transcripts for a overall length of approxi- mately 62 Mb (Table 1). The 13 ENCODE regions Prediction quality has been evaluated by calculating represent, approximately, 8.5 Mb of the human genomic sensitivity (Sn) and specificity (Sp) between ENCODE sequence. Supplementary Table S.1 in Additional file 1 annotations and predictions at nucleotide, exon, intron, reports the complete list of the genes used in this and transcript level, according to Burset and Guigó [27]. We adhered to the nomenclature established in the lit- erature aimed to the evaluation of gene structure predic- Table 1 Main characteristics of the dataset used for the tion tools, even if the definition of specificity that we accuracy assessment of PIntron use here is called positive predictive value in statistical Region Genomic Number Number of Overall transcript length (nt) of genes transcripts length (nt) Table 2 Summary of the experimental results on the 112 ENm004 1,700,000 18 6,964 4,497,709 gene loci on the 13 ENCODE regions ENm006 1,338,447 35 18,230 11,377,148 PIntron Exogean ASPic ENr111 500,000 2 171 113,356 Exon level Sn 0.529 0.444 0.390 ENr114 500,000 1 35 120,734 Sp 0.622 0.606 0.427 ENr132 500,000 4 855 551,266 ENr222 500,000 2 461 277,554 Intron level Sn 0.874 0.733 0.633 ENr223 500,000 5 50,607 32,732,634 Sp 0.789 0.777 0.567 ENr231 500,000 11 5,637 3,534,406 Transcript level Sn 0.564 0.251 0.342 ENr232 500,000 9 4,779 2,505,934 Sp 0.418 0.450 0.252 ENr323 500,000 5 1,670 997,647 Nucleotide level Sn 0.889 0.657 0.635 ENr324 500,000 1 487 343,220 Sp 0.916 0.865 0.632 ENr333 500,000 12 7,179 4,381,534 Annotated genes 112 104 93 ENr334 500,000 7 989 611,795 Total running time (seconds) 2,961 3,446 168,607 Total 8,538,447 112 98,064 62,044,937 The best value of each row is highlighted in boldface. Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 10 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Table 3 Running times of PIntron and Exogean on the 26 gene structure (several tens of exons), or (2) a particu- “critical” genes larly large cluster of expressed sequences, or (3) a large Gene Genomic Number of Running time (seconds) genomic sequence. length (nt) transcripts To this aim, we selected 26 “critical” genes and we PIntron Exogean processed them with PIntron and Exogean on a 4-node ACTB 36,634 26,248 287.35 371.22 linux cluster running CentOS 5.5. Each node is ALB 24,299 16,920 144.17 369.38 equipped with a quad-core 2.40 GHz CPU and 32 GiB ANKS1B 1,258,645 406 15.60 0.92 of RAM. The genomic sequence has an average length ANXA1 512,535 2,087 20.65 7.63 of about 848 Kb, and is longer than 1 Mb for 11 of the ATP1A1 619,226 3,241 27.82 11.90 26 genes. Moreover, the selected genes have on average ATP5A1 405,213 9,864 143.33 70.93 more than 5,000 transcripts, and 5 genes have more CDH13 1,169,823 507 10.34 1.02 than 15,000 transcripts. The total running time was 65 CNTNAP2 2,304,964 227 30.86 1.01 minutes for PIntron and 48 minutes for Exogean. In this CTNNA2 1,463,710 261 12.71 0.96 evaluation, we did not take into account ASPic since it CUGBP2 1,081,163 864 18.04 2.42 was not able to give a solution for any of these genes DAB1 1,551,956 164 14.51 0.85 within an acceptable time. Table 3 reports the complete DLG2 2,172,263 279 21.18 1.15 list of genes considered in this experimental part along DMD 2,241,933 329 35.35 2.21 with their main characteristics and the running times of ENO1 185,661 13,131 119.84 125.51 PIntron and Exogean. While Exogean and PIntron run- FGG 579,042 2,033 15.40 3.56 ning times were both acceptable, PIntron averaged 149 FHIT ( ) 1,502,110 134 202.35 n.a. sec/gene and Exogean 109 sec/gene. This is remarkable, GAPDH 46,975 15,518 149.64 232.81 since Exogean is based on the fast progressive EST-to- HINT1 873,331 844 12.02 3.08 genome mapping program Blat and does not take into HSP90AA1 384,611 6,710 47.37 13.87 account potential alignment errors at splicing sites HSPA8 90,642 15,850 118.47 152.84 which, in turn, is likely to result in predictions that are KCNIP4 1,220,613 107 10.09 0.65 not as accurate as those given by PIntron. The compari- MBP 154,857 21,071 251.70 1,344.42 son of running times confirms our previous observation: NCAM1 317,404 1,293 12.54 1.63 PIntron, although slower than Exogean on genes with RPL3 187,677 12,208 90.15 108.12 small transcript clusters, scales significantly better than TBC1D22A 1,378,585 467 115.99 2.27 Exogean when the cluster size increases. In fact, PIntron TTN 304,814 1,349 1,952.58 6.77 was systematically faster than Exogean on the subset of genes whose transcript cluster is composed by more Total 22,068,686 152,112 3,880.05 2,837.94 † than 10, 000 sequences (genes ACTB, ALB, ENO1, Exogean did not successfully compute a gene structure for FHIT. GAPDH, HSPA8, MBP,and RPL3), while it was slower than Exogean on the other genes. In almost all the cases where PIntron was slower than Exogean, the difference literature [28]. As shown in Table 2 and Figure 4, PIn- between the running times of the two tools is small. tron appears the most accurate program at diverse pre- Thus the running time of PIntron can be considered diction levels. Moreover, PIntron exhibits sensitivity and acceptable also on these genes. One notable exception is specificity levels that are quite similar. This fact, which gene TTN where PIntron took about 32 minutes to pre- is highly desirable in any prediction tool, shows that dict the gene structure, while Exogean required only a PIntron does not advantage any of them to the detri- few seconds. The likely reason is that the input tran- ment of the other one. In addition, our results (see the average sensitivity at transcript level in Table 2) indicate script set of TTN contains sequences that are more than that PIntron improves the reconstruction of exact tran- 80 Kb long. Since EST sequences have a lower quality scripts when compared with ASPic and Exogean. More- than mRNA sequences, computing their spliced align- over, we want to recall that PIntron has completed the ment requires a considerable amount of computational analysis of all 112 input genes, while Exogean and ASPic resources. did not complete the task for 8 and 29 genes We want to point out that our second experiment has respectively. limited scope. In fact a complete comparison of PIntron Our second experimental analysis is devoted to evalu- and Exogean would also include the accuracy dimen- ating the efficiency and the scalability of our approach sions. The results of the first experiment suggests that PIntron is more accurate than Exogean. If confirmed, on a subset of critical human genes that are particularly the greater accuracy would justify the small increase in hard to analyze with the currently available programs the running times that we have observed. because those genes have (1) a particularly complex Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 11 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Figure 4 Accuracy achieved by PIntron, Exogean and ASPic at various levels. The boxplot presents the distribution of specificity and sensitivity achieved by the three tools at the exon, intron, transcript and nucleotide levels. The vertical edges of the boxes represent the first quartile, the median and the third quartiles (from left to right). The cross is the average. The vertical dashed lines represent an estimate of the 95% confidence interval of the median. The circles are all the outliers with respect to such confidence interval. The analysis of the running times of the first and the spliced alignments of a transcript against a genome, and second part of the experimentation has not shown any an efficient algorithm that exploits the inherent redun- significant correlation between the length of the genes dancy of information in a cluster of transcripts to select, and the running times, hence confirming our conjecture among all possible factorizations of EST sequences, that the behavior of our algorithm depends on some those allowing to infer splice site junctions that are lar- properties of the Embedding Graph, and not on the size gely confirmed by the input data. PIntron is freely avail- of the instance. In particular, the structure of the able at http://www.algolab.eu/PIntron under GNU Embedding Graph is strictly related to the quality of the Affero General Public Licence (AGPL). The experimen- transcripts and to the presence of repetitions and highly tal evaluation of PIntron has shown that it has been duplicated regions in the genomic sequence that, in able to compute accurate predictions (whose level is turn, could influence the size of the graph. Also these comparable with that of other prediction tools) while results have confirmed our beliefs, since the average achieving a good scalability to critical genes, especially if running time of the second experiment (149 sec/gene) is associated with a large transcript cluster. not too far from the running times on the smaller genes of the first experiment, where the average value is 26 Additional material sec/gene. A fundamental observation is that PIntron has successfully completed the analysis of all 26 “critical” Additional file 1: Supplementary tables. Characteristics of the first dataset and detailed results obtained in the experimental comparison. genes, while Exogean did not complete the analysis for FHIT. Conclusions Acknowledgements We thank Marcello Varisco for the implementation of some parts of the In this work, we presented a new computational pipeline pipeline. This research was supported in part by FAR MIUR 60% grant - PIntron - for predicting the gene structure into exons “Algorithmic methods and combinatorial structures in Bioinformatics” (Univ. and introns from a cluster of transcript (EST, mRNA) di Milano-Bicocca) to YP, RR, GDV, and PB, grant “Dote ricerca applicata” 21_ARA (FSE, Regione Lombardia) to YP, and Ministero dell’Istruzione, sequences. PIntron combines two ideas: a novel algo- dell’Università e della Ricerca, Italy: Fondo Italiano Ricerca di Base, rithm of proved small time complexity for computing Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 12 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 “Laboratorio Internazionale di Bioinformatica” (LIBI), “Laboratorio di 15. Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment Bioinformatica per la Biodiversità Molecolare” (DM19410), PRIN 2009; program for mRNA and EST sequence. Bioinformatics 2005, Progetto Strategico Regione Puglia PS 012; Progetto EPIGEN (CNR) to GP. 21(9):1859-1875. This article has been published as part of BMC Bioinformatics Volume 13 16. Gotoh O: A space-efficient and accurate method for mapping and Supplement 5, 2012: Selected articles from the First IEEE International aligning cDNA sequences onto genomic sequence. Nucleic Acids Research Conference on Computational Advances in Bio and medical Sciences 2008, 36(8):2630-2638. (ICCABS 2011): Bioinformatics. The full contents of the supplement are 17. Bonizzoni P, Della Vedova G, Dondi R, Pirola Y, Rizzi R: Minimum available online at http://www.biomedcentral.com/bmcbioinformatics/ factorization agreement of spliced ESTs. In Proc 9th International supplements/13/S5. Workshop on Algorithms in Bioinformatics (WABI), Volume 5724 of LNCS. Springer;Salzberg SL, Warnow T 2009:1-12[http://dx.doi.org/10.1007/978-3- Author details 642-04241-6_1]. Dipartimento di Informatica Sistemistica e Comunicazione, Univ. degli Studi 18. Bonizzoni P, Rizzi R, Pesole G: Computational methods for alternative di Milano-Bicocca, Milano, 20126, Italy. Centro Ricerche e Studi splicing prediction. Briefings in Functional Genomics and Proteomics 2006, Agroalimentari, Parco Tecnologico Padano, Lodi, 26900, Italy. Dipartimento 5(1):46-51. di Biochimica e Biologia Molecolare “E. Quagliariello”, Univ. degli Studi di 19. Bonizzoni P, Mauri G, Pesole G, Picardi E, Pirola Y, Rizzi R: Detecting Bari, Bari, 70126, Italy. Istituto di Biomembrane e Bioenergetica, Consiglio alternative gene structures from spliced ESTs: a computational Nazionale delle Ricerche, Bari, 70126, Italy. Dipartimento di Statistica, Univ. approach. Journal of Computational Biology 2009, 16(1):43-66. degli Studi di Milano-Bicocca, Milano, 20126, Italy. 20. Bonizzoni P, Rizzi R, Pesole G: ASPIC: a novel method to predict the exon- intron structure of a gene that is optimally compatible to a set of Authors’ contributions transcript sequences. BMC Bioinformatics 2005, 6:244. YP and RR designed the algorithm, developed the pipeline, designed and 21. Djebali S, Delaplace F, Crollius HR: Exogean: a framework for annotating helped to perform the experiments, and drafted the manuscript. EP helped protein-coding genes in eukaryotic genomic DNA. Genome Biology 2006, to design and to perform the experiments, and interpreted the results. GP 7(Suppl 1):S7. helped to design the experiments and supervised the interpretation of the 22. Gusfield D: Algorithms on Strings, Trees and Sequences: Computer Science and results. GDV helped to design the algorithm, to develop the pipeline, and to Computational Biology Cambridge: Cambridge University Press; 1997. draft the manuscript. PB designed the algorithm, helped to draft the 23. Burset M, Seledtsov I, Solovyev V: Analysis of canonical and non-canonical manuscript, and supervised the research. All authors read and approved the splice sites in mammalian genomes. Nucleic Acids Research 2000, final manuscript. 28(21):4364-4375. 24. Sheth N, Roca X, Hastings ML, Roeder T, Krainer AR, Sachidanandam R: Competing interests Comprehensive splice-site analysis using comparative genomics. Nucleic The authors declare that they have no competing interests. Acids Research 2006, 34(14):3955-3967. 25. Kent JJ: BLAT-the BLAST-like alignment tool. Genome Research 2002, Published: 12 April 2012 12(4):656-664. 26. Guigó R, Flicek P, Abril J, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, References Harrow J, Hubbard T, Lewis SE, Reese MG: EGASP: the human ENCODE 1. Caceres J, Kornblihtt A: Alternative splicing: multiple control mechanisms Genome Annotation Assessment Project. Genome Biology 2006, 7(Suppl and involvement in human disease. Trends Genet 2002, 18(4):186-193. 1):S2. 2. Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA: Splicing graphs and 27. Burset M, Guigo R: Evaluation of Gene Structure Prediction Programs. EST assembly problem. Bioinformatics 2002, 18(Suppl 1):S181-S188. Genomics 1996, 34:353-357. 3. Leipzig J, Pevzner P, Heber S: The Alternative Splicing Gallery (ASG): 28. Altman DG, Bland JM: Statistics Notes: Diagnostic tests 1: sensitivity and bridging the gap between genome and transcriptome. Nucleic Acids specificity. BMJ 1994, 308(6943):1552. Research 2004, 32(13):3977-3983. 4. Xing Y, Resch A, Lee C: The multiassembly problem: reconstructing doi:10.1186/1471-2105-13-S5-S2 multiple transcript isoforms from EST fragment mixtures. Genome Cite this article as: Pirola et al.: PIntron: a fast method for detecting the Research 2004, 14(3):426-441. gene structure due to alternative splicing via maximal pairings of a 5. Kim N, Shin S, Lee S: ECgene: genome-based EST clustering and gene pattern and a text. BMC Bioinformatics 2012 13(Suppl 5):S2. modeling for alternative splicing. Genome Research 2005, 15(4):566-576. 6. Eyras E, Caccamo M, Curwen V, Clamp M: ESTGenes: alternative splicing from ESTs in Ensembl. Genome Research 2004, 14(5):976-987. 7. Castrignanò T, Rizzi R, Talamo IG, D’Onorio De Meo P, Anselmo A, Bonizzoni P, Pesole G: ASPIC: a web resource for alternative splicing prediction and transcript isoforms characterization. Nucleic Acids Research 2006, 34(Suppl 2):W440-W443. 8. Kan Z, Rouchka EC, Gish WR, States DJ: Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Research 2001, 11(5):889-900. 9. Gupta S, Zink D, Korn B, Vingron M, Haas S: Genome wide identification and classification of alternative splicing based on EST data. Bioinformatics Submit your next manuscript to BioMed Central 2004, 20(16):2579-2585. 10. De Bona F, Ossowski S, Schneeberger K, Rätsch G: Optimal spliced and take full advantage of: alignments of short sequence reads. Bioinformatics 2008, 24:i174-i180. 11. Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions • Convenient online submission with RNA-Seq. Bioinformatics 2009, 25(9):1105-1111. • Thorough peer review 12. Bryant DW, Shen R, Priest HD, Wong WK, Mockler TC: Supersplat–spliced RNA-seq alignment. Bioinformatics 2010, 26(12):1500-1505. • No space constraints or color ﬁgure charges 13. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, • Immediate publication on acceptance Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, • Inclusion in PubMed, CAS, Scopus and Google Scholar Liu J: MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research 2010, 38(18):e178. • Research which is freely available for redistribution 14. Slater G, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005, 6:31. Submit your manuscript at www.biomedcentral.com/submit http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals http://www.deepdyve.com/lp/springer-journals/pintron-a-fast-method-for-detecting-the-gene-structure-due-to-RTZ73NyVDs

Loading next page...

References (29)

(2006)
Exogean: a framework for annotating protein-coding genes in eukaryotic genomic DNA
Genome Biology, 7
(2008)
A space-efficient and accurate method for mapping and aligning cDNA sequences onto genomic sequence
Nucleic Acids Research, 36
(1994)
Statistics Notes: Diagnostic tests 1: sensitivity and specificity
BMJ, 308
(2004)
The Alternative Splicing Gallery (ASG): bridging the gap between genome and transcriptome
Nucleic Acids Research, 32
(2002)
Splicing graphs and EST assembly problem
Bioinformatics, 18
(2000)
Analysis of canonical and non-canonical splice sites in mammalian genomes
Nucleic Acids Research, 28
(2004)
Genome wide identification and classification of alternative splicing based on EST data
Bioinformatics, 20
(2009)
TopHat: discovering splice junctions with RNA-Seq
Bioinformatics, 25
(2004)
ESTGenes: alternative splicing from ESTs in Ensembl
Genome Research, 14
(1996)
Evaluation of Gene Structure Prediction Programs
Genomics, 34
(2006)
EGASP: the human ENCODE Genome Annotation Assessment Project
Genome Biology, 7
(2002)
BLAT-the BLAST-like alignment tool
Genome Research, 12
(2009)
Minimum factorization agreement of spliced ESTs
Proc 9th International Workshop on Algorithms in Bioinformatics (WABI), Volume 5724 of LNCS
(2005)
Automated generation of heuristics for biological sequence comparison
BMC Bioinformatics, 6
(2006)
Computational methods for alternative splicing prediction
Briefings in Functional Genomics and Proteomics, 5
(2004)
The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures
Genome Research, 14
D Gusfield (1997)
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology
P Bonizzoni, G Della Vedova, R Dondi, Y Pirola, R Rizzi (2009)
Proc 9th International Workshop on Algorithms in Bioinformatics (WABI), Volume 5724 of LNCS
(2001)
Gene structure prediction and alternative splicing analysis using genomically aligned ESTs
Genome Research, 11
(2010)
MapSplice: accurate mapping of RNA-seq reads for splice junction discovery
Nucleic Acids Research, 38
(2006)
ASPIC: a web resource for alternative splicing prediction and transcript isoforms characterization
Nucleic Acids Research, 34
(2008)
Optimal spliced alignments of short sequence reads
Bioinformatics, 24
(2010)
Supersplat--spliced RNA-seq alignment
Bioinformatics, 26
(2005)
ASPIC: a novel method to predict the exon-intron structure of a gene that is optimally compatible to a set of transcript sequences
BMC Bioinformatics, 6
(2006)
Comprehensive splice-site analysis using comparative genomics
Nucleic Acids Research, 34
(2005)
ECgene: genome-based EST clustering and gene modeling for alternative splicing
Genome Research, 15
(2002)
Alternative splicing: multiple control mechanisms and involvement in human disease
Trends Genet, 18
(2005)
GMAP: a genomic mapping and alignment program for mRNA and EST sequence
Bioinformatics, 21
(2009)
Detecting alternative gene structures from spliced ESTs: a computational approach
Journal of Computational Biology, 16

Publisher: Springer Journals
Copyright: Copyright © 2012 by Pirola et al.; licensee BioMed Central Ltd.
Subject: Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN: 1471-2105
DOI: 10.1186/1471-2105-13-S5-S2
pmid: 22537006
Publisher site: See Article on Publisher Site

Abstract

Background: A challenging issue in designing computational methods for predicting the gene structure into exons and introns from a cluster of transcript (EST, mRNA) sequences, is guaranteeing accuracy as well as efficiency in time and space, when large clusters of more than 20,000 ESTs and genes longer than 1 Mb are processed. Traditionally, the problem has been faced by combining different tools, not specifically designed for this task. Results: We propose a fast method based on ad hoc procedures for solving the problem. Our method combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those allowing to infer splice site junctions that are largely confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, that are sequences obtained from paths of a graph structure, called embedding graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the length of P and T and in the size of the output. The method was implemented into the PIntron package. PIntron requires as input a genomic sequence or region and a set of EST and/or mRNA sequences. Besides the prediction of the full-length transcript isoforms potentially expressed by the gene, the PIntron package includes a module for the CDS annotation of the predicted transcripts. Conclusions: PIntron, the software tool implementing our methodology, is available at http://www.algolab.eu/ PIntron under GNU AGPL. PIntron has been shown to outperform state-of-the-art methods, and to quickly process some critical genes. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when benchmarked with ENCODE annotations. Background complex regulatory system mediates the splicing process A key step in the post-transcriptional modification pro- which, under different conditions, may produce alterna- cess is called splicing and consists of the excision of the tive mature mRNAs (also called transcript isoforms) intronic regions of the premature mRNA (pre-mRNA) starting from a single pre-mRNA molecule. Alternative while the exonic regions are then reconnected to form a Splicing (AS), i.e. the production of alternative tran- single continuous molecule, the mature mRNA. A scripts from the same gene, is the main mechanism responsible for the expansion of the transcriptome (the set of transcripts generated by the genome of one * Correspondence: [email protected] organism) in eukaryotes and it is also involved in the † Contributed equally Dipartimento di Informatica Sistemistica e Comunicazione, Univ. degli Studi onset of several diseases [1]. di Milano-Bicocca, Milano, 20126, Italy Full list of author information is available at the end of the article © 2012 Pirola et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 2 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 A great extent of work has been performed to solve support a common putative gene structure. In [18] a two basic problems on AS: characterizing the exon- detailed discussion of this issue is provided. intron structure of a gene and finding the set of differ- ent transcript isoforms that are produced from the same Methods gene. Some computational approaches, based on tran- In this paper we show how to efficiently solve the inte- script data, for these crucial problems have been pro- gration of the two steps of finding the (possibly differ- posed; indeed good implementations are available [2-9]. ent) spliced alignments of a cluster of transcripts and Recently, some tools related to the problem, but limited using them to compute a common gene structure. to the specific task of predicting splice junctions from Overall, our new combinatorial method for exon- Next-Generation Sequencing (NGS) data, have been intron structure prediction can be summarized as a designed [10-13]. These tools are computationally inten- four-stage pipeline where we: sive and would require a post-processing step to filter the correct data that can be related to the alternative 1. Compute and implicitly represent all the spliced exon-intron structure of a gene. Moreover, the literature alignments of a transcript sequence (EST or mRNA) provides efficient solutions for computing a specific against a genomic reference sequence by a novel spliced alignment of an EST against the genome (for graph representation, called embedding graph,ofthe example Exonerate [14], GMAP [15] and Spaln [16]). common substrings of the transcripts and the gen- However these tools are designed to compute only ome. In this paper we provide efficient algorithms spliced alignments and not to directly provide the com- for building and, subsequently, visiting the embed- plete exon-intron structure of a gene and its full-length ding graph. isoforms. 2. Filter all biologically meaningful spliced align- In this paper we provide a specifically designed algo- ments. This step is performed with a carefully tai- rithm - efficient from both a theoretical and an empiri- lored visit of the embedding graph. cal point of view - to predict the exon-intron structure 3. Reconcile the spliced alignments of a set of corre- of a gene from general transcript data that is optimal lated transcript sequences into a maximum parsi- with respect to constraints derived by the input data. mony consensus gene structure. To complete this The algorithm is implemented in a tool, called PIntron. task we use the Minimum Factorization Agreement Similarly as recent programs [5,7], PIntron is a method (MFA) approach [17] applied to the data produced for exon-intron structure prediction, but differently by the previous step. Indeed, the MFA approach from these tools is able to efficiently process complex gives an effective method to amalgamate some genes or genes associated with a large cluster of ESTs. spliced alignments into a consensus gene structure Indeed, the accurate prediction of the exon-intron struc- (notice that an EST sequence only provides informa- ture of ageneisacomputationalhardtaskwhenthe tion on a partial region of the whole gene). redundancy of the information given by EST data must 4. Extract, classify, and refine the resulting introns in be taken into account. More precisely, combinatorial order to provide a putative gene structure supported methods for the problem are highly accurate when they by transcript evidences. are able to combine two different steps: (1) producing putative spliced alignments of ESTs against the gene We point out that our implementation also has a fifth region and (2) selecting among the different putative step where it predicts a set of full-length isoforms by spliced alignments of each EST those confirming the employing the graph-based method in [19]. same gene structure under some optimization criteria. Our method computes a consensus gene structure This second step has been proved to be NP-hard [17] minimizing the number of exons, called maximum par- thus requires efficient heuristics. simony consensus gene structure. Such a structure is On the other hand, finding putative spliced alignments strictly associated to a set of spliced alignments for each (first phase) could be a challenging task when more than sequence in the cluster of transcript data that is also one alignment exists for the same transcript. Indeed, for output by our algorithm. Informally, a gene structure instance, there could be different possible splicing junc- (depicted in Figure 1) is the description of the location tions between consecutive exons because of the pre- of coding (exon) and noncoding (intron) regions along sence sequencing errors or repeated genomic regions. the genomic sequence. Due to alternative splicing As a consequence, choosing the correct spliced align- events, such as exon skipping, intron retention and ment of a single EST sequence requires to perform a competing exons, a portion of the genomic sequence multiple comparison between several spliced alignments could be both coding and noncoding with respect to dif- of all the EST sequences in order to find the ones that ferent transcripts. Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 3 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Figure 1 The colored directed graph representing a gene structure. The represented gene structure, induced by compositions, is composed by 6 genomic exons: A, B, C, C’, D, E. Dashed edges represent noncoding regions, bold edges represent regions included into all the gene isoforms, and the remaining normal edges represent regions that are both coding and noncoding (i.e. are included into some gene isoform and are retained as a part of an intron into some other isoform). For clarity, we indicated an exon with a curve above the graph, and an intron with two connected segments below the graph. Observe that C and C’ are competing exons, while exons B and D are cassette exons. In this paper, we will evaluate all steps of the pipeline. compute, given a sequence P (the EST or the mRNA) Accuracy and efficiency of PIntron have been assessed and a reference sequence T (the genomic sequence), two sets F F = {f , ..., f } by an experimental comparison with ASPic [20] and ={f ,..., f }and T of strings P 1 k 1 k T = pf i f i ··· f i f s Exogean [21]. The experimental results show that PIn- such that P = f ... f , , and for 1 2 k−1 1 k 1 2 k−1 k tron is much faster than ASPic and competitive with each i, the edit distance between f and f is small. The Exogean. PIntron scales much better than Exogean (in sequence of pairs (f , f ) is called composition of P on T, terms of execution time) when processing genes with a each factor f is called spliced sequence factor (or EST large number of transcript sequences. The predictions factor), and each f is called genomic factor (or exon). made by PIntron are more accurate than those by ASPic Allowing a small edit distance between the two factors and Exogean. Moreover, PIntron is the only tool that is is justified by the fact that EST data contain mismatches able to successfully complete all genes that have been (deletions and insertions) against the genome because of considered. Finally, our results indicate that PIntron also sequencing errors and polymorphisms. Unfortunately, improves the reconstruction of exact transcripts when this also makes computationally harder the spliced compared with the other two tools. alignment problem, especially when the transcript and In this experimental comparison, we focused on the genomic sequence are large. human genes given their excellent annotation status. In our novel alignment method, we exploit the small (f , f ) However, PIntron has been conceived to facilitate gen- edit distance between each pair of corresponding ome annotation in a variety of organisms in which factors: in fact, in this case, there must exist a sequence expressed sequences as well as the reference genome are of some sufficiently long common substrings of the EST available. Given the experimental results we summarized factor f and the genomic factor .Wecallthe above, our program enables the investigation of the sequence of the occurrences of perfectly matching sub- impact of alternative splicing on large-scale. strings an embedding of the EST sequence P in the The rest of this section is devoted to present each genomic sequence and, clearly, it reveals the basic algorithmic step of our four-stage pipeline. “building blocks” of the spliced alignment. Our align- ment algorithm is based on the construction of a com- Implicit computation of spliced alignments pact and implicit representation of all the embeddings The first stage of our gene structure prediction method by means of a graph called embedding graph.Sucha computes the set of all possible spliced alignments of a graph can be efficiently computed from the EST transcript (EST or mRNA) sequenceagainst thegeno- sequence P and the genomic sequence T in time O(|P|+ mic sequence. |T|+ |V| ), where V is its vertex set, and it can be used A spliced alignment is a particular kind of alignment in the second stage of our pipeline in order to efficiently that takes into account the effects of the excision of the enumerate all the biologically meaningful compositions. intronic regions during the RNA splicing process. The In the following we detail the notion and construction spliced sequence alignment problem requires to of the embedding graph. Let us first recall, that Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 4 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 according to the traditional notation, given a string S = then ε is contained in (in short )ifand only if ε ε ε v v s s ... s ,wedenotewith|S| its length and with S[i, j] for each i in ε there exists a pairing in such that 1 2 q j v v the substring s s ... s . i .Given theset of the embeddings of P in T, i i+1 j E A fundamental notion is that of pairing of two strings. we say that is maximal iff there does not exist ε ∈ E More formally, a pairing (p, t, l) of two sequences P and ε = ε , ε = ε , such that . ε ε T (which generalizes the notion of pair of a sequence Not all embeddings induce a biologically meaningful [22]) represents the positions p on P and t on T of a composition. For example, an embedding made of sev- common substring P[p, p+l -1]= T[t, t+l -1] of P and eral short pairings “scattered” along the genome cannot T. In other words, a pairing (p, t, l) represents a com- be considered a valid spliced alignment. In order to mon substring x of P and T,called factor induced by restrict embeddings to be useful for building a spliced the pairing, such that x is of length l starting in posi- alignment, we fix three parameters ℓ , ℓ and ℓ .Intui- E D I tions p and t on P and T respectively. The positions p tively, the parameter ℓ is the minimum length of a and t are called starting positions, while p + l and t + l pairing, ℓ limits the maximum number of consecutive are called ending positions. mismatches that can appear in a single exon, and ℓ We say that a pairing v = (p , t , l ) is contained in a represents the minimum length of an intron. Then a 1 1 1 1 v p pairing (in short v v )ifthe positions and t of representative embedding is a maximal embedding 2 1 1 1 2 can be extended to the left or to the right on both ε = v , ... , v such that l ≥ ℓ , p -p - l ≤ ℓ ,and 1 m i E i+1 i i D the sequences P and T in order to obtain . Clearly, the either (i)|t -t -(p - p )| ≤ ℓ or (ii) t - t -(p i+1 i i+1 i D i+1 i i+1 factor induced by v is a substring of the factor induced - p ) ≥ ℓ is true. It is easy to see that only representative 1 i I by .Moreover, we say that v is a prefix-pairing (suf- embeddings might induce a biologically plausible v v fix-pairing,resp.)of iff v v and shares the composition. 2 1 2 1 same starting (ending, resp.) positions on and of . Indeed, a careful choice of the three parameters ℓ , ℓ P T 2 E D This fact implies that the factor induced by 1 is a prefix and ℓ allows to recover a spliced alignment of P in T (suffix, resp.) of the factor induced by 2on P and T.A with a fixed (small) error rate from some representative pairing v is maximal if and only if there does not exist a embeddings. Therefore, we propose the problem of find- distinct pairing containing v.Inother words, v is maxi- ing all representative embeddings of P in T, formalized mal if and only if the common factor induced by can- as the REPRESENTATIVE EMBEDDING problem (RE), not be “extended” neither to the left nor to the right on where we are given a pattern P, a text T, and three para- both P and T. meters ℓ , ℓ and ℓ . The goal is to compute the set E E D I r A sequence of non-overlapping pairings (i.e. pairings of the representative embeddings of P in T. that represent non-overlapping occurrences of common In this first stage of the pipeline, we tackle the RE substrings) is called an embedding (see Figure 2). Given problem by using the embedding graph defined as two embeddings ε = v , ... , v and ε = v , ··· , v , follows. 1 n m Figure 2 An embedding and its relationships with the genome and a transcript. The x ,...,x are substrings shared by the genome and the 1 9 transcript corresponding to pairings. Each common substring (pairing) is longer than a fixed threshold ℓ . Intuitively, when the distance (measured on the genome) between two consecutive pairings is smaller than ℓ then we assume that those pairings belong to the same exon. When the same distance is larger than ℓ then those pairings belong to different exons. I Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 5 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Definition (Embedding Graph). Given a pattern P and Such a procedure visits the embedding graph examin- atext T,the embedding graph of P in T is a directed ing and extracting only pairwise-distinct representative graph G =(V, E) such that the vertex set V is the set of embeddings that are biologically meaningful (for exam- maximal pairings of P and T that are longer than ℓ . ple with respect to the length of gaps representing Two pairings v =(p , t , l ) and v =(p , t , l ) are con- errors or introns). More precisely, the visit of a vertex k 1 1 1 1 2 2 2 2 nected by an edge (v , v ) Î E if and only if: (i) p -(p from the extended source s reconstructs the set of 1 2 2 1 E + l ) ≤ ℓ , and (ii) |t - t -(p - p )| ≤ ℓ or t - t - biologically meaningful representative embeddings that 1 D 2 1 2 1 D 2 1 P = s, v , ..., v (p - p ) ≥ ℓ . are induced by the path traversed dur- 2 1 I 1 k Basically the conditions of the definition of Embedding ing the visit of the embedding graph. Graph ensure the following crucial property: Two maxi- We will now explain the main steps of the procedure. mal pairings v and v are connected by an edge in the During the visit of vertex v , we examine each outgoing 1 2 embedding graph if and only if there exists a representa- edge (v , v )and we “extend” each embedding k k+1 tive embedding ε in which there are two consecutive ε = e , ... , e of . How the extension is performed 1 k pairings v and v such that v is contained in and depends on the relative position, on P and T,of e in ε i i+1 i v is contained in . and the new vertex v that are depicted in Figure 3. In k+1 i+1 We will use this property to build representative the exposition of the different possible cases, let e = embeddings from an embedding graph. Observe that (p , t , l ) and v =(p , t , l ). Observe that given k k k k+1 k+1 k+1 k+1 such a property derives from the maximality of the two pairings that are connected by an edge in the representative embeddings and from the uniqueness of embedding graph, the corresponding factors might be the maximal pairing containing a pairing which belongs overlapping in the text or in the pattern. To simplify the to a representative embedding. notation, in the following we identify a pairing with the We designed an algorithm that builds the embedding factor it induces. graph of a pattern P and a text T in time O(|T|+ |P|+| Case (a). Factors e and v overlap on both T and P. k k+1 V| ). The algorithm is composed of two steps. In the Two different sub-cases must be analyzed. The first case first step, the vertex set V is computed by visiting the occurs when the distance between the two initial posi- suffix tree of the text T. This step requires O(|T|) time tions of the factors e and v on P differs from the same k k+1 for the suffix tree construction and O(|P|+ |V|) time distance on T of a value (positive or negative) less than for the computation of maximal pairings. In the second ℓ , while the second case occurs when such a distance step, edges are then computed by checking the condi- differs of a value greater than ℓ . If the first case occurs tions of the definition of embedding graph on each pair when |(t -t )-(p - p )| ≤ ℓ then the two pairings k+1 k k+1 k D of maximal pairings, leading to an O(|V| )procedure. may belong to the same factor of the induced composi- Since the number of maximal pairings is usually very tion. Thus, the algorithm replaces pairing e in ε with the small compared to the length of P and T,the embed- shortest maximal prefix-pairing e of e and the longest ding graph construction procedure is efficient even on maximal suffix-pairing e of v such that they do not k+1 k+1 large patterns P and texts T. overlap and that both e and e are at least ℓ long. k+1 E Thesecond caseoccurswhen(t - t )- (p - p ) ≥ ℓ . k+1 k k+1 k I Extraction of relevant spliced alignments This case deserves a special discussion from the biologi- The next stage of our pipeline is devoted to analyzing cal point of view since it could be related to an intron as and mining the embedding graph to compute the repre- well as to a tandem repeat in T.Thenfactor e could be sentative embeddings that also induce distinct biologi- extended to include the repetition in v to produce a k+1 cally meaningful compositions. Indeed, it must be unique factor (exon) of the embedding ε. pointed out that different representative embeddings Case (b). Factors induced by e and v overlap in T k k+1 can induce the same compositions or spliced align- but not in P. This case is equivalent to the first sub-case ments. Algorithm ComputeCompositions is a two-step of Case (a). procedure. Initially it extracts a subset of representative Case (c). Factors e and v overlap in P but not in T. k k+1 embeddings by performing a visit of the embedding Just as in Case (a) two different sub-cases must be ana- graph. Then the algorithm computes the compositions lyzed, that is either |(t - t )-(p -p )| ≤ ℓ or t k+1 k k+1 k D k+1 by merging consecutive pairings that are separated by -t -(p - p ) ≥ ℓ . The first case is solved as in Case k k+1 k I short gaps. (a). Notice that when the second subcase occurs then Embedding graph visit the splice site placement is ambiguous because a suffix The first step of ComputeCompositions is a recursive of the donor exon is equal to a prefix of the acceptor visit of the embedding graph starting from a subset of exon. Also in this case, basic biological criteria are used vertices that we call extended sources. to reduce the impact of the ambiguity. Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 6 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Figure 3 Possible relative positions of two maximal pairings connected by an embedding graph edge. The figure presents the possible configurations of relative positions of two maximal pairings e =(p , t , l ) and v =(p , t , l ) connected by an embedding graph edge k k k k k+1 k+1 k+1 k+1 (e , v ). Each box represents a common maximal factor on T (top) and P (bottom) of a maximal pairing. Each maximal pairing is represented by k k+1 two boxes connected by lines (boxes representing e are in bold). For each case, t corresponds to the left border of the upper bold box, p is k k k the left border of the lower bold box, t is the left border of the upper normal box, and p is the left border of the lower normal box. k+1 k+1 Distance |(t - t )- (p - p )| has been represented by a double ended arrow, while factor overlaps are highlighted by grey shades. Four k+1 k k+1 k possible cases are presented: (a) e , v overlap on both T and P, (b) e , v overlap on T but not on P , (c) e , v overlap on P but not on T, k k+1 k k+1 k k+1 and (d) e , v do not overlap neither on T nor on P. k k+1 Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 7 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Case (d). Factors e and v do not overlap neither in composition is retained if the edit distance between k k+1 P nor in T.Let G and G be the two substrings which each EST factor and the corresponding genomic factor T P separate e and v in T and P, respectively. Since G is not larger than a fixed acceptable threshold. k k+1 P and G do not form a pairing, they must contain a cer- tain number of mismatches; we must determine if they Building a gene structure support the possibilities that (i) e and v are part of The first two stages of our pipeline are applied sepa- k k+1 the same factor or (ii) there is an intron between e and rately to each transcript sequence P of the input data (a k i v . Similarly to Case (a), two different sub-cases may genomic sequence T and a set S of transcripts) com- k+1 arise. If |(t - t )-(p - p )| ≤ ℓ ,then e and v puting a set C(P ) of biologically meaningful composi- k+1 k k+1 k D k k+1 i might belong to the same factor of the induced compo- tions for each P . The main goal of the third stage is to sition. More precisely, e and v belong to the same extract a composition for each transcript that explains k k+1 factor if the edit distance between G and G is below a the putative gene structure. As stated before, informally T P certain threshold - in which case v is added to a gene structure is the description of the location of k+1 embedding ε, otherwise the edge is discarded from the coding and noncoding regions along the genomic visit. Instead, if t -t - p + p ≥ ℓ , the two pairings sequence, where by a coding region we mean an exon k+1 k k+1 k I are separated by an intron, and we must determine the and by noncoding region we mean an intron. Note that splice sites of such an intron. In this case, the algorithm the boundaries between an exon and an intron is called G G computes a prefix and a suffix of G that mini- splice junction or splice site. T T T mize the edit distance between G and the concatena- We aim to produce a maximum parsimony consensus G G tion of and . Also in this case, if the resulting gene structure for which consists of a minimum set T T edit distance is larger than an acceptable threshold, the of genomic exons or coding regions compatible with a edge (v , v ) is discarded, otherwise v is added to ε. high quality composition C for each transcript data P . k k+1 k+1 i i Notice that computing the edit distance is not too The minimization criteria is used to avoid overpredic- expensive, since all strings involved are no longer than tion of splice junctions. For this task we propose a for- 2ℓ . malization of the problem of finding a putative gene The definition of embedding graph allows the pre- structure, called CONSENSUS GENE STRUCTURE sence of directed cycles, which potentially might be problem (CG) and discuss a solution of this problem. troublesome. However, we claim that the embeddings, The input of the CG problem consists of a set C(P )of computed from a path P containing a cycle C ,would compositions for each transcript P in a set S and a induce compositions with essentially the same set of fac- finite ordered set F = ⟨f , f ,..., f ⟩ of genomic factors 1 2 |F| tors of the compositions induced by the embeddings induced by the compositions in ∪C(P ). Ordering of fac- computed from the visit of the simple path P \C .The tors is assigned by considering their left splice junctions. visit performed in the first step of algorithm. Compute- Then CG asks for the minimum cardinality subset F’ of Compositions guarantees that each possible representa- F such each P has a composition with all genomic fac- tive embedding is analyzed. However, the biological tors in F ’.Inother words F ’ is the minimum set of criteria that we employ allow to consider only pairings exons explaining a spliced alignment of each EST data. belonging to biologically meaningful embeddings. Since Now, the CG problem can be faced by using the the visit computes pairwise-distinct representative approach [17] called Minimum Factorization Agreement embeddings and every case presented above requires O (MFA). More precisely, we use the MFA problem to (1) time, the overall computational complexity of the compute a gene structure minimizing the number of O( |ε|) visit is clearly bounded by , that is the total exons. ε∈E size of the representative embeddings that have been Let us recall the definition of the MFA problem. Let F computed during the visit. = ⟨f , f ,..., f ⟩ be a finite ordered set of sequences over 1 2 |F| Composition reconstruction alphabet Σ, called factors and let S be a set of sequences The set of representative embeddings computed by over alphabet Σ.Given a sequence s Î S,a factor-com- the visit of the embedding graph directly leads to a set position (f-composition in short) of s consists of the C of compositions. In fact, the visit guarantees that two sequence f = f , f , ··· , f such that s = f , f , ··· , f i i i i i i 1 2 n 1 2 n consecutive pairings of a representative embedding are and i <i for 1 ≤ j< n.Thenthe set {f , f , ··· , f } j j+1 i i i 1 2 n either separated by a small gap due to errors or by a is called the factor set of f and is denoted as F(f). While large gap representing an intron of the spliced align- the notion of f-composition depends on the set of fac- ments. Hence, the algorithm simply merges into a factor tors, such set of factors is usually clear from the context a sequence of factors induced by consecutive pairings v and is therefore omitted. Please notice that a sequence s =(p , t , l )and v =(p , t , l ) separated by can admit different f-compositions: thus let F(s)bethe k k k k+1 k+1 k+1 k+1 small gaps, that is |t - t - p + p | ≤ ℓ . Finally, the set of compositions of s. Moreover, by extension, we k+1 k k+1 k D Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 8 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 will denote by F(S)the set ∪ F(s) of all f-compositions which the excision is mediated by the major spliceoso- sÎS of a set S of sequences. Given a subset F’ ⊆ F of factors mal pathway or the minor spliceosomal pathway, respec- and the set F(S), then F’ is a factorization agreement set tively. Notice that RefSeq transcripts are usually full- for F(S) if and only if for each sequence s Î S,there length and error-free, that GT - AG, GC - AG and AT - exists a f-composition f in F(s) whose factor set is a sub- AC are the most frequent rules [23] and those rules are set of F’, i.e. F(f) ⊆ F’. associated to U12/U2 introns [24]. Hence we assume The Minimum Factorization Agreement problem, that only introns that do not follow one of the U12/U2 given a set F of factors and a set S of sequences, asks rulesand arenot supportedbyaRefSeqtranscript for a minimum cardinality subset F’ ⊆ F such that F’ is should be reduced. The input of our intron-reduction a factorization agreement set for F(S). Then the CG pro- procedure is a set X of pairs (i, s) computed by the pre- blem can be reduced to the MFA problem by posing S vious steps. Then, R is the set of pairs in X such that s to be the cluster of transcript sequences P and F is the is a RefSeq, C , C , C and N are the set of pairs in X \ i 1 2 3 set of all genomic factors (exons) used to produce the R following the GT - AG, GC - AG, AT - AC and a compositions C(P ) for each P ,i.e. F(S)consistsofall non-U12/U2 rule respectively. Our procedure basically i i the compositions of each sequence in S. Then the con- tries to reduce elements in N to some intron in R and, sensus gene structure consists of a minimum factoriza- if this is not possible, it tries to reduce to some element tion agreement set for the set of compositions of the in the first set of the sequence C , C , C that allows the 1 2 3 transcripts data. When solving the MFA problem on reduction. such data, the solution F’ provides a minimum set of factors explaining all transcript sequences and a single Results composition of each transcript can be obtained from set We implemented the approach described in the previous F’. section as a set of programs in the software package By applying the algorithm in [17] we can filter effi- PIntron. PIntron receives a genomic sequence and a set ciently a set of spliced alignments agreeing to the same of transcripts - ESTs and/or mRNAs - and computes a gene structure that are successively refined by the intron representation of the exon-intron structure of the gene reduction step. as well as a set of predicted full-length annotated iso- forms. PIntron outputs the list of the predicted introns Intron reduction with information such as relative and absolute start and Although the intron boundaries of the EST spliced com- end positions, intron lengths, the donor and the accep- positions are computed by finding the best transcript- tor splice sites, and intron types (U12, U2 or unclassi- genome alignment over the splice site regions and the fied). The output gives the composition as exons of each most frequent intron pattern (i.e. the first and the last isoform and, for each exon, the start and end positions two nucleotides of an intron) according to [23], the set as relative and absolute coordinates, if a polyA signal is of predicted introns may still contain false positives very present, and the length of 5’UTR and 3’UTR. Moreover close to true predictions. Thus, we designed a procedure several additional information are given for each pre- for comparing the intron set computed by the EST dicted isoform, such as its length, the CDS starting and spliced compositions in order to correct and reduce the ending positions, the RefSeqID (if it exists) and the set of false positives. length of the associated protein. In the following, let the pair (i, s) denotes a genomic PIntron source code and binaries are available under intron (eventually specified by a pair of genomic coordi- the GNU AGPLv3 license at http://www.algolab.eu/ nates) and a spliced composition of an EST s supporting PIntron. the intron i, i.e. the composition has two consecutive In the following, we discuss an experimental in-silico factors f , f inducing intron i when aligned to the gen- analysis on real human data aiming to evaluate our j j+1 ome. Then, given an error bound b, we say that (i, s)is approach. Such an experimental evaluation is organized b-reducible to (i’, s) iff there exists a boundary shift of in two parts. The first part has been designed to assess factors f and f of a new spliced composition of s the prediction accuracy of PIntron, while the aim of the j j+1 inducing intron i’ with at most b additional errors with second part is to show the scalability of our method and respect to the previous alignment of the two factors its effectiveness on genes that are very large or complex against the genome. To improve the accuracy of the and are currently outside the comfort zone of the most step, we also consider if the intron is supported by a used methods. RefSeq transcript and if it can be categorized as an U12/ We have assessed the accuracy achieved by PIntron by U2 intron. A RefSeq sequence is a validated full-length comparing it with ASPic [20] and Exogean [21]. In par- mRNA stored and annotated in the NCBI RefSeq data- ticular, ASPic is a well-established software to predict alternative isoforms by multiple EST/mRNA alignments base. U2 and U12 refers to two intron categories for Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 9 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 against the corresponding genomic regions. For each experimental evaluation along with some of their main input EST, the ASPic algorithm attempts to compute a characteristics. ESTs and mRNAs related to each gene single spliced alignment with the minimum number of were obtained from UniGene database. exons. Instead, PIntron implicitly provides several candi- The results of our first assessment are summarized in date spliced alignments for each EST, among which the Table 2, while the details are presented in Supplemen- best one is selected by using the MFA agreement tary Tables S.2, S.3, S.4 in Additional file 1. The three approach, thus allowing a greater accuracy in predicting tools have been evaluated according to two dimensions: prediction quality and time efficiency. The first impor- the putative gene structure. Moreover, PIntron is much tant observation is that only PIntron was able to predict faster than ASPic because of the more efficient data structure used for performing the EST alignments (i.e. the gene structures for all 112 ENCODE loci, while the embedding graph instead of the hash table of the ASPic and Exogean completed 93 and 104 genes, genomic seeds employed by ASPic). For this reason, respectively. Moreover, PIntron has been the fastest of ASPic requires a genomic sequence trimmed at the bor- the three in the experiment over the whole set of genes, ders of a single gene locus, while PIntron is able to effi- producing its results in about 49 minutes (on average 26 ciently process a large region of the genome (i.e. seconds per gene). On the genes that have been success- spanning tens of gene loci) and a large set of expressed fully processed, instead, Exogean took 57 minutes and sequences. ASPic more than 46 hours. Such results clearly indicate Exogean is a gene prediction tool based on pre-aligned a computational improvement of PIntron over Exogean (by Blat [25]) ESTs/mRNAs or proteins. Exogean and especially ASPic in processing genes that are critical resulted one of the most accurate gene finding system in terms of number of ESTs. Indeed Table 3 shows that in the last EGASP competition [26]. In Exogean, gene PIntron scales much better than Exogean and ASPic structures are reconstructed according to a graph-based when the number of transcripts is over 10,000, thus strategy mimicking the human annotation process. making our new software implementation particularly The accuracy assessment has been performed on 13 amenable to analyze large EST clusters. Notice that the ENCODE human regions [26] used as training set in the running time of Exogean includes the preprocessing EGASP competition. The regions have been chosen time required by Blat to align the transcripts. However, since they present different gene density and different the preprocessing time is almost negligible compared to conservation to the mouse genome. This dataset con- thetimerequiredbyExogean.In fact, Blat required tains 112 well-annotated gene loci, supported by 98, 064 approximately 4 minutes (7% of the total running time) to process all the genes. UniGene transcripts for a overall length of approxi- mately 62 Mb (Table 1). The 13 ENCODE regions Prediction quality has been evaluated by calculating represent, approximately, 8.5 Mb of the human genomic sensitivity (Sn) and specificity (Sp) between ENCODE sequence. Supplementary Table S.1 in Additional file 1 annotations and predictions at nucleotide, exon, intron, reports the complete list of the genes used in this and transcript level, according to Burset and Guigó [27]. We adhered to the nomenclature established in the lit- erature aimed to the evaluation of gene structure predic- Table 1 Main characteristics of the dataset used for the tion tools, even if the definition of specificity that we accuracy assessment of PIntron use here is called positive predictive value in statistical Region Genomic Number Number of Overall transcript length (nt) of genes transcripts length (nt) Table 2 Summary of the experimental results on the 112 ENm004 1,700,000 18 6,964 4,497,709 gene loci on the 13 ENCODE regions ENm006 1,338,447 35 18,230 11,377,148 PIntron Exogean ASPic ENr111 500,000 2 171 113,356 Exon level Sn 0.529 0.444 0.390 ENr114 500,000 1 35 120,734 Sp 0.622 0.606 0.427 ENr132 500,000 4 855 551,266 ENr222 500,000 2 461 277,554 Intron level Sn 0.874 0.733 0.633 ENr223 500,000 5 50,607 32,732,634 Sp 0.789 0.777 0.567 ENr231 500,000 11 5,637 3,534,406 Transcript level Sn 0.564 0.251 0.342 ENr232 500,000 9 4,779 2,505,934 Sp 0.418 0.450 0.252 ENr323 500,000 5 1,670 997,647 Nucleotide level Sn 0.889 0.657 0.635 ENr324 500,000 1 487 343,220 Sp 0.916 0.865 0.632 ENr333 500,000 12 7,179 4,381,534 Annotated genes 112 104 93 ENr334 500,000 7 989 611,795 Total running time (seconds) 2,961 3,446 168,607 Total 8,538,447 112 98,064 62,044,937 The best value of each row is highlighted in boldface. Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 10 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Table 3 Running times of PIntron and Exogean on the 26 gene structure (several tens of exons), or (2) a particu- “critical” genes larly large cluster of expressed sequences, or (3) a large Gene Genomic Number of Running time (seconds) genomic sequence. length (nt) transcripts To this aim, we selected 26 “critical” genes and we PIntron Exogean processed them with PIntron and Exogean on a 4-node ACTB 36,634 26,248 287.35 371.22 linux cluster running CentOS 5.5. Each node is ALB 24,299 16,920 144.17 369.38 equipped with a quad-core 2.40 GHz CPU and 32 GiB ANKS1B 1,258,645 406 15.60 0.92 of RAM. The genomic sequence has an average length ANXA1 512,535 2,087 20.65 7.63 of about 848 Kb, and is longer than 1 Mb for 11 of the ATP1A1 619,226 3,241 27.82 11.90 26 genes. Moreover, the selected genes have on average ATP5A1 405,213 9,864 143.33 70.93 more than 5,000 transcripts, and 5 genes have more CDH13 1,169,823 507 10.34 1.02 than 15,000 transcripts. The total running time was 65 CNTNAP2 2,304,964 227 30.86 1.01 minutes for PIntron and 48 minutes for Exogean. In this CTNNA2 1,463,710 261 12.71 0.96 evaluation, we did not take into account ASPic since it CUGBP2 1,081,163 864 18.04 2.42 was not able to give a solution for any of these genes DAB1 1,551,956 164 14.51 0.85 within an acceptable time. Table 3 reports the complete DLG2 2,172,263 279 21.18 1.15 list of genes considered in this experimental part along DMD 2,241,933 329 35.35 2.21 with their main characteristics and the running times of ENO1 185,661 13,131 119.84 125.51 PIntron and Exogean. While Exogean and PIntron run- FGG 579,042 2,033 15.40 3.56 ning times were both acceptable, PIntron averaged 149 FHIT ( ) 1,502,110 134 202.35 n.a. sec/gene and Exogean 109 sec/gene. This is remarkable, GAPDH 46,975 15,518 149.64 232.81 since Exogean is based on the fast progressive EST-to- HINT1 873,331 844 12.02 3.08 genome mapping program Blat and does not take into HSP90AA1 384,611 6,710 47.37 13.87 account potential alignment errors at splicing sites HSPA8 90,642 15,850 118.47 152.84 which, in turn, is likely to result in predictions that are KCNIP4 1,220,613 107 10.09 0.65 not as accurate as those given by PIntron. The compari- MBP 154,857 21,071 251.70 1,344.42 son of running times confirms our previous observation: NCAM1 317,404 1,293 12.54 1.63 PIntron, although slower than Exogean on genes with RPL3 187,677 12,208 90.15 108.12 small transcript clusters, scales significantly better than TBC1D22A 1,378,585 467 115.99 2.27 Exogean when the cluster size increases. In fact, PIntron TTN 304,814 1,349 1,952.58 6.77 was systematically faster than Exogean on the subset of genes whose transcript cluster is composed by more Total 22,068,686 152,112 3,880.05 2,837.94 † than 10, 000 sequences (genes ACTB, ALB, ENO1, Exogean did not successfully compute a gene structure for FHIT. GAPDH, HSPA8, MBP,and RPL3), while it was slower than Exogean on the other genes. In almost all the cases where PIntron was slower than Exogean, the difference literature [28]. As shown in Table 2 and Figure 4, PIn- between the running times of the two tools is small. tron appears the most accurate program at diverse pre- Thus the running time of PIntron can be considered diction levels. Moreover, PIntron exhibits sensitivity and acceptable also on these genes. One notable exception is specificity levels that are quite similar. This fact, which gene TTN where PIntron took about 32 minutes to pre- is highly desirable in any prediction tool, shows that dict the gene structure, while Exogean required only a PIntron does not advantage any of them to the detri- few seconds. The likely reason is that the input tran- ment of the other one. In addition, our results (see the average sensitivity at transcript level in Table 2) indicate script set of TTN contains sequences that are more than that PIntron improves the reconstruction of exact tran- 80 Kb long. Since EST sequences have a lower quality scripts when compared with ASPic and Exogean. More- than mRNA sequences, computing their spliced align- over, we want to recall that PIntron has completed the ment requires a considerable amount of computational analysis of all 112 input genes, while Exogean and ASPic resources. did not complete the task for 8 and 29 genes We want to point out that our second experiment has respectively. limited scope. In fact a complete comparison of PIntron Our second experimental analysis is devoted to evalu- and Exogean would also include the accuracy dimen- ating the efficiency and the scalability of our approach sions. The results of the first experiment suggests that PIntron is more accurate than Exogean. If confirmed, on a subset of critical human genes that are particularly the greater accuracy would justify the small increase in hard to analyze with the currently available programs the running times that we have observed. because those genes have (1) a particularly complex Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 11 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 Figure 4 Accuracy achieved by PIntron, Exogean and ASPic at various levels. The boxplot presents the distribution of specificity and sensitivity achieved by the three tools at the exon, intron, transcript and nucleotide levels. The vertical edges of the boxes represent the first quartile, the median and the third quartiles (from left to right). The cross is the average. The vertical dashed lines represent an estimate of the 95% confidence interval of the median. The circles are all the outliers with respect to such confidence interval. The analysis of the running times of the first and the spliced alignments of a transcript against a genome, and second part of the experimentation has not shown any an efficient algorithm that exploits the inherent redun- significant correlation between the length of the genes dancy of information in a cluster of transcripts to select, and the running times, hence confirming our conjecture among all possible factorizations of EST sequences, that the behavior of our algorithm depends on some those allowing to infer splice site junctions that are lar- properties of the Embedding Graph, and not on the size gely confirmed by the input data. PIntron is freely avail- of the instance. In particular, the structure of the able at http://www.algolab.eu/PIntron under GNU Embedding Graph is strictly related to the quality of the Affero General Public Licence (AGPL). The experimen- transcripts and to the presence of repetitions and highly tal evaluation of PIntron has shown that it has been duplicated regions in the genomic sequence that, in able to compute accurate predictions (whose level is turn, could influence the size of the graph. Also these comparable with that of other prediction tools) while results have confirmed our beliefs, since the average achieving a good scalability to critical genes, especially if running time of the second experiment (149 sec/gene) is associated with a large transcript cluster. not too far from the running times on the smaller genes of the first experiment, where the average value is 26 Additional material sec/gene. A fundamental observation is that PIntron has successfully completed the analysis of all 26 “critical” Additional file 1: Supplementary tables. Characteristics of the first dataset and detailed results obtained in the experimental comparison. genes, while Exogean did not complete the analysis for FHIT. Conclusions Acknowledgements We thank Marcello Varisco for the implementation of some parts of the In this work, we presented a new computational pipeline pipeline. This research was supported in part by FAR MIUR 60% grant - PIntron - for predicting the gene structure into exons “Algorithmic methods and combinatorial structures in Bioinformatics” (Univ. and introns from a cluster of transcript (EST, mRNA) di Milano-Bicocca) to YP, RR, GDV, and PB, grant “Dote ricerca applicata” 21_ARA (FSE, Regione Lombardia) to YP, and Ministero dell’Istruzione, sequences. PIntron combines two ideas: a novel algo- dell’Università e della Ricerca, Italy: Fondo Italiano Ricerca di Base, rithm of proved small time complexity for computing Pirola et al. BMC Bioinformatics 2012, 13(Suppl 5):S2 Page 12 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S2 “Laboratorio Internazionale di Bioinformatica” (LIBI), “Laboratorio di 15. Wu TD, Watanabe CK: GMAP: a genomic mapping and alignment Bioinformatica per la Biodiversità Molecolare” (DM19410), PRIN 2009; program for mRNA and EST sequence. Bioinformatics 2005, Progetto Strategico Regione Puglia PS 012; Progetto EPIGEN (CNR) to GP. 21(9):1859-1875. This article has been published as part of BMC Bioinformatics Volume 13 16. Gotoh O: A space-efficient and accurate method for mapping and Supplement 5, 2012: Selected articles from the First IEEE International aligning cDNA sequences onto genomic sequence. Nucleic Acids Research Conference on Computational Advances in Bio and medical Sciences 2008, 36(8):2630-2638. (ICCABS 2011): Bioinformatics. The full contents of the supplement are 17. Bonizzoni P, Della Vedova G, Dondi R, Pirola Y, Rizzi R: Minimum available online at http://www.biomedcentral.com/bmcbioinformatics/ factorization agreement of spliced ESTs. In Proc 9th International supplements/13/S5. Workshop on Algorithms in Bioinformatics (WABI), Volume 5724 of LNCS. Springer;Salzberg SL, Warnow T 2009:1-12[http://dx.doi.org/10.1007/978-3- Author details 642-04241-6_1]. Dipartimento di Informatica Sistemistica e Comunicazione, Univ. degli Studi 18. Bonizzoni P, Rizzi R, Pesole G: Computational methods for alternative di Milano-Bicocca, Milano, 20126, Italy. Centro Ricerche e Studi splicing prediction. Briefings in Functional Genomics and Proteomics 2006, Agroalimentari, Parco Tecnologico Padano, Lodi, 26900, Italy. Dipartimento 5(1):46-51. di Biochimica e Biologia Molecolare “E. Quagliariello”, Univ. degli Studi di 19. Bonizzoni P, Mauri G, Pesole G, Picardi E, Pirola Y, Rizzi R: Detecting Bari, Bari, 70126, Italy. Istituto di Biomembrane e Bioenergetica, Consiglio alternative gene structures from spliced ESTs: a computational Nazionale delle Ricerche, Bari, 70126, Italy. Dipartimento di Statistica, Univ. approach. Journal of Computational Biology 2009, 16(1):43-66. degli Studi di Milano-Bicocca, Milano, 20126, Italy. 20. Bonizzoni P, Rizzi R, Pesole G: ASPIC: a novel method to predict the exon- intron structure of a gene that is optimally compatible to a set of Authors’ contributions transcript sequences. BMC Bioinformatics 2005, 6:244. YP and RR designed the algorithm, developed the pipeline, designed and 21. Djebali S, Delaplace F, Crollius HR: Exogean: a framework for annotating helped to perform the experiments, and drafted the manuscript. EP helped protein-coding genes in eukaryotic genomic DNA. Genome Biology 2006, to design and to perform the experiments, and interpreted the results. GP 7(Suppl 1):S7. helped to design the experiments and supervised the interpretation of the 22. Gusfield D: Algorithms on Strings, Trees and Sequences: Computer Science and results. GDV helped to design the algorithm, to develop the pipeline, and to Computational Biology Cambridge: Cambridge University Press; 1997. draft the manuscript. PB designed the algorithm, helped to draft the 23. Burset M, Seledtsov I, Solovyev V: Analysis of canonical and non-canonical manuscript, and supervised the research. All authors read and approved the splice sites in mammalian genomes. Nucleic Acids Research 2000, final manuscript. 28(21):4364-4375. 24. Sheth N, Roca X, Hastings ML, Roeder T, Krainer AR, Sachidanandam R: Competing interests Comprehensive splice-site analysis using comparative genomics. Nucleic The authors declare that they have no competing interests. Acids Research 2006, 34(14):3955-3967. 25. Kent JJ: BLAT-the BLAST-like alignment tool. Genome Research 2002, Published: 12 April 2012 12(4):656-664. 26. Guigó R, Flicek P, Abril J, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E, Castelo R, Eyras E, Ucla C, Gingeras TR, References Harrow J, Hubbard T, Lewis SE, Reese MG: EGASP: the human ENCODE 1. Caceres J, Kornblihtt A: Alternative splicing: multiple control mechanisms Genome Annotation Assessment Project. Genome Biology 2006, 7(Suppl and involvement in human disease. Trends Genet 2002, 18(4):186-193. 1):S2. 2. Heber S, Alekseyev M, Sze SH, Tang H, Pevzner PA: Splicing graphs and 27. Burset M, Guigo R: Evaluation of Gene Structure Prediction Programs. EST assembly problem. Bioinformatics 2002, 18(Suppl 1):S181-S188. Genomics 1996, 34:353-357. 3. Leipzig J, Pevzner P, Heber S: The Alternative Splicing Gallery (ASG): 28. Altman DG, Bland JM: Statistics Notes: Diagnostic tests 1: sensitivity and bridging the gap between genome and transcriptome. Nucleic Acids specificity. BMJ 1994, 308(6943):1552. Research 2004, 32(13):3977-3983. 4. Xing Y, Resch A, Lee C: The multiassembly problem: reconstructing doi:10.1186/1471-2105-13-S5-S2 multiple transcript isoforms from EST fragment mixtures. Genome Cite this article as: Pirola et al.: PIntron: a fast method for detecting the Research 2004, 14(3):426-441. gene structure due to alternative splicing via maximal pairings of a 5. Kim N, Shin S, Lee S: ECgene: genome-based EST clustering and gene pattern and a text. BMC Bioinformatics 2012 13(Suppl 5):S2. modeling for alternative splicing. Genome Research 2005, 15(4):566-576. 6. Eyras E, Caccamo M, Curwen V, Clamp M: ESTGenes: alternative splicing from ESTs in Ensembl. Genome Research 2004, 14(5):976-987. 7. Castrignanò T, Rizzi R, Talamo IG, D’Onorio De Meo P, Anselmo A, Bonizzoni P, Pesole G: ASPIC: a web resource for alternative splicing prediction and transcript isoforms characterization. Nucleic Acids Research 2006, 34(Suppl 2):W440-W443. 8. Kan Z, Rouchka EC, Gish WR, States DJ: Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Research 2001, 11(5):889-900. 9. Gupta S, Zink D, Korn B, Vingron M, Haas S: Genome wide identification and classification of alternative splicing based on EST data. Bioinformatics Submit your next manuscript to BioMed Central 2004, 20(16):2579-2585. 10. De Bona F, Ossowski S, Schneeberger K, Rätsch G: Optimal spliced and take full advantage of: alignments of short sequence reads. Bioinformatics 2008, 24:i174-i180. 11. Trapnell C, Pachter L, Salzberg SL: TopHat: discovering splice junctions • Convenient online submission with RNA-Seq. Bioinformatics 2009, 25(9):1105-1111. • Thorough peer review 12. Bryant DW, Shen R, Priest HD, Wong WK, Mockler TC: Supersplat–spliced RNA-seq alignment. Bioinformatics 2010, 26(12):1500-1505. • No space constraints or color ﬁgure charges 13. Wang K, Singh D, Zeng Z, Coleman SJ, Huang Y, Savich GL, He X, • Immediate publication on acceptance Mieczkowski P, Grimm SA, Perou CM, MacLeod JN, Chiang DY, Prins JF, • Inclusion in PubMed, CAS, Scopus and Google Scholar Liu J: MapSplice: accurate mapping of RNA-seq reads for splice junction discovery. Nucleic Acids Research 2010, 38(18):e178. • Research which is freely available for redistribution 14. Slater G, Birney E: Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 2005, 6:31. Submit your manuscript at www.biomedcentral.com/submit

Journal

BMC Bioinformatics – Springer Journals

Published: Apr 12, 2012

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text

PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text

PIntron: a fast method for detecting the gene structure due to alternative splicing via maximal pairings of a pattern and a text

References (29)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies