Genome-wide detection of alternative splicing in expressed sequences of human genes

Nucleic Acids Research, 2001, Vol. 29, No. 13, 2850–2859
© 2001 Oxford University Press

Barmak Modrek, Alissa Resch, Catherine Grasso and Christopher Lee*

Department of Chemistry and Biochemistry, University of California, 611 Charles E. Young Drive East, Los Angeles, CA 90095-1570, USA

Received January 26, 2001; Revised and Accepted May 5, 2001

*To whom correspondence should be addressed. Tel: +1 310 825 7374; Fax: +1 310 267 0248; Email: [email protected]

ABSTRACT

We have identified 6201 alternative splice relationships in human genes, through a genome-wide analysis of expressed sequence tags (ESTs). Starting with ∼2.1 million human mRNA and EST sequences, we mapped expressed sequences onto the draft human genome sequence and only accepted splices that obeyed the standard splice site consensus. A large fraction (47%) of these were observed multiple times, indicating that they comprise a substantial fraction of the mRNA species. The vast majority of the detected alternative forms appear to be novel, and produce highly specific, biologically meaningful control of function in both known and novel human genes, e.g. specific removal of the lysosomal targeting signal from HLA-DM β chain, replacement of the C-terminal transmembrane domain and cytoplasmic tail in an FC receptor β chain homolog with a different transmembrane domain and cytoplasmic tail, likely modulating its signal transduction activity. Our data indicate that a large proportion of human genes, probably 42% or more, are alternatively spliced, but that this appears to be observed mainly in certain types of molecules (e.g. cell surface receptors) and systemic functions, particularly the immune system and nervous system. These results provide a comprehensive dataset for understanding the role of alternative splicing in the human genome, accessible at http://www.bioinformatics.ucla.edu/HASDB.

INTRODUCTION

Alternative splicing is an important mechanism for modulating gene function. It can change how a gene acts in different tissues and developmental states by generating distinct mRNA isoforms composed of different selections of exons. Alternative splicing has been implicated in many processes, including sex determination (1), apoptosis (2) and acoustic tuning in the ear (3). Recently, it has been suggested that if alternative splicing is widespread in the human genome, it could represent a relatively efficient expansion of the genome’s ‘vocabulary’ of variant genes, by producing multiple functional forms of many genes. Its functional implications can be simple, generating a single alternative form, or can produce remarkable diversity. In the Drosophila gene Dscam, combinatorial alternative splicing of ‘cassettes’ of exons reminiscent of the combinatorial generation of immunoglobulin diversity, produces thousands of distinct functional isoforms (4). This gene, homologous to the human gene for Down’s syndrome cell adhesion molecule (DSCAM), appears to be involved in neuronal guidance, where such diversity could be useful as a molecular ‘address’.

Alternative splicing has been studied intensively in hundreds of human genes (1,5), and it appears to be widespread, occurring in 5–30% of human genes (6,7) or perhaps as many as 35–40% (8,9). Recently, it has been reported that alternative splicing can be detected in expressed sequence tag (EST) sequencing (9) and has been analyzed in a collection of full-length mRNAs (8). Based upon estimates of the total number of human genes (10,11), it is likely that at least 10 000–20 000 human genes are alternatively spliced. However, currently only 899 alternatively spliced human genes are catalogued in the Alternative Splicing Database of Mammals (AsMamDB) (12).

We have performed a genome-wide analysis of alternative splicing based on human expressed sequence data, which greatly expands our knowledge of this central function in human molecular biology (Table 1). We have identified tens of thousands of splices, and thousands of alternative splices, in several thousand human genes. We have mapped all of these onto the draft human genome sequence, and verified that the putative splice junctions detected in the expressed sequences map onto genomic exon–intron junctions that match the known splice site consensus. Based on this genome-wide analysis of gene structure and alternative splicing, we have constructed a Human Alternative Splicing Database, at http://www.bioinformatics.ucla.edu/HASDB. In this paper we also show how our database can be used to study the impact of alternative splicing on protein function. We present an initial analysis of the patterns and functional role of alternative splicing across the human genome.

As we seek to show with examples in this paper, our database could be a useful resource to researchers who have found a new cDNA or human gene and wish to find additional information. It can help answer a wide range of questions, e.g. ‘Are the two bands on a western blot due to alternative splicing?’ or ‘Do the genes in protein family X all use alternative splicing as a mechanism to modulate function?’ The database integrates a variety of data for each gene, ranging from genomic map location to gene structure, with links to external resources such as GenBank, OMIM, SWISS-PROT etc. It provides a detailed alignment of the ESTs, mRNAs, genomic DNA and protein sequence, showing single nucleotide polymorphisms (SNPs) (13), exons and introns, splice site junctions, alternative splices and, most importantly, the raw experimental evidence for all of these features, including chromatogram traces from public EST sequencing projects.

Table 1. Alternative splicing in mRNA and EST sequence data

                                   All genes                    Genes with mRNA              On chromosome 22
                                   No. splices   No. clusters   No. splices   No. clusters   No. splices   No. clusters
Initial unigene clusters                         86 244                       16 240                       548
Mapped to draft genome (10/00)                   47 422 (55%)                 6603 (41%)                   421 (77%)
Detected splices                   39 862        8429 (18%)     30 495        4024 (61%)     1797          220 (52%)
Alternative splice relationships   6201          2272 (27%)     5009          1687 (42%)     313           94 (43%)
(with multiple evidence)           2892          1306           2505          1089           141           56

Tabulation of the number of splices and number of distinct gene clusters in which they were observed, from the total dataset (All genes), clusters containing partial and/or full-length mRNAs (Genes with mRNA) and a control set of 548 clusters that have been STS-mapped to chromosome 22 (On chromosome 22). Percentages are given for the fraction of gene clusters successfully mapped by our procedure to the October 2000 draft human genome sequence (14) (Mapped to draft genome 10/00); the fraction of mapped gene clusters observed to contain at least one splice (Detected splices); the fraction of gene clusters containing alternative splices, out of the total observed to contain at least one splice (Alternative splice relationships); and the subset of alternative splices that were observed in multiple ESTs or a human-verified mRNA, as opposed to being observed in just a single EST (with multiple evidence).

MATERIALS AND METHODS

Data sources

Our analysis is based on two major types of data: human genomic sequence assemblies and human EST sequences. Human genomic assembly sequences (accession no. NT_XXXX) and ‘draft’ BAC clone sequences [accession nos ACXXXX, ALXXXXX, etc. (14)] were downloaded from NCBI (ftp://ncbi.nlm.nih.gov/genome/seq and ftp://ncbi.nlm.nih.gov/genbank/gbhtgXXX.seq.gz). For partially sequenced clones, ‘draft’ fragments of 4 kb or longer continuous sequence were included in our analysis. All the work described in this paper is based on the October 2000 release of these data. Human EST sequences were downloaded from UNIGENE (ftp://ncbi.nlm.nih.gov/repository/UniGene). We used the EST clustering provided by UNIGENE, and did not perform our own re-clustering of the EST sequences. All the work described in this paper is based on the December 2000 release of UNIGENE.

Genomic mapping of expressed sequence clusters

Consensus sequences from our previous analysis of human expressed sequences (13) were searched against a database of human genomic assembly sequences using BLAST (15). We used consensus sequences to eliminate non-consensus features of each UNIGENE cluster, such as EST sequencing errors, chimeric ESTs or contamination by a minority of similar but paralogous sequences. The consensus sequence excludes these features, and should prevent them from affecting the genomic search. Our assembly and consensus analysis of UNIGENE was previously described as part of SNP discovery from human expressed sequences (13), and the consensus sequences are available at http://www.bioinformatics.ucla.edu/snp/. Briefly, after assembly, the maximum likelihood traversal of the EST–mRNA alignment is generated using dynamic programming, producing a consensus sequence that excludes minority features such as sequencing errors, sequence differences due to paralog contamination, unaligned ends and inserts due to chimeric sequences or unspliced introns.

The search for the genomic location of each UNIGENE cluster was performed in two stages. First, we identify the candidate gene regions in the genomic sequence for a given consensus using a BLAST threshold of $E < 10^{-50}$ and a nucleotide mismatch penalty of 11. Secondly, to check the candidate gene location, we searched for radiation hybrid mapping data for sequence tagged sites (STS) linked with this gene. Candidate regions that did not agree with the STS mapping information for the cluster were discarded. Thirdly, we identified the putative exons, by using a lower threshold ($E < 10^{-10}$) that will report shorter exons. The resulting BLAST hits must span the entire consensus, allowing only up to 100 bp of unmatched sequence at the consensus ends. Allowing BLAST this short unmatched region at the ends is necessary, since it may not identify very small exons reliably. Genomic candidates are assessed in order of ascending expectation value, until a candidate passes our second BLAST stage. The matching genomic region, plus 2 kb on either end to allow for short or fragmentary exons at the ends that BLAST may have missed, is aligned with the complete set of ESTs and mRNAs for the UNIGENE cluster using dynamic programming (16,17), truncating the gap extension penalty beyond 16 bp to allow for introns. The full EST and mRNA sequence must match the genomic sequence to be kept for the alternative splicing analysis. If an EST has 6 bp or more of insert relative to genomic, it is excluded. Using this procedure we mapped 47 422 of the 86 244 UNIGENE clusters onto the genomic sequence. Based on our analysis of chromosome 22, and comparison with NCBI’s Acembly gene mapping, we estimate a false negative rate for our mapping procedure of 20%, and an upper bound for its false positive rate of 3–5% (see Results).

Alternative splicing analysis

Splicing is detected by a computational procedure that analyzes the genomic–EST–mRNA multiple sequence alignment. Briefly, the gene structure is marked on the genomic sequence, based on its alignment with ESTs and mRNAs, by drawing a connection between each pair of genomic letters aligned to a pair of letters in an expressed sequence that are adjacent (i.e. nucleotide i and i+1). Thus, an exon is identified by a contiguous segment of connected letters, an intron by a contiguous segment of unconnected letters and a splice by a connection that jumps from one genomic letter to a distant genomic letter. Thus, a candidate splice is detected as a gap between two exons that match a single contiguous region of one or more ESTs. We report splices only for connections that skip >10 bp in the genomic sequence (representing an effective minimum intron length) to screen out sequencing error or alignment heterogeneity artefacts. Individual splice observations from different ESTs are joined together when their 5′ splice sites match within 6 bp in the genomic sequence, and their 3′ splice sites also match within 6 bp. This level of variation is permitted to screen out sequencing error and alignment artefacts that could give spurious alternative splices. All candidate splices were checked against the standard consensus splicing sequences, and all candidates with mismatches were discarded. It is possible that some of these mismatches may be viable deviations from the consensus sequence and represent real splices. However, we have excluded them from consideration in the results presented in this paper. This procedure was designed to be robust, and even in cases with a mis-assembled genomic sequence should not report spurious splices. Instead, genomic mis-assembly would likely cause mismatch with the ESTs, and complete exclusion of the cluster from our analysis. It should be noted that EST–mRNA versus genomic sequence alignments occasionally contain degenerate alignment positions, in which one or more nucleotides are identical in the genomic sequence on either side of a gap (intron). In this case our software checks each of the equivalent alignments to identify the correct splice junction.

Alternative splices are reported when two detected splices overlap in the genomic sequence (and thus are mutually exclusive events). One important consequence of this definition is that alternative splicing always requires positive evidence (i.e. strong match of EST to genomic) on both sides of each compared splice. An alternative splice will never be reported simply because one EST is longer or shorter than others, or even if vector sequence was attached at one end. [Vector sequences are screened out of UNIGENE (18) data. However, it is still important to note that heterogeneity at EST ends will not give rise to reported alternative splices.] All splices, alternative splices, individual splice observations in specific sequences, source library information, gene information, genomic mapping information, etc. are stored within a relational database (MySQL), and are accessible for query via the web (http://www.bioinformatics.ucla.edu/HASDB). To assess the fraction of alternative splices detected based on mRNA, EST versus mRNA or EST versus EST evidence (Fig. 5D), we used a database query to compute these numbers for all the alternative splices in our database.

Functional analysis of alternative splicing

We have performed extensive visual analysis and verification of our results, for hundreds of different genes. We used the GeneMine software system (19) to validate all aspects of the genomic mapping of our clusters, the exons, introns, splice sites, alternative splicing analysis and impact on protein structure and function, by thoroughly examining each of these features in the genomic–EST–mRNA multiple sequence alignments. The GeneMine software is freely available to academic researchers (http://www.bioinformatics.ucla.edu/genemine).

To characterize the functional impacts of alternative splicing, a random sample of 50 clusters with alternative splicing and at least one full-length mRNA was generated (Table 2). The mRNA requirement was imposed to ensure that the cluster would contain as complete a set of the gene’s exons as possible, to cover the full coding region and untranslated regions (UTRs). Without such coverage in many cases it is not even possible to define what the actual bounds of the coding region are, let alone get unbiased sampling of the coding region versus UTRs. To characterize the function of each gene product at both the cellular and systemic level required careful manual evaluation and study (i.e. not only sequence analysis but also digging into the available literature and information on the web). We did not feel that the twin objectives of accurate classification of the functional effects of alternative splicing and lack of selection bias could be provided reliably by electronic annotations at this time, although this is a very interesting area for further work. The effect of each alternative splice was evaluated manually, by careful examination of the complete alignment and available information using the GeneMine software. Most importantly, we considered all possible changes in the boundaries of the coding region (alternative initiation, alternative termination, truncation, extension, in-frame deletion and insertion). Since an alternative splice can change where the coding region starts and ends, it is incorrect to classify it as the UTR simply because it is upstream of the translation start site given by the GenBank annotation for the gene. We have adopted the policy that any alternative splice that alters the protein product will be classified as a ‘coding region’, regardless of its location relative to the GenBank CDS annotation. In the process, the alternative splices affecting the coding region were identified as changing the N-terminal, C-terminal or internal region of the protein.

Table 2. Random gene sample used for functional analysis

Cluster ID  Gene          Title
Hs.104519   PLD2          Phospholipase D2
Hs.84190    SLC19A1       Solute carrier family 19 (folate transport)
Hs.366      PTS           6-Pyruvoyltetrahydropterin synthase
Hs.43812    STX10         Syntaxin 10
Hs.6483     CXORF5        Chromosome X open reading frame 5
Hs.52166    LOC51275      Apoptosis-related protein PNAS-1
Hs.172894   BID           BH3 interacting domain death agonist
Hs.20887    FLJ10392      Hypothetical protein
Hs.26994    FLJ20477      Hypothetical protein
Hs.76873    HYAL2         Hyaluronoglucosaminidase 2
Hs.81337    LGALS9        Lectin, galactoside-binding, soluble, 9 (galectin 9)
Hs.198246   GC            Group-specific component (vitamin D binding protein)
Hs.155247   ALDOC         Aldolase C, fructose-bisphosphate
Hs.125139   FLJ11004      Hypothetical protein
Hs.89575    CD79B         CD79B antigen (immunoglobulin-associated β)
Hs.49427    LOC51291      Gem-interacting protein
Hs.7100     CL25084       Hypothetical protein
Hs.11042    LOC51248      Hypothetical protein
Hs.82359    TNFRSF6       Tumor necrosis factor receptor superfamily, member 6
Hs.2839     NDP           Norrie disease (pseudoglioma)
Hs.94498    LILRA2        Leukocyte immunoglobulin-like receptor, subfamily A (with TM domain), member 2
Hs.169294   TCF7          Transcription factor 7 (T-cell specific, HMG-box)
Hs.75562    DDR1          Discoidin domain receptor family, member 1
Hs.3657     KIAA0784      KIAA0784 protein
Hs.99855    FPRL1         Formyl peptide receptor-like 1
Hs.1252     APOH          Apolipoprotein H (β-2-glycoprotein I)
Hs.171595   HTATSF1       HIV TAT specific factor I
Hs.278522   PSG6          Pregnancy specific β-1-glycoprotein 6
Hs.55847    LOC51258      Hypothetical protein
Hs.76285    DKFZP564B167  DKFZP564B167 protein
Hs.89506    PAX6          Paired box gene 6 (aniridia, keratitis)
Hs.1334     MYB           v-myb avian myeloblastosis viral oncogene homolog
Hs.7768     FIBP          Fibroblast growth factor (acidic) intracellular binding protein
Hs.3280     CASP6         Caspase 6, apoptosis-related cysteine protease
Hs.6710     MPDU1         Mannose-P-dolichol utilization defect 1
Hs.78921    AKAP1         A kinase (PRKA) anchor protein 1
Hs.96038    RIT           Ric (Drosophila)-like
Hs.73851    ATP5J         ATP synthase, H+ transporting, mitos F0 complex, subunit F6
Hs.167031   DKFZP566D133  DKFZP566D133 protein
Hs.49767    NDUFS6        NADH dehydrogenase (ubq) Fe-S protein 6 (13 kDa) (NADH CoQ reductase)
Hs.151761   KIAA0100      KIAA0100 gene product
Hs.83937    FLJ20323      Hypothetical protein
Hs.1162     HLA-DMB       Major histocompatibility complex, class II, DM β
Hs.38044    DKFZP564M082  DKFZP564M082 protein
Hs.99526    OBP2B         Odorant-binding protein 2B
Hs.15159    HSPC224       Transmembrane proteolipid
Hs.69285    NRP1          Neuropilin 1
Hs.10028    CG1I          Putative cyclin G1 interacting protein
Hs.198272   NDUFB2        NADH dehydrogenase (ubq) 1 β subcomplex, 2 (8 kDa, AGGG)
Hs.75486    HSF4          Heat shock transcription factor 4

A random sample of 50 UNIGENE clusters containing at least one full-length mRNA was generated. The UNIGENE cluster ID, gene symbol and title are described.

RESULTS

Detection of alternative splicing

Our analysis of alternative splicing is based strictly on experimental data, not theoretical models. Rather than seeking to predict alternative splices, we directly detect them as large inserts in EST data from the publicly available dbEST (20) and UNIGENE (18) databases. We measure the evidence for a genuine alternative splice via a series of criteria (Fig. 1). First, a set of ESTs must match over their full lengths, on both sides of a putative alternative splice (allowing for sequence error). A large insert in the middle of such a perfect match is a candidate alternative splice. Unlike many other types of genomics results such as SNPs and variations in expression level, alternative splicing does not resemble common experimental noise (such as sequencing error).

Figure 1. Detection and validation of alternative splicing. (A) Types of evidence for alternative splicing (see text). (B) Types of alternative splicing detected in this study include exon skipping, alternative 5′ splice donor sites and alternative 3′ splice acceptor sites.

Next, the EST consensus sequence is mapped to the draft human genome sequence by homology search. Because human genes are broken into short exons, a genomic hit typically consists of many short matches. To be valid, these matches must be perfect (again allowing only for sequencing error), must all be in the same orientation (strand) and form a complete, correctly ordered walk through the EST consensus sequence. We require that each genomic–EST match region (putative exon) be bounded by consensus splice donor site and acceptor site sequences in the neighboring genomic (intron) sequence. Our results give an average internal exon size of 144 bp, with only 4% of internal exons >300 bp in length, similar to results obtained for known genes (21). Only 0.2% (79/39 862) of our introns were <60 bp, and the median intron length was 935 bp. The typical gene pattern of short internal exons ending in a single, long 3′ exon can usually be verified because 3′-end sequences are highly represented in the EST data, and because 3′ ESTs can be identified by their conspicuous poly(A) tails, which directly indicate the end of the 3′ exon.

To assess the accuracy of our gene mapping and exon/intron structure, we have compared against the completely independent data produced by NCBI’s Acembly, a human curated gene annotation effort (data downloaded from ftp://ncbi.nlm.nih.gov/genomes/H_sapiens). LocusLink provides an independent linkage between individual RefSeq genes and UNIGENE clusters (22). For genes mapped independently to the genomic sequence by RefSeq and our procedure, 97.3% mapped to the same genomic contig. Moreover, of those genes, 95% were mapped to the same nucleotides of the contig. While Acembly’s mapping should not be assumed to be perfect, this high level of agreement between independent efforts is encouraging. Our exon details (derived in our procedure from our splice detection) match the NCBI Acembly exons in 97% of cases at the 5′ splice site, and 96% at the 3′ splice site (overall, 94% of the exons were identical). Our splice details matched the NCBI Acembly introns in 93% of cases at the 5′-end, and 92% at the 3′-end (86% matched exactly at both ends). Because of alternative splicing, a 100% correspondence is not expected.

A candidate alternative splice insert (from the EST) must pass a series of tests. First, it must also be found in the genomic sequence, matching an exonic region in the genomic sequence whose boundaries correspond to known splice site sequences. Since these splice site sequences are mostly intronic, this provides an independent validation of the alternative splice. It should be emphasized that differences in where ESTs begin and end in a gene (e.g. a shorter EST might give the appearance of a truncated gene product) will never be interpreted as an alternative splice by our procedure. We focus exclusively on detecting splicing, i.e. a contiguous region of the transcript that has been removed during mRNA processing. Detecting a splice in an EST requires extensive matches to both upstream and downstream exons. Our analysis identified 39 862 splices in 8429 clusters. Our analysis only reports alternative splices, i.e. pairs of validated splices that are mutually exclusive. Thus unspliced introns or other genomic contaminants will never be reported, since they result in the absence of a splice, not the creation of a new, mutually exclusive splice. To call an alternative splice, our procedure requires a pair of splices that match exactly at one splice site, and differ at the second splice site. This procedure can detect exon skipping, alternative 5′ donor sites, and alternative 3′ acceptor sites (Fig. 1B). 6201 such alternative splice relationships were identified in 2272 clusters. These diverse forms of evidence produce strong log odds scores for each detected alternative splice. A detailed statistical analysis of this evidence will be presented elsewhere (D.Miller, J.Aten, C.Grasso, B.Modrek and C.Lee, manuscript in preparation).

As a typical validation example from our database, we illustrate the dystrophia myotonica protein kinase (DMPK) gene (Fig. 2), whose alternative splicing has previously been studied extensively. In DMPK, we identified three alternative splices in the EST data, all of which are verified by independent experimental results in the existing literature (23). Of the three alternative splices, one deletes the last 15 bp of exon 8, another skips exon 12 and exon 13, and the last deletes just 4 bp in exon 14. Figure 2 shows one of these alternative splice forms including junction and quality of match of the EST evidence versus the genomic sequence.

Figure 2. Alternative splicing of DMPK. (A) Gene structure for exons XII–XV of the DMPK gene, in contig NT000991 of chromosome 19. Two splice forms are shown, one observed in an mRNA (mRNA1) and one in an EST (EST1). (B) Example sequence evidence for the two splice forms. Sequence EST1 skips directly from exon XII to exon XV. We detected three alternative splice forms in DMPK; all are confirmed by the experimental literature (23).

Novel alternative splice forms of a known gene

Figure 3 shows several novel alternative splices detected in a well-studied gene, HLA-DM β. Eighty ESTs from UNIGENE cluster Hs.1162 align to form a consensus sequence, which in turn matches an ordered series of segments on one strand of chromosome 6. The EST sequences match the genomic sequence closely, consistent with sequencing error. The EST sequences mark out a long 3′ exon (359 bp) plus a series of five short exons, whose sizes (36–288 bp) match the range expected for internal exons. This matches the known gene location and structure for HLA-DM β (24,25). Eight splices are observed in these ESTs, where sequence matching one exonic region skips directly to a downstream exonic region as indicated in Figure 3A. The 16 putative exon boundaries implied by the ESTs map precisely to strong consensus splice acceptor and donor sites in the genomic sequence (Fig. 3C).

Figure 3. Alternative splicing of HLA-DMB. (A) Genomic structure of the HLA-DM β gene, in contig NT001520 of chromosome 6. Exons are shown as filled boxes, and the observed splices are shown on top of the genomic sequence. (B) The four alternative forms of HLA-DM β mRNA inferred from the expressed sequence data, colored to show the exons. The protein reading frame is indicated by an arrow beneath each form, showing the transmembrane domain (TM) and lysosomal targeting signal (LT). (C) The splice donor and acceptor sites for the eight putative splices observed in HLA-DM β. The primary consensus site sequences are highlighted in black and secondary consensus sequences (5) are marked in magenta.

Four different alternatively spliced forms of HLA-DM β are observed: splices 3+4+5 (including exons IV and V in the mRNA product); splices 6+5 (skipping exon IV); splices 3+7 (excluding exon V); splice 8 (skipping exons IV and V). Exons IV and V are 117 and 36 bp in length, and thus these alternative splices are all in-frame. The protein coding region begins in exon I and ends in exon VI, so these splices produce four different forms of the HLA-DM β chain that differ at their C-terminus.

Analysis of these forms reveals a remarkably simple and intriguing functional effect. HLA-DM is essential for the loading of class II MHC molecules with exogenous peptide antigens, a key step in antigen presentation and activation of the humoral immune response. This is thought to occur in early lysosomal compartments. HLA-DM is normally targeted to lysosomes, and its β chain contains a transmembrane domain anchoring its C-terminus (26,27). Exon IV is short, and corresponds precisely to the transmembrane domain. Exon V is very short, and encodes the lysosomal targeting signal YTPL, whose first residue begins at the start of the exon. Thus, the alternative splice regulates HLA-DM’s targeting to endosomal compartments (by including or excluding the YTPL signal), as well as its anchoring to the membrane. Given HLA-DM’s importance in antigen processing and presentation by class II MHC, this regulation is functionally interesting. Removing its targeting signal would likely redirect HLA-DM first to the plasma membrane, so that it would travel to lysosomes via endocytic pathways, altering the kinetics and conditions in which it first encounters class II MHC. It appears that the gene structure of the HLA-DM β gene has been carefully ‘designed’ to enable control of HLA-DM function, by pulling out both the transmembrane helix and the lysosomal targeting signal into separate short exons (IV, V) that can be alternatively spliced in-frame (exon VI supplies the last 4 amino acids of the protein, identical in all forms). The alternatively spliced forms were detected in uterus (two ESTs), placenta, lymph, stomach and colon. Despite the fact that HLA-DM is the subject of intense research, we have not been able to find any report of such alternative splicing in the published literature, and it is thought to be novel by an expert on HLA-DM (E.Mellins, personal communication).

Figure 4. Alternative splicing of Hs.11090, a putative FCε receptor β chain homolog. (A) Genomic structure of exons and splices, as in Figure 3. Potential polyadenylation sites important for the alternative gene forms are indicated. (B) Three alternative forms inferred from the expressed sequences. Predicted transmembrane domains (TM) are indicated (see text). (C) The corresponding protein forms, indicating topology across the membrane.

Scope of alternative splicing in human genes

Our genome-wide analysis detected thousands of alternative splices in the current, publicly available human genome data (Table 1). 6201 alternative splice relationships were detected in which two splices shared a common donor or acceptor site, but spliced to a different site on their other end (i.e. exon skipping, alternative 5′ splice donor site or alternative 3′ splice acceptor site; Fig. 1B). We found alternative splices in 27% of genes for which we had enough expressed sequence to cover more than a single exon. However, this estimate, based on analysis of all EST clusters, likely underestimates the real occurrence of alternative splicing, because the available EST data typically cover only a small part of the complete gene. To test this hypothesis, we analyzed the alternative splicing rate in genes for which mRNA sequence was available (representing all or part of the full gene). We detected one or more alternative splice forms in 42% of these genes, significantly higher than the rate observed in EST-only clusters. This is in close agreement with a previous study of mRNA-based expressed sequence clusters (8). Since fragmentation of the genomic sequence can also block complete coverage of a gene, we assessed the rate of alternative splice detection in genes mapped to chromosome 22. Of these, 43% contained alternative splices, including both mRNA and EST-only clusters.

The current EST data appears to be incomplete. Our procedure identified splices (i.e. multiple exons) in only 18% of the mapped EST clusters. However, for clusters that we mapped to chromosome 22 (full genomic) that also had an mRNA sequence, 88% contained at least one splice. A variety of factors such as the fragmentation of the draft human genome sequence, the large size of introns and the tendency of the ESTs to cluster at the 3′-end bias the current dataset against finding full-length genes, and probably underestimate the true level of alternative splicing. Moreover, since the current EST data for each gene represent only a subset of the tissues and cell types in which that gene is expressed, it is likely that the total occurrence of alternative splicing is much greater than what our analyses can detect. A large fraction of the EST alternative splice forms were observed multiple times (from different clones and different libraries), indicating that they constitute a relatively high fraction of total mRNA. Of our alternative splices, 2892 (47%) were observed in two or more EST sequences. These data represent a ‘high confidence’ subset of the detected alternative splices.

Our analysis indicates that the vast majority of our database represents novel findings (Fig. 5D). Only 13% of our alternative splices were detected in mRNA sequences from GenBank, which presumably have been thoroughly studied. The remaining 87% could be detected only with ESTs. Our procedure also detected large numbers of alternative splicing events in completely novel genes. Approximately 1200 alternative splices were detected in clusters containing ESTs only.

Figure 5. Analysis of a random sample of alternative spliced genes. (A) Fraction of the observed alternative splice in protein coding region, 5′ and 3′ UTR. (B) Fraction replacing the protein N-terminus, C-terminus or internal regions. (C) Fraction causing truncation of protein product due to frame-shift; extension of the protein product due to frame-shift; switch to a new initiator codon while preserving protein reading frame; switch to a new terminator codon from an alternative exon; in-frame deletion of codons, preserving reading frame; in-frame insertion of codons, preserving frame. (D) Origin of alternative splicing evidence: detected in mRNA (presumed not novel); detected in EST (by comparison with an mRNA); detected in EST (by comparison with other ESTs). (E) Type of alternative splice. (F) Categorization of alternatively spliced genes by systemic function (see text). (G) Categorization by gene product (see text).

Alternative splicing in a novel human gene

Figure 4 illustrates an example of alternative splice detection in a novel gene mapped in the human genome by our procedure. This gene has 33% identity to rat FCε receptor I β chain, and 25% identity to CD20, and has a pattern of four predicted transmembrane domains characteristic of both proteins. At least seven different forms are detectable, all of which affect the protein product. In a pattern strikingly reminiscent of HLA-DM β, the C-terminal transmembrane region and cytoplasmic tail of the major form (form 1) are placed on a single, short exon (exon VI), that can be included or excluded to create different forms. One particularly interesting form is created by ignoring the normal splice from exon V to exon VI, extending the coding region from exon Va for 142 bp (which we have designated exon Vb). A polyadenylation site is predicted at the end of this sequence, and the ESTs are observed to terminate in

alternative splices modified the protein product, whereas 22% were confined to the 5′ UTR versus just 4% in the 3′ UTR (Fig. 5A). This may simply reflect the larger fraction of exons in human genes that are protein-coding as opposed to UTR.
poly(A) at this point. This alternative termination replaces the This result fits expectations from molecular biology studies coding region of exons VI and VII with 40 amino acids (1), but disagrees strongly with a bioinformatics analysis of a encoded by exon Vb [terminated by a STOP codon 23 bp small set of ESTs (9), which reported 80% of genes with alter- before the poly(A) site]. Intriguingly, this replacement C-terminal native splicing had an alternative splice in 5′ UTR versus only sequence also contains a predicted transmembrane sequence, 20% in coding regions. Our observation of little alternative and thus neatly substitutes a new C-terminal transmembrane splicing in 3′ UTR is striking in view of the strong bias in the domain and cytoplasmic tail. The cytoplasmic tail in equivalent EST data towards the 3′ exon. One possible explanation is that FC receptor chains plays a key role in activating cytoplasmic mRNA species with alternatively spliced 3′ UTR sequence signal transduction molecules (28,29), so this alternative form could contain sequences that destabilize the mRNA, resulting likely modulates the signal transduction activity of this in fewer observations of these forms. In contrast, the effect on the receptor. This form is detected in placenta and kidney, while protein product is seen much more frequently at the C-terminal the majority form was detected in many different libraries. end (3′ in the mRNA) (Fig. 5B). We observed a tendency to replace the C-terminus (46%), as opposed to making an internal deletion, insertion or substitution (37%), or a replace- DISCUSSION ment of the N-terminus (17%). In this respect, the examples we Our results provide a comprehensive dataset for understanding have shown (HLA-DM β and FCε receptor I β homolog) are the role of alternative splicing in the human genome. First of representative. 
Alternative splicing appears to be strongly all, what is the function of alternative splicing—modification biased to preserve the protein coding frame (Fig. 5C). Only of the protein product, or of the untranslated regions that could 19% of alternative splices resulted in a truncation of the protein affect mRNA localization and stability? Analysis of a random product due to frame-shift; occasionally alternative splicing sample from our database (Table 2) indicates that 74% of was observed to add a new, extended C-terminal sequence 2858 Nucleic Acids Research, 2001, Vol. 29, No. 13 through frame-shift (6%). Alternative splicing resulted in a procedure to detect alternatively spliced forms that are known switch to a new AUG start site on an alternative exon in 15% in the literature. First, a given gene may not map yet to the draft of cases. In contrast, replacing the C-terminus by switching to genome, a prerequisite in our procedure for analyzing its a different exon containing an alternative STOP codon splicing. Secondly, some alternatively spliced mRNA forms occurred in 20% of cases. In-frame deletion or insertion of a are miscategorized as genomic DNA in GenBank, causing new sequence in the middle of the protein accounted for 29 and them to be excluded by our procedure. The former seems to be 11% of cases, respectively. the most important cause of failure. Despite >90% complete- In what types of molecules is alternative splicing commonly ness by total nucleotides sequenced, the draft genome used in observed? Figure 5G shows a molecular classification of a this study (October 2000) only enabled mapping of 55% of random sample of alternatively spliced genes. The most abun- UNIGENE expressed sequence clusters, because we require a dant category is cell surface functions/receptors (29%), which full-length match versus the expressed gene sequence includes membrane-anchored receptors (e.g. CD79B), integral consensus (Table 1). The draft (i.e. 
incomplete) BAC clone membrane proteins (e.g. folate transporter SLC19A1)and sequences which constituted the majority of this dataset, proteins involved in cell surface adhesion (e.g. lectin, consisted in large part of short sequence fragments (4–10 kb) hyaluronoglucosaminidase 2). In two related categories, an separated by unsequenced regions. Such fragments are too additional 14% of alternatively spliced genes encode secreted small to map a typical human gene (10–30 kb) by our conserv- proteins (e.g. Norrie disease protein; group-specific compo- ative procedure. This trend is even stronger for the subset of nent) and 9% encode signal transduction molecules (e.g. phos- genes that have full-length mRNAs. Of these clusters, only pholipase D2; RIT). The next two major categories are 41% could be mapped over their full length to an available transcriptional regulation (14%; e.g. MYB, PAX6) and apop- genomic contig. To check whether this is due to the draft tosis (11%; e.g. BID, PNAS-1). Together, these functions of genome’s fragmentation, we analyzed a subset of gene clusters transmission, reception and response to cellular signals that have been mapped by STS to chromosome 22, which has comprise >75% of the observed alternatively spliced genes. been almost completely sequenced. For these clusters, 77% Proteins involved in metabolism (e.g. aldolase C), and could be mapped. Thus, given unbroken genomic sequence, organelle-specific sorting proteins were also observed. This our mapping procedure has a false negative rate of ∼20%. sample is by no means comprehensive or exact, but indicates a These data suggest that completion of the human genome trend towards cell surface interactions and signaling. sequence, along with improvements in our algorithms, will at What types of systemic functions are most often affected by least double the number of alternative splices detected. Our alternative splicing? 
Twenty-nine percent of the alternatively detection of alternative splicing should also grow with spliced genes encoded functions specific to the immune system increasing EST data. In our current EST dataset (December (Fig. 5F; e.g. T-cell specific transcription factor 7, TNF 2000), splices were detected in only 18% of clusters, reflecting receptor superfamily member 6). In particular, alternative the fact that the average cluster consists of too few ESTs (one splicing of immune system cell surface receptors was very or two) and is too short (a few hundred base pairs) to cover prevalent. Neuronal functions (e.g. neuropilin, brain-specific more than a single exon. This is exaggerated by the strong bias aldolase C) comprised 12% of the total. The remaining genes of theESTs tobefromthe 3′-end, since 3′ exons tend to be possessed no clearly specific systemic function. These data much longer than typical internal exons. In contrast, in genes suggest alternative splicing may play a large role in immune for which a full-length or partial mRNA sequence was avail- system and nervous system functions which require precise able and which were mapped to a region of full-length genomic control of cellular differentiation and activation, to process sequence (e.g. chromosome 22), 88% contained at least one large amounts of information. Controlling how each cell splice (and typically many more). responds to a diverse array of signals can be achieved through alternative splicing of its receptors and signal transduction ACKNOWLEDGEMENTS molecules. How often is alternative splicing clearly associated with a We wish to thank D. Black, D. Miller, S. Galbraith and specific tissue? Based on a sample of 50 genes, ∼14% of alter- D. Eisenberg for their helpful discussions and comments on natively spliced genes in our dataset showed evidence of tissue this work, and K. Ke for assistance in constructing the HASDB specificity for the minor isoform. This estimate is based on a web site. 
This work was supported by Department of Energy conservative definition requiring that the minor isoform be grant DEFG0387ER60615 and a grant from the Searle observed multiple times in a specific tissue in which the major Scholars Program to C.L. B.M. is a predoctoral trainee form was not observed. Since in many known cases of tissue- supported by NSF IGERT Award #DGE-9987641. specific alternative splicing both minor and major forms are observed in the same tissue, this probably misses many cases REFERENCES of real tissue-specificity. Examples include DDR1, discoidin domain receptor, which has a minor form observed in muscle; 1. Lopez,A.J. (1998) Alternative splicing of pre-mRNA: developmental and CG1I, a putative cyclin G1 interacting protein, which has consequences and mechanisms of regulation. Annu.Rev.Genet., 32, isoforms observed specifically in ovary and brain. Within the 279–305. 2. Boise,L.H., Gonzalez-Garcia,M., Postema,C.E., Ding,L., Lindsten,T., small sample, tissue-specific minor isoforms were observed in Turka,L.A., Mao,X., Nunez,G. and Thompson,C.B. (1993) bcl-x, a bcl-2- novel, uncharacterized genes in brain, colon, testis and pros- related gene that functions as a dominant regulator of apoptotic cell death. tate. Cell, 74, 597–608. How comprehensive is our dataset, and what are its pros- 3. Fettiplace,R. and Fuchs,P.A. (1999) Mechanisms of hair cell tuning. pects for growth? We have noted two causes of failure by our Annu.Rev.Physiol., 61, 809–834. Nucleic Acids Research, 2001, Vol. 29, No. 13 2859 4. Schmucker,D., Clemens,J.C., Shu,H., Worby,C.A., Xiao,J., Muda,M., 17. Smith,T.F. and Waterman,M.S. (1981) Identification of common Dixon,J.E. and Zipursky,S.L. (2000) Drosophila Dscam is an axon molecular subsequences. J. Mol. Biol., 147, 195–197. guidance receptor exhibiting extraordinary molecular diversity. Cell, 101, 18. Schuler,G. (1997) Pieces of the puzzle: expressed sequence tags and the 671–684. catalog of human genes. J. Mol. 
Med., 75, 694–698. 19. Lee,C. and Irizarry,K. (2001) The GeneMine system for genome/ 5. Smith,C.W.J. and Valcarcel,J. (2000) Alternative pre-mRNA splicing: the proteome annotation and collaborative data-mining. IBM Syst. J., 40, logic of combinatorial control. Trends Biochem. Sci., 25, 381–388. in press. 6. Sharp,P.A. (1994) Split genes and RNA splicing. Cell, 77, 805–815. 20. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST—database 7. Sutcliffe,J.G. and Milner,R.J. (1988) Alternative mRNA splicing: the for ‘expressed sequence tags’. Nat. Genet., 4, 332–333. Shaker gene. Trends Genet., 4, 297–299. 21. Hawkins,J.D. (1988) A survey on intron and exon lengths. Nucleic Acids 8. Brett,D., Hanke,J., Lehmann,G., Haase,S., Delbruck,S., Krueger,S., Res., 16, 9893–9905. Reich,J. and Bork,P. (2000) EST comparison indicates 38% of human 22. Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI gene- mRNAs contain possible alternative splice forms. FEBS Lett., 474, 83–86. centered resources. Nucleic Acids Res., 29, 137–140. 9. Mironov,A.A., Fickett,J.W. and Gelfand,M.S. (1999) Frequent alternative 23. Groenen P.J., Wansink,D.G., Coerwinkel,M., van den Broek,W., splicing of human genes. Genome Res., 9, 1288–1293. Jansen,G. and Wieringa,B. (2000) Constitutive and regulated modes of 10. Liang,F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S.L. and splicing produce six major myotonic dystrophy protein kinase (DMPK) Quackenbush,J. (2000) Gene Index analysis of the human genome isoforms with distinct properties. Hum. Mol. Genet., 9, 605–616. estimates approximately 120,000 genes. Nat. Genet., 25, 239–240. 24. Kelly,A.P., Monaco,J.J., Cho,S.G. and Trowsdale,J. (1991) A new human 11. Ewing,B. and Green,P. (2000) Analysis of expressed sequence tags HLA class II-related locus, DM. Nature, 353, 571–573. indicates 35,000 human genes. Nat. Genet., 25, 232–234. 25. Shaman,J., von Scheven,E., Morris,P., Chang,M.Y. and Mellins,E. (1995) 12. 
Ji,H., Zhou,Q., Wen,F., Xia,H., Lu,X. and Li,Y. (2001) AsMamDB: an Analysis of HLA-DMB mutants and -DMB genomic structure. alternative splice database of mammals. Nucleic Acids Res., 29, 260–263. Immunogenetics, 41, 117–124. 13. Irizarry,K., Kustanovich,V., Li,C., Brown,N., Nelson,S., Wong,W. and 26. Sanderson,F., Kleijmeer,M.J., Kelly,A.P., Verwoerd,D., Tulp,A., Lee,C. (2000) Genome-wide analysis of single-nucleotide polymorphisms Neefjes,J., Geueze,H.J. and Trowsdale,J. (1994) Accumulation of HLA- in human expressed sequences. Nat. Genet., 26, 233–236. DM, a regulator of antigen presentation, in MHC class II compartments. 14. Jang,W., Chen,W.C., Sicotte,H. and Schuler,G.D. (1999) Making Science, 266, 1566–1569. effective use of human genomic sequence data. Trends Genet., 15, 27. Potter,P.K., Copier,J., Sacks,S.H., Calafat,J., Janssen,H., Neefjes,J. and 284–286. Kelly,A.P. (1999) Accurate intracellular localization of HLA-DM 15. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) requires correct spacing of a cytoplasmic YTPL targeting motif relative to Basic local alignment search tool. J. Mol. Biol., 215, 403–410. the transmembrane domain. Eur. J. Immunol., 29, 3936–3944. 16. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to 28. Daeron,M. (1997) Fc receptor biology. Annu. Rev. Immunol., 15, 203–234. the search for similarities in the amino acid sequence of two proteins. 29. Kinet,J.P. (1999) The high affinity IgE receptor (FCεRI): from physiology J. Mol. Biol., 48, 443–453. to pathology. Annu. Rev. Immunol., 17, 931–972. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press
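
The in-frame claim made in the Results for the four HLA-DM β forms rests on simple arithmetic: skipping an exon leaves the downstream reading frame intact only if the skipped length is a multiple of 3 bp. The following is a minimal sketch of that check, using only the exon lengths stated in the text (117 bp for exon IV, 36 bp for exon V). The helper name `preserves_frame` and the dictionary layout are hypothetical and are not part of the procedure described in this paper.

```python
# Illustrative only: a hypothetical helper, not the authors' software.

def preserves_frame(skipped_exon_lengths_bp):
    """Return True if removing exons of these lengths keeps the reading frame.

    Skipping is in-frame when the total number of removed bases is a
    multiple of 3 (a whole number of codons).
    """
    return sum(skipped_exon_lengths_bp) % 3 == 0


# Exon lengths reported in the text for HLA-DM beta.
EXON_BP = {"IV": 117, "V": 36}

# The four observed forms and the exons each one skips.
FORMS = {
    "splices 3+4+5 (exons IV and V included)": [],
    "splices 6+5 (exon IV skipped)": ["IV"],
    "splices 3+7 (exon V excluded)": ["V"],
    "splice 8 (exons IV and V skipped)": ["IV", "V"],
}

for name, skipped in FORMS.items():
    lengths = [EXON_BP[exon] for exon in skipped]
    status = "in-frame" if preserves_frame(lengths) else "frame-shift"
    print(f"{name}: {sum(lengths)} bp removed, {status}")
```

Because 117 and 36 are both multiples of 3, every combination of the two exons is reported as in-frame, consistent with the statement that the four forms differ only at the C-terminus.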

Publisher: Oxford University Press
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/29.13.2850

Random gene sample used for functional analysis Cluster ID Gene Title Hs.104519 PLD2 Phospholipase D2 Hs.84190 SLC19A1 Solute carrier family 19 (folate transport) Hs.366 PTS 6-Pyruvoyltetrahydropterin synthase Hs.43812 STX10 Syntaxin 10 Hs.6483 CXORF5 Chromosome X open reading frame 5 Hs.52166 LOC51275 Apoptosis-related protein PNAS-1 Hs.172894 BID BH3 interacting domain death agonist Hs.20887 FLJ10392 Hypothetical protein Hs.26994 FLJ20477 Hypothetical protein Hs.76873 HYAL2 Hyaluronoglucosaminidase 2 Hs.81337 LGALS9 Lectin, galactoside-binding, soluble, 9 (galectin 9) Hs.198246 GC Group-specific component (vitamin D binding protein) Hs.155247 ALDOC Aldolase C, fructose-bisphosphate Hs.125139 FLJ11004 Hypothetical protein Hs.89575 CD79B CD79B antigen (immunoglobulin-associated β) Hs.49427 LOC51291 Gem-interacting protein Hs.7100 CL25084 Hypothetical protein Hs.11042 LOC51248 Hypothetical protein Hs.82359 TNFRSF6 Tumor necrosis factor receptor superfamily, member 6 Hs.2839 NDP Norrie disease (pseudoglioma) Hs.94498 LILRA2 Leukocyte immunoglobulin-like receptor, subfamily A (with TM domain), member 2 Hs.169294 TCF7 Transcription factor 7 (T-cell specific, HMG-box) Hs.75562 DDR1 Discoidin domain receptor family, member 1 Hs.3657 KIAA0784 KIAA0784 protein Hs.99855 FPRL1 Formyl peptide receptor-like 1 Hs.1252 APOH Apolipoprotein H (β-2-glycoprotein I) Hs.171595 HTATSF1 HIV TAT specific factor I Hs.278522 PSG6 Pregnancy specific β-1-glycoprotein 6 Hs.55847 LOC51258 Hypothetical protein Hs.76285 DKFZP564B167 DKFZP564B167 protein Hs.89506 PAX6 Paired box gene 6 (aniridia, keratitis) Hs.1334 MYB v-myb avian myeloblastosis viral oncogene homolog Hs.7768 FIBP Fibroblast growth factor (acidic) intracellular binding protein Hs.3280 CASP6 Caspase 6, apoptosis-related cysteine protease Hs.6710 MPDU1 Mannose-P-dolichol utilization defect 1 Hs.78921 AKAP1 A kinase (PRKA) anchor protein 1 Hs.96038 RIT Ric (Drosophila)-like Hs.73851 ATP5J ATP synthase, H transporting, mitos F0 
complex, subunitF6 Hs.167031 DKFZP566D133 DKFZP566D133 protein Hs.49767 NDUFS6 NADH dehydrogenase (ubq) Fe-S protein 6 (13 kDa) (NADH CoQ reductase) Hs.151761 KIAA0100 KIAA0100 gene product Hs.83937 FLJ20323 Hypothetical protein Hs.1162 HLA-DMB Major histocompatibility complex, class II, DM β Hs.38044 DKFZP564M082 DKFZP564M082 protein Hs.99526 OBP2B Odorant-binding protein 2B Hs.15159 HSPC224 Transmembrane proteolipid Hs.69285 NRP1 Neuropilin 1 Hs.10028 CG1I Putative cyclin G1 interacting protein Hs.198272 NDUFB2 NADH dehydrogenase (ubq) 1 β subcomplex, 2(8 kDa, AGGG) Hs.75486 HSF4 Heat shock transcription factor 4 A random sample of 50 UNIGENE clusters containing at least one full-length mRNA was generated. The UNIGENE cluster ID, gene symbol and title are described. 2854 Nucleic Acids Research, 2001, Vol. 29, No. 13 Figure 1. Detection and validation of alternative splicing. (A) Types of evidence for alternative splicing (see text). (B) Types of alternative splicing detected in this study include exon skipping, alternative 5′ splice donor sites and alternative 3′ splice acceptor sites. (putative exon) be bounded by consensus splice donor site and A candidate alternative splice insert (from the EST) must acceptor site sequences in the neighboring genomic (intron) pass a series of tests. First, it must also be found in the genomic sequence, matching an exonic region in the genomic sequence sequence. Our results give an average internal exon size of whose boundaries correspond to known splice site sequences. 144 bp, with only 4% of internal exons >300 bp in length, Since these splice site sequences are mostly intronic, this similar to results obtained for known genes (21). Only 0.2% provides an independent validation of the alternative splice. It (79/39 862) of our introns were <60 bp, and the median intron should be emphasized that differences in where ESTs begin length was 935 bp. The typical gene pattern of short internal and end in a gene (e.g. 
a shorter EST might give the appearance exons ending in a single, long 3′ exon can usually be verified of a truncated gene product) will never be interpreted as an because 3′-end sequences are highly represented in the EST data, alternative splice by our procedure. We focus exclusively on and because 3′ ESTs can be identified by their conspicuous detecting splicing, i.e. a contiguous region of the transcript that poly(A) tails, which directly indicate the end of the 3′ exon. has been removed during mRNA processing. Detecting a To assess the accuracy of our gene mapping and exon/intron splice in an EST requires extensive matches to both upstream structure, we have compared against the completely inde- and downstream exons. Our analysis identified 39 862 splices pendent data produced by NCBI’s Acembly, a human curated in 8429 clusters. Our analysis only reports alternative splices, gene annotation effort (data downloaded from ftp:// i.e. pairs of validated splices that are mutually exclusive. Thus ncbi.nlm.nih.gov/genomes/H_sapiens). LocusLink provides unspliced introns or other genomic contaminants will never be an independent linkage between individual RefSeq genes and reported, since they result in the absence of a splice, not the UNIGENE clusters (22). For genes mapped independently to creation of a new, mutually exclusive splice. To call an alterna- the genomic sequence by RefSeq and our procedure, 97.3% tive splice, our procedure requires a pair of splices that match mapped to the same genomic contig. Moreover, of those genes, exactly at one splice site, and differ at the second splice site. 95% were mapped to the same nucleotides of the contig. While This procedure can detect exon skipping, alternative 5′ donor Acembly’s mapping should not be assumed to be perfect, this sites, and alternative 3′ acceptor sites (Fig. 1B). 6201 such high level of agreement between independent efforts is encour- alternative splice relationships were identified in 2272 clusters. 
aging. Our exon details (derived in our procedure from our These diverse forms of evidence produce strong log odds splice detection) match the NCBI Acembly exons in 97% of scores for each detected alternative splice. A detailed statistical ′ splice site, and 96% at the 3′ splice site (overall, cases at the 5 analysis of this evidence will be presented elsewhere 94% of the exons were identical). Our splice details matched (D.Miller, J.Aten, C.Grasso, B.Modrek and C.Lee, manuscript the NCBI Acembly introns in 93% of cases at the 5′-end, and in preparation). 92% at the 3′-end (86% matched exactly at both ends). As a typical validation example from our database, we illus- Because of alternative splicing, a 100% correspondence is not trate the dystrophia myotonica protein kinase (DMPK)gene expected. (Fig. 2), whose alternative splicing has previously been studied Nucleic Acids Research, 2001, Vol. 29, No. 13 2855 Figure 2. Alternative splicing of DMPK. (A) Gene structure for exons XII–XV Figure 3. Alternative splicing of HLA-DMB. (A)Genomic structure ofthe of the DMPK gene, in contig NT000991 of chromosome 19. Two splice forms HLA-DM β gene, in contig NT001520 of chromosome 6. Exons are shown as are shown, one observed in an mRNA (mRNA1) and one in an EST (EST1). filled boxes, and the observed splices are shown on top of the genomic (B) Example sequence evidence for the two splice forms. Sequence EST1 sequence. (B) The four alternative forms of HLA-DM β mRNA inferred from skips directly from exon XII to exon XV. We detected three alternative splice the expressed sequence data, colored to show the exons. The protein reading forms in DMPK; all are confirmed by the experimental literature (23). frame is indicated by an arrow beneath each form, showing the transmembrane domain (TM) and lysosomal targeting signal (LT). (C) The splice donor and acceptor sites for the eight putative splices observed in HLA-DM β. 
The pri- mary consensus site sequences are highlighted in black and secondary consen- sus sequences (5) are marked in magenta. extensively. In DMPK, we identified three alternative splices in the EST data, all of which are verified by independent exper- imental results in the existing literature (23). Of the three alter- native splices, one deletes the last 15 bp of exon 8, another Analysis of these forms reveals a remarkably simple and skips exon 12 and exon 13, and the last deletes just 4 bp in exon intriguing functional effect. HLA-DM is essential for the 14. Figure 2 shows one of these alternative splice forms loading of class II MHC molecules with exogenous peptide including junction and quality of match of the EST evidence antigens, a key step in antigen presentation and activation of versus the genomic sequence. the humoral immune response. This is thought to occur in early lysosomal compartments. HLA-DM is normally targeted to Novel alternative splice forms of a known gene lysosomes, and its β chain contains a transmembrane domain Figure 3 shows several novel alternative splices detected in a anchoring its C-terminus (26,27). Exon IV is short, and corre- well-studied gene, HLA-DM β. Eighty ESTs from UNIGENE sponds precisely to the transmembrane domain. Exon V is very cluster Hs.1162 align to form a consensus sequence, which in short, and encodes the lysosomal targeting signal YTPL, turn matches an ordered series of segments on one strand of whose first residue begins at the start of the exon. Thus, the chromosome 6. The EST sequences match the genomic alternative splice regulates HLA-DM’s targeting to endosomal sequence closely, consistent with sequencing error. The EST compartments (by including or excluding the YTPL signal), as sequences mark out a long 3′ exon (359 bp) plus a series of five well as its anchoring to the membrane. 
Given HLA-DM’s short exons, whose sizes (36–288 bp) match the range importance in antigen processing and presentation by class II expected for internal exons. This matches the known gene MHC, this regulation is functionally interesting. Removing its location and structure for HLA-DM β (24,25). Eight splices are targeting signal would likely redirect HLA-DM first to the observed in these ESTs, where sequence matching one exonic plasma membrane, so that it would travel to lysosomes via region skips directly to a downstream exonic region as indi- endocytic pathways, altering the kinetics and conditions in cated in Figure 3A. The 16 putative exon boundaries implied which it first encounters class II MHC. It appears that the gene by the ESTs map precisely to strong consensus splice acceptor structure of the HLA-DM β gene has been carefully ‘designed’ and donor sites in the genomic sequence (Fig. 3C). to enable control of HLA-DM function, by pulling out both the Four different alternatively spliced forms of HLA-DM β are transmembrane helix and the lysosomal targeting signal into observed: splices 3+4+5 (including exons IV and V in the separate short exons (IV, V) that can be alternatively spliced mRNA product); splices 6+5 (skipping exon IV); splices 3+7 in-frame (exon VI supplies the last 4 amino acids of the (excluding exon V); splice 8 (skipping exons IV and V). protein, identical in all forms). The alternatively spliced forms Exons IV and V are 117 and 36 bp in length, and thus these were detected in uterus (two ESTs), placenta, lymph, stomach alternative splices are all in-frame. The protein coding region and colon. Despite the fact that HLA-DM is the subject of begins in exon I and ends in exon VI, so these splices produce four different forms of the HLA-DM β chain that differ at their intense research, we have not been able to find any report of C-terminus. such alternative splicing in the published literature, and it is 2856 Nucleic Acids Research, 2001, Vol. 
29, No. 13 Figure 4. Alternative splicing of Hs.11090, a putative FCε receptor β chain homolog. (A) Genomic structure of exons and splices, as in Figure 3. Potential polyadenylation sites important for the alternative gene forms are indicated. (B) Three alternative forms inferred from the expressed sequences. Predicted transmembrane domains (TM) are indicated (see text). (C) The corresponding protein forms, indicating topology across the membrane. thought to be novel by an expert on HLA-DM (E.Mellins, ESTs to cluster at the 3′-end bias the current dataset against personal communication). finding full-length genes, and probably underestimate the true level of alternative splicing. Moreover, since the current EST Scope of alternative splicing in human genes data for each gene represent only a subset of the tissues and cell types in which that gene is expressed, it is likely that the total Our genome-wide analysis detected thousands of alternative occurrence of alternative splicing is much greater than what splices in the current, publicly available human genome data our analyses can detect. A large fraction of the EST alternative (Table 1). 6201 alternative splice relationships were detected splice forms were observed multiple times (from different in which two splices shared a common donor or acceptor site, clones and different libraries), indicating that they constitute a but spliced to a different site on their other end (i.e. exon skip- relatively high fraction of total mRNA. Of our alternative ping, alternative 5′ splice donor site or alternative 3′ splice splices, 2892 (47%) were observed in two or more EST acceptor site; Fig. 1B). We found alternative splices in 27% of sequences. These data represent a ‘high confidence’ subset of genes for which we had enough expressed sequence to cover the detected alternative splices. more than a single exon. 
However, this estimate, based on Our analysis indicates that the vast majority of our database analysis of all EST clusters, likely underestimates the real represents novel findings (Fig. 5D). Only 13% of our alterna- occurrence of alternative splicing, because the available EST tive splices were detected in mRNA sequences from GenBank, data typically cover only a small part of the complete gene. To which presumably have been thoroughly studied. The test this hypothesis, we analyzed the alternative splicing rate in remaining 87% could be detected only with ESTs. Our proce- genes for which mRNA sequence was available (representing dure also detected large numbers of alternative splicing events all or part of the full gene). We detected one or more alterna- tive splice forms in 42% of these genes, significantly higher in completely novel genes. Approximately 1200 alternative than the rate observed in EST-only clusters. This is in close splices were detected in clusters containing ESTs only. agreement with a previous study of mRNA-based expressed Alternative splicing in a novel human gene sequence clusters (8). Since fragmentation of the genomic sequence can also block complete coverage of a gene, we Figure 4 illustrates an example of alternative splice detection assessed the rate of alternative splice detection in genes in a novel gene mapped in the human genome by our proce- mapped to chromosome 22. Of these, 43% contained alterna- dure. This gene has 33% identity to rat FCε receptor I β chain, tive splices, including both mRNA and EST-only clusters. and 25% identity to CD20, and has a pattern of four predicted The current EST data appears to be incomplete. Our proce- transmembrane domains characteristic of both proteins. At dure identified splices (i.e. multiple exons) in only 18% of the least seven different forms are detectable, all of which affect mapped EST clusters. However, for clusters that we mapped to the protein product. 
In a pattern strikingly reminiscent of HLA- chromosome 22 (full genomic) that also had an mRNA DM β, the C-terminal transmembrane region and cytoplasmic sequence, 88% contained at least one splice. A variety of tail of the major form (form 1) are placed on a single, short factors such as the fragmentation of the draft human genome exon (exon VI), that can be included or excluded to create sequence, the large size of introns and the tendency of the different forms. One particularly interesting form is created by Nucleic Acids Research, 2001, Vol. 29, No. 13 2857 Figure 5. Analysis of a random sample of alternative spliced genes. (A) Fraction of the observed alternative splice in protein coding region, 5′ and 3′ UTR. (B) Fraction replacing the protein N-terminus, C-terminus or internal regions. (C) Fraction causing truncation of protein product due to frame-shift; extension of the protein product due to frame-shift; switch to a new initiator codon while preserving protein reading frame; switch to a new terminator codon from an alternative exon; in-frame deletion of codons, preserving reading frame; in-frame insertion of codons, preserving frame. (D) Origin of alternative splicing evidence: detected in mRNA (presumednot novel); detectedinEST (bycomparisonwithanmRNA); detectedinEST (bycomparisonwith otherESTs). (E) Type of alternative splice. (F) Categorization of alternatively spliced genes by systemic function (see text). (G) Categorization by gene product (see text). ignoring the normal splice from exon V to exon VI, extending alternative splices modified the protein product, whereas 22% the coding region from exon Va for 142 bp (which we have were confined to the 5′ UTR versus just 4% in the 3′ UTR designated exon Vb). A polyadenylation site is predicted at the (Fig. 5A). This may simply reflect the larger fraction of exons end of this sequence, and the ESTs are observed to terminate in in human genes that are protein-coding as opposed to UTR. 
poly(A) at this point. This alternative termination replaces the This result fits expectations from molecular biology studies coding region of exons VI and VII with 40 amino acids (1), but disagrees strongly with a bioinformatics analysis of a encoded by exon Vb [terminated by a STOP codon 23 bp small set of ESTs (9), which reported 80% of genes with alter- before the poly(A) site]. Intriguingly, this replacement C-terminal native splicing had an alternative splice in 5′ UTR versus only sequence also contains a predicted transmembrane sequence, 20% in coding regions. Our observation of little alternative and thus neatly substitutes a new C-terminal transmembrane splicing in 3′ UTR is striking in view of the strong bias in the domain and cytoplasmic tail. The cytoplasmic tail in equivalent EST data towards the 3′ exon. One possible explanation is that FC receptor chains plays a key role in activating cytoplasmic mRNA species with alternatively spliced 3′ UTR sequence signal transduction molecules (28,29), so this alternative form could contain sequences that destabilize the mRNA, resulting likely modulates the signal transduction activity of this in fewer observations of these forms. In contrast, the effect on the receptor. This form is detected in placenta and kidney, while protein product is seen much more frequently at the C-terminal the majority form was detected in many different libraries. end (3′ in the mRNA) (Fig. 5B). We observed a tendency to replace the C-terminus (46%), as opposed to making an internal deletion, insertion or substitution (37%), or a replace- DISCUSSION ment of the N-terminus (17%). In this respect, the examples we Our results provide a comprehensive dataset for understanding have shown (HLA-DM β and FCε receptor I β homolog) are the role of alternative splicing in the human genome. First of representative. 
Alternative splicing appears to be strongly all, what is the function of alternative splicing—modification biased to preserve the protein coding frame (Fig. 5C). Only of the protein product, or of the untranslated regions that could 19% of alternative splices resulted in a truncation of the protein affect mRNA localization and stability? Analysis of a random product due to frame-shift; occasionally alternative splicing sample from our database (Table 2) indicates that 74% of was observed to add a new, extended C-terminal sequence 2858 Nucleic Acids Research, 2001, Vol. 29, No. 13 through frame-shift (6%). Alternative splicing resulted in a procedure to detect alternatively spliced forms that are known switch to a new AUG start site on an alternative exon in 15% in the literature. First, a given gene may not map yet to the draft of cases. In contrast, replacing the C-terminus by switching to genome, a prerequisite in our procedure for analyzing its a different exon containing an alternative STOP codon splicing. Secondly, some alternatively spliced mRNA forms occurred in 20% of cases. In-frame deletion or insertion of a are miscategorized as genomic DNA in GenBank, causing new sequence in the middle of the protein accounted for 29 and them to be excluded by our procedure. The former seems to be 11% of cases, respectively. the most important cause of failure. Despite >90% complete- In what types of molecules is alternative splicing commonly ness by total nucleotides sequenced, the draft genome used in observed? Figure 5G shows a molecular classification of a this study (October 2000) only enabled mapping of 55% of random sample of alternatively spliced genes. The most abun- UNIGENE expressed sequence clusters, because we require a dant category is cell surface functions/receptors (29%), which full-length match versus the expressed gene sequence includes membrane-anchored receptors (e.g. CD79B), integral consensus (Table 1). The draft (i.e. 
incomplete) BAC clone membrane proteins (e.g. folate transporter SLC19A1)and sequences which constituted the majority of this dataset, proteins involved in cell surface adhesion (e.g. lectin, consisted in large part of short sequence fragments (4–10 kb) hyaluronoglucosaminidase 2). In two related categories, an separated by unsequenced regions. Such fragments are too additional 14% of alternatively spliced genes encode secreted small to map a typical human gene (10–30 kb) by our conserv- proteins (e.g. Norrie disease protein; group-specific compo- ative procedure. This trend is even stronger for the subset of nent) and 9% encode signal transduction molecules (e.g. phos- genes that have full-length mRNAs. Of these clusters, only pholipase D2; RIT). The next two major categories are 41% could be mapped over their full length to an available transcriptional regulation (14%; e.g. MYB, PAX6) and apop- genomic contig. To check whether this is due to the draft tosis (11%; e.g. BID, PNAS-1). Together, these functions of genome’s fragmentation, we analyzed a subset of gene clusters transmission, reception and response to cellular signals that have been mapped by STS to chromosome 22, which has comprise >75% of the observed alternatively spliced genes. been almost completely sequenced. For these clusters, 77% Proteins involved in metabolism (e.g. aldolase C), and could be mapped. Thus, given unbroken genomic sequence, organelle-specific sorting proteins were also observed. This our mapping procedure has a false negative rate of ∼20%. sample is by no means comprehensive or exact, but indicates a These data suggest that completion of the human genome trend towards cell surface interactions and signaling. sequence, along with improvements in our algorithms, will at What types of systemic functions are most often affected by least double the number of alternative splices detected. Our alternative splicing? 
Twenty-nine percent of the alternatively detection of alternative splicing should also grow with spliced genes encoded functions specific to the immune system increasing EST data. In our current EST dataset (December (Fig. 5F; e.g. T-cell specific transcription factor 7, TNF 2000), splices were detected in only 18% of clusters, reflecting receptor superfamily member 6). In particular, alternative the fact that the average cluster consists of too few ESTs (one splicing of immune system cell surface receptors was very or two) and is too short (a few hundred base pairs) to cover prevalent. Neuronal functions (e.g. neuropilin, brain-specific more than a single exon. This is exaggerated by the strong bias aldolase C) comprised 12% of the total. The remaining genes of theESTs tobefromthe 3′-end, since 3′ exons tend to be possessed no clearly specific systemic function. These data much longer than typical internal exons. In contrast, in genes suggest alternative splicing may play a large role in immune for which a full-length or partial mRNA sequence was avail- system and nervous system functions which require precise able and which were mapped to a region of full-length genomic control of cellular differentiation and activation, to process sequence (e.g. chromosome 22), 88% contained at least one large amounts of information. Controlling how each cell splice (and typically many more). responds to a diverse array of signals can be achieved through alternative splicing of its receptors and signal transduction ACKNOWLEDGEMENTS molecules. How often is alternative splicing clearly associated with a We wish to thank D. Black, D. Miller, S. Galbraith and specific tissue? Based on a sample of 50 genes, ∼14% of alter- D. Eisenberg for their helpful discussions and comments on natively spliced genes in our dataset showed evidence of tissue this work, and K. Ke for assistance in constructing the HASDB specificity for the minor isoform. This estimate is based on a web site. 
This work was supported by Department of Energy conservative definition requiring that the minor isoform be grant DEFG0387ER60615 and a grant from the Searle observed multiple times in a specific tissue in which the major Scholars Program to C.L. B.M. is a predoctoral trainee form was not observed. Since in many known cases of tissue- supported by NSF IGERT Award #DGE-9987641. specific alternative splicing both minor and major forms are observed in the same tissue, this probably misses many cases REFERENCES of real tissue-specificity. Examples include DDR1, discoidin domain receptor, which has a minor form observed in muscle; 1. Lopez,A.J. (1998) Alternative splicing of pre-mRNA: developmental and CG1I, a putative cyclin G1 interacting protein, which has consequences and mechanisms of regulation. Annu.Rev.Genet., 32, isoforms observed specifically in ovary and brain. Within the 279–305. 2. Boise,L.H., Gonzalez-Garcia,M., Postema,C.E., Ding,L., Lindsten,T., small sample, tissue-specific minor isoforms were observed in Turka,L.A., Mao,X., Nunez,G. and Thompson,C.B. (1993) bcl-x, a bcl-2- novel, uncharacterized genes in brain, colon, testis and pros- related gene that functions as a dominant regulator of apoptotic cell death. tate. Cell, 74, 597–608. How comprehensive is our dataset, and what are its pros- 3. Fettiplace,R. and Fuchs,P.A. (1999) Mechanisms of hair cell tuning. pects for growth? We have noted two causes of failure by our Annu.Rev.Physiol., 61, 809–834. Nucleic Acids Research, 2001, Vol. 29, No. 13 2859 4. Schmucker,D., Clemens,J.C., Shu,H., Worby,C.A., Xiao,J., Muda,M., 17. Smith,T.F. and Waterman,M.S. (1981) Identification of common Dixon,J.E. and Zipursky,S.L. (2000) Drosophila Dscam is an axon molecular subsequences. J. Mol. Biol., 147, 195–197. guidance receptor exhibiting extraordinary molecular diversity. Cell, 101, 18. Schuler,G. (1997) Pieces of the puzzle: expressed sequence tags and the 671–684. catalog of human genes. J. Mol. 
Med., 75, 694–698. 19. Lee,C. and Irizarry,K. (2001) The GeneMine system for genome/ 5. Smith,C.W.J. and Valcarcel,J. (2000) Alternative pre-mRNA splicing: the proteome annotation and collaborative data-mining. IBM Syst. J., 40, logic of combinatorial control. Trends Biochem. Sci., 25, 381–388. in press. 6. Sharp,P.A. (1994) Split genes and RNA splicing. Cell, 77, 805–815. 20. Boguski,M.S., Lowe,T.M. and Tolstoshev,C.M. (1993) dbEST—database 7. Sutcliffe,J.G. and Milner,R.J. (1988) Alternative mRNA splicing: the for ‘expressed sequence tags’. Nat. Genet., 4, 332–333. Shaker gene. Trends Genet., 4, 297–299. 21. Hawkins,J.D. (1988) A survey on intron and exon lengths. Nucleic Acids 8. Brett,D., Hanke,J., Lehmann,G., Haase,S., Delbruck,S., Krueger,S., Res., 16, 9893–9905. Reich,J. and Bork,P. (2000) EST comparison indicates 38% of human 22. Pruitt,K.D. and Maglott,D.R. (2001) RefSeq and LocusLink: NCBI gene- mRNAs contain possible alternative splice forms. FEBS Lett., 474, 83–86. centered resources. Nucleic Acids Res., 29, 137–140. 9. Mironov,A.A., Fickett,J.W. and Gelfand,M.S. (1999) Frequent alternative 23. Groenen P.J., Wansink,D.G., Coerwinkel,M., van den Broek,W., splicing of human genes. Genome Res., 9, 1288–1293. Jansen,G. and Wieringa,B. (2000) Constitutive and regulated modes of 10. Liang,F., Holt,I., Pertea,G., Karamycheva,S., Salzberg,S.L. and splicing produce six major myotonic dystrophy protein kinase (DMPK) Quackenbush,J. (2000) Gene Index analysis of the human genome isoforms with distinct properties. Hum. Mol. Genet., 9, 605–616. estimates approximately 120,000 genes. Nat. Genet., 25, 239–240. 24. Kelly,A.P., Monaco,J.J., Cho,S.G. and Trowsdale,J. (1991) A new human 11. Ewing,B. and Green,P. (2000) Analysis of expressed sequence tags HLA class II-related locus, DM. Nature, 353, 571–573. indicates 35,000 human genes. Nat. Genet., 25, 232–234. 25. Shaman,J., von Scheven,E., Morris,P., Chang,M.Y. and Mellins,E. (1995) 12. 
Ji,H., Zhou,Q., Wen,F., Xia,H., Lu,X. and Li,Y. (2001) AsMamDB: an Analysis of HLA-DMB mutants and -DMB genomic structure. alternative splice database of mammals. Nucleic Acids Res., 29, 260–263. Immunogenetics, 41, 117–124. 13. Irizarry,K., Kustanovich,V., Li,C., Brown,N., Nelson,S., Wong,W. and 26. Sanderson,F., Kleijmeer,M.J., Kelly,A.P., Verwoerd,D., Tulp,A., Lee,C. (2000) Genome-wide analysis of single-nucleotide polymorphisms Neefjes,J., Geueze,H.J. and Trowsdale,J. (1994) Accumulation of HLA- in human expressed sequences. Nat. Genet., 26, 233–236. DM, a regulator of antigen presentation, in MHC class II compartments. 14. Jang,W., Chen,W.C., Sicotte,H. and Schuler,G.D. (1999) Making Science, 266, 1566–1569. effective use of human genomic sequence data. Trends Genet., 15, 27. Potter,P.K., Copier,J., Sacks,S.H., Calafat,J., Janssen,H., Neefjes,J. and 284–286. Kelly,A.P. (1999) Accurate intracellular localization of HLA-DM 15. Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) requires correct spacing of a cytoplasmic YTPL targeting motif relative to Basic local alignment search tool. J. Mol. Biol., 215, 403–410. the transmembrane domain. Eur. J. Immunol., 29, 3936–3944. 16. Needleman,S.B. and Wunsch,C.D. (1970) A general method applicable to 28. Daeron,M. (1997) Fc receptor biology. Annu. Rev. Immunol., 15, 203–234. the search for similarities in the amino acid sequence of two proteins. 29. Kinet,J.P. (1999) The high affinity IgE receptor (FCεRI): from physiology J. Mol. Biol., 48, 443–453. to pathology. Annu. Rev. Immunol., 17, 931–972.
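The splice-calling rules stated in the text (a splice is a connection that skips >10 bp of genomic sequence; observations from different ESTs are joined when both their 5′ and 3′ sites agree within 6 bp; an alternative splice is a pair of splices that share exactly one site) can be illustrated with a minimal Python sketch. This is not the authors' software. The `Splice` class, the function names, the representation of an alignment as inclusive genomic (start, end) blocks in transcript order, and the greedy merging strategy are all assumptions made for illustration, and the consensus splice-site check is omitted.

```python
from dataclasses import dataclass
from itertools import combinations

MIN_SKIP_BP = 10        # from the text: report only connections that skip >10 bp
SITE_TOLERANCE_BP = 6   # from the text: sites "match within 6 bp"


@dataclass(frozen=True)
class Splice:
    """Hypothetical record: genomic coordinates of one splice."""
    donor: int     # last genomic position of the upstream exon (5' splice site)
    acceptor: int  # first genomic position of the downstream exon (3' splice site)


def splices_from_blocks(blocks):
    """Call splices for one EST from its aligned genomic blocks.

    `blocks` is an assumed input format: inclusive (start, end) genomic
    intervals, in transcript order, for segments adjacent in the EST.
    """
    found = []
    for (_, up_end), (down_start, _) in zip(blocks, blocks[1:]):
        skipped = down_start - up_end - 1  # genomic letters jumped over
        if skipped > MIN_SKIP_BP:
            found.append(Splice(up_end, down_start))
    return found


def merge_observations(observations):
    """Greedily join observations whose donor AND acceptor both lie within
    the tolerance of a group's first member; returns (representative, count).
    The greedy strategy is an assumption, not the published algorithm."""
    groups = []
    for obs in sorted(observations, key=lambda s: (s.donor, s.acceptor)):
        for i, (rep, count) in enumerate(groups):
            if (abs(rep.donor - obs.donor) <= SITE_TOLERANCE_BP
                    and abs(rep.acceptor - obs.acceptor) <= SITE_TOLERANCE_BP):
                groups[i] = (rep, count + 1)
                break
        else:
            groups.append((obs, 1))
    return groups


def alternative_pairs(splices):
    """Pair splices that match exactly at one site and differ at the other."""
    pairs = []
    for a, b in combinations(splices, 2):
        same_donor = a.donor == b.donor
        same_acceptor = a.acceptor == b.acceptor
        if same_donor != same_acceptor:  # exactly one site is shared
            kind = "shared donor" if same_donor else "shared acceptor"
            pairs.append((a, b, kind))
    return pairs


if __name__ == "__main__":
    # Invented coordinates: est2 skips the middle exon seen in est1 and est3.
    ests = [
        [(100, 200), (501, 600), (901, 1000)],
        [(150, 200), (901, 980)],
        [(120, 202), (503, 590)],
    ]
    observations = [s for blocks in ests for s in splices_from_blocks(blocks)]
    representatives = [rep for rep, _ in merge_observations(observations)]
    for a, b, kind in alternative_pairs(representatives):
        print(kind, a, b)
```

In this toy input an exon-skipping event shows up as two relationships, because the skipping splice shares its donor with one flanking splice and its acceptor with the other. The greedy merge is order-dependent, so a production pipeline would likely need a more careful clustering step.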

Journal

Nucleic Acids Research, Oxford University Press

Published: Jul 1, 2001
