Novel porcine repetitive elements

Ralph Wiedmann; Dan Nonneman; John Keele

doi:10.1186/1471-2164-7-304

Novel porcine repetitive elements

Wiedmann, Ralph; Nonneman, Dan; Keele, John 2006-12-01 00:00:00 Background: Repetitive elements comprise ~45% of mammalian genomes and are increasingly known to impact genomic function by contributing to the genomic architecture, by direct regulation of gene expression and by affecting genomic size, diversity and evolution. The ubiquity and increasingly understood importance of repetitive elements contribute to the need to identify and annotate them. We set out to identify previously uncharacterized repetitive DNA in the porcine genome. Once found, we characterized the prevalence of these repeats in other mammals. Results: We discovered 27 repetitive elements in 220 BACs covering 1% of the porcine genome (Comparative Vertebrate Sequencing Initiative; CVSI). These repeats varied in length from 55 to 1059 nucleotides. To estimate copy numbers, we went to an independent source of data, the BAC- end sequences (Wellcome Trust Sanger Institute), covering approximately 15% of the porcine genome. Copy numbers in BAC-ends were less than one hundred for 6 repeat elements, between 100 and 1000 for 16 and between 1,000 and 10,000 for 5. Several of the repeat elements were found in the bovine genome and we have identified two with orthologous sites, indicating that these elements were present in their common ancestor. None of the repeat elements were found in primate, rodent or dog genomes. We were unable to identify any of the replication machinery common to active transposable elements in these newly identified repeats. Conclusion: The presence of both orthologous and non-orthologous sites indicates that some sites existed prior to speciation and some were generated later. The identification of low to moderate copy number repetitive DNA that is specific to artiodactyls will be critical in the assembly of livestock genomes and studies of comparative genomics. tate REs [9]. In recent years, several attempts have been Background Repetitive elements comprise ~45% [1] of mammalian made to automate the process of de novo identification genomes and are increasingly known to impact genomic and characterization of REs [10-16]. The algorithms take function by contributing to the genomic architecture, by into account the likely evolutionary history of the REs – direct regulation of gene expression [2,3] and by affecting not only genetic drift, but also the processes that lead to genomic size, diversity and evolution [4-8]. The ubiquity the juxtaposition of REs [10]. Because knowing the evolu- and increasingly understood importance of repetitive ele- tionary history of each RE helps to define the type of RE, ments (REs) contribute to the need to identify and anno- these algorithms are valuable not only in identifying Page 1 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 repetitive sequence, but also in increasing our under- bers (across the 220 discovery BACs) ranged from 12 to standing of the evolutionary role of the identified RE. Our 1102. initial attempt was to identify novel repetitive DNA with a program called RECON [10], which produced 14,067 Steps 5 – 6 families of REs with 249 of those having count numbers The 60 original MPREs were consolidated into 31 because of 10 or more. We decided a different approach was of overlap or co-localization at multiple sites. Twenty- needed that would organize closely related elements in a nine MPREs were absorbed into 31; the 31 original MPRE parsimonious way. In this paper, we describe 27 novel identifiers of the longer sequences were kept to maintain porcine repetitive elements and estimate their prevalence provenance. In addition, there were three combinations in swine and other species. (MPRE20 and 57; MPRE15, 17, 19 and 26; MPRE44, 50 and 52) of repeats that frequently appeared together in the same order with some variation in their relative spacing. Results We identified repetitive elements using a procedure simi- The most consistent group contained two elements – lar to previously published methods [10,11]. First, we MPRE20 in reverse complement followed by a small gap, used RepeatMasker [17] on the BAC sequences to mask then MPRE57. All thirteen times that MPRE20 occurred, it out previously characterized repeat elements. Second, we occurred in this grouping. MPRE57 occurred 13 out of 14 identified all pair-wise alignments among masked times in this grouping. Naturally, we concluded that sequences using BLAST [18]. Third, we identified multiple MPRE20 (600 bp) and MPRE57 (204 bp) were two parts copy sequence segments with alignments to many sites (≥ of a longer RE that had a variable middle (100–250 bp 10). Fourth, we clustered sites linked by pair-wise align- range for all but one example). After examining the align- ments and constructed phylogenetic trees. Fifth, excessive ment in ClustalX, we could see that the middle was con- variation (2-fold) in copy number within a putative RE served except for an 84 bp deletion in one instance and a caused it to be divided; co-localization of RE among many 67 bp insertion in another. Further review of the BACs sites caused them to be merged. Sixth, we examined flank- showed that the 13 groups containing MPRE20 and ing sequences of putative RE for clues about replication MPRE57 sometimes occurred in overlapping regions machinery or to consolidate RE that should be merged. between pairs of clones in the BAC collection, meaning Seventh, we estimated the prevalence of RE in an inde- that we only had 7 unique loci plus one very unique locus pendent set of porcine sequences as well as in the that had a PRE1a (Porcine Repeat Element 1a, as identi- genomes of other species. fied by RepeatMasker) inserted into the gap. There was no pattern to the gap in the other instances. We include this Our method compared to RECON longer repeat element in our list of novel porcine repeat The bulk of the automated parts of our process, Steps 2 elements as MPRE61, which is more fully described in a through 4, were very similar to RECON [4]. RECON does later section. not appear to have analogues for Steps 1 (RepeatMasker), 5, 6, or 7. We utilized Step 1 to steer us away from previ- The final alteration to the list of MPREs was the removal ously characterized repetitive. We utilized the manually of MPRE48 due to its low copy number in the set of intensive Steps 5–7 to achieve a more parsimonious 275,595 porcine BAC-ends supplied by the Wellcome (smaller number of repeat families) than appeared to be Trust Sanger Institute (hereafter shortened to "Sanger") possible with RECON alone. In this sense, we envision [19]. Surprisingly, MPRE48 was found to appear less fre- that our method is a complement to RECON, not a quently, only six times, in BAC-ends (335.9 Mb) than in replacement. the much smaller portion of the genome spanned by the set of fully sequenced BACs (36.4 Mb) from which the Steps 1 – 4 MPREs were derived. That brings the final number of Thirty-six percent of the sequence was masked by Repeat- novel repeat elements reported here to 27, although we Masker. Comparing all unmasked sequence fragments (≥ decided against removing MPRE48 from the fasta file of 50 bp) produced 1,334,953 pair-wise alignments. One MPREs, see Additional file 1. thousand five hundred seventy-nine highly redundant sequences (totaling 1.07 Mb) were identified that had a Step 7 minimum of 10 hits for at least 50 contiguous bases. Sixty Table 1 lists the MPREs along with their observed count putative repeat element families resulted from clustering numbers in the TIGR (The Institute of Genomic Research, the 1579 highly redundant sequences. The repeat element Rockville, MD) Sus scrofa Gene Index [20] and the Sanger families were labeled MPRE1 – MPRE60 (for Meat Animal BAC-end sequences [19]. Noting that the data set of BAC- Research Center Porcine Repetitive Element). Their ends is 4.8 times larger than the TIGR Gene Index lengths ranged from 55 to 1059 bp and their copy num- (104,328 entries of expressed swine sequence totalling Page 2 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 70.0 Mb), we conclude that all the novel repeats occur less (BLAST) to the GenBank nr database and only one strong frequently in expressed sequence than in genomic DNA. hit was found. MPRE1 hit Sus scrofa interferon alpha-1 precursor with a bit score of 352, so it was eliminated The prevalence of these newly identified REs was com- from further consideration as a novel RE. For comparison, pared to that of known REs. Three of the newly discovered the highest bit score of MPREs reported here was less than porcine REs, MPRE11, 16 and 38, were more common 50. The repeats were also compared (BLAST) to vectors, than the LINE element L3 and one, MPRE42, was about as mitochondrial DNA, and tRNAs. The middle of MPRE58 common as L3 (Table 1). The other 23 MPREs have lower did have high similarity to tRNA-GLU; otherwise, there count numbers. In the Sanger archive of 275,595 BAC- were no substantial high-scoring pairs. ends, the number of elements for all SINEs was 203,206, for all LINES was 116,107 and for all LTRs (Long Terminal Discussion Repeats) was 25,066 based on RepeatMasker. Looking Certain difficulties arise when defining repeat elements. specifically at the LINEs, the most common by far was L1 One is that REs often are present as mosaics of smaller with 94,325, followed by L2 with 18,720 and L3 is third subsets of commonly occurring sequences [21,22]. with 2,358. Another is that REs can often sustain considerable muta- tions, including large truncations and insertions. Two These newly discovered repeat elements did not appear to extreme examples of this are the truncation of the 5' end be duplicated genes, LINE elements or expressed sequence during retrotransposition, and the insertion of one RE that was transposed by a LINE element. To address these into the middle of another. A third difficulty requiring res- questions, the MPREs were translated and compared olution is that segmental duplication will create very long Table 1: Count numbers for novel porcine repetitive elements Repeat name length GC content BLAST hits to SSGI BLAST hits to count number count number BLAST hits to 2 3 4 5 BAC-ends regular irregular Bovine Genome MPRE2 111 0.40 66 528 513 1000 MPRE3 411 0.51 25 392 324 57 MPRE6 255 0.55 15 392 342 0 MPRE11 76 0.33 888 8876 8040 1599 MPRE12 199 0.47 26 272 123 1051 MPRE14 234 0.57 29 292 306 1157 MPRE15 912 0.50 56 520 592 452 1475 MPRE16 276 0.46 379 4688 5051 4035 1260 MPRE17 870 0.29 5 89 75 1002 MPRE19 125 0.34 30 577 550 534 MPRE21 595 0.48 16 189 201 1604 MPRE22 166 0.46 6 83 81 0 MPRE26 324 0.50 75 475 479 1054 MPRE28 140 0.64 27 700 648 0 MPRE38 176 0.35 610 7567 7110 5806 1417 MPRE42 220 0.39 160 2350 2425 1050 MPRE44 55 0.40 22 560 551 6 MPRE49 221 0.50 3 52 50 6 MPRE50 136 0.35 40 907 871 0 MPRE51 71 0.49 17 140 112 110 MPRE52 341 0.30 28 457 703 362 643 MPRE54 326 0.46 17 121 125 62 MPRE55 161 0.52 39 247 244 1075 MPRE58 196 0.41 4 98 98 1034 MPRE59 123 0.27 207 1830 1723 1431 MPRE60 151 0.40 13 98 90 0 MPRE61 1059 0.37 2 31 41 10 The number of BLAST hits, at least half as long as the repeat element, found within the TIGR Sus scrofa Gene Index version 11, which contains 2 3,4 104,328 entries and 70.0 MB. The number of similar BLAST hits to the Sanger archive of BAC-ends that has 275,595 entries totaling 335.9 MB. The regular and irregular columns give the number of BLAST hits across the repeat element, again using the Sanger data. The regular values are the average of the middle 90% of the repeat element while the irregular values are the minimum value within the middle 80% of the repeat element. The number of BLAST hits, including those less than half the length of the repeat element, found within the whole Bovine genome (build AAFCO2). Page 3 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 repeated sequences that do not retro-transpose together, pared to a comprehensive database (NCBI BLASTX). None and therefore should be broken up into their retro-trans- of the results were similar to the possible translations of a posable component parts. RECON, the software for iden- LINE. tification of REs described by Bao and Eddy, handles all three of these difficulties [10]. Counting repeat elements is challenging Because of the degeneracy of repetitive elements it is diffi- Our approach was intentionally a bit more simplistic. We cult to arrive at an accurate count in the target genome. were able to create a much more parsimonious set of RE Another difficulty in the quantification of repeat elements than what we were able to generate with RECON. Whereas is that REs are often composed of smaller repeat units that RECON intends to recreate the full repeat elements in the occur more frequently than the larger unit [21,22]. way that will make for the best possible additions to the RepeatMasker database, as well as aid in the study of the To characterize the prevalence of MPREs, we went to an evolutionary history of the repeat elements, our goal was independent data set, the Sanger BAC-ends from the to mask out the most commonly repeated regions of the CHORI-242 library archived at Ensembl [19]. Table 1 lists porcine genome. The technique we found most useful in three different measures of prevalence of MPRE within refining the definitions of the MPREs was to plot the fre- these BAC-ends. The first measure (BLAST hits to BAC- quency of BLAST hits as a function of position within the ends) gives the number of hits that were at least half the sequence of the putative repeat elements. From the criteria length of the repeat element. An issue here is the typical used to define them, the number of hits was at least 10 size of the traces – an average of 1219 bp. The longer REs across the whole sequence – but many showed a much will tend to be under-counted due to edge effects in the higher hit frequency along part of their lengths. For pur- trace archive. The next two measures of count number poses of comparison, we applied RECON to our pair-wise were calculated by plotting the number of BLAST hits as a alignments from Step 2. RECON divided the 1,334,953 function of position on the RE. Some of the resulting plots BLAST hits into 29,631 potential repeat elements that were smooth and flat across most of the RE with an were then grouped into 14,067 families. Only 249 of expected drop-off near each end. For these "regular" plots these families had 10 or more elements. Note that it is the count number was the average value of the middle possible for a family containing only one element to cor- 90% of the plot amplitude. Other plots varied quite a bit respond to many BLAST hits. Rather than continue with in amplitude across the RE. This was likely due to sub- so many families, we found that our method yielded a repeats that hit in areas of the genome that the whole more parsimonious classification of moderately repetitive repeat did not. During this measure of count number elements. One difference between the two methods was there was no lower limit to the size of the hit other than that our method required a minimum copy number prior that needed to get the expectation value below 0.1. These to the formation of families of repeat elements. were considered irregular and the algorithm for determin- ing their count number was to take the smallest value on The MPREs have no clear connections to known proteins. the plot after ignoring the first and last 10% of the plot. A The NCBI BLASTX results for these sequences were typi- few plots were only mildly irregular, and for those both cally a combination of description-less accessions and the regular and irregular algorithms were used with both unrelated proteins in a variety of organisms. That numbers reported in Table 1. remained true when the dataset was compared to the TIGR gene index for Sus scrofa [20]. Comparing the novel repeat element content across genomes The novel repeat elements were compared to known types The sequences of novel porcine repetitive elements listed of repeats – SINEs, LINEs and LTRs – and did not fit the here were compared (BLAST [25]) to a recent build of the definitions for those classes of repeat elements. Because complete cow genome (AAFCO2 from [26]) as well as RepeatMasker would mask out low-complexity regions, against the mouse and human genomes. In the case of the methods used here would not initially find the tail mouse, there were no significant similarities found. The ends of LTRs. Each MPRE was tested for nearby low-com- comparison to the human genome yielded only one sig- plexity regions and none were consistently found. One of nificant hit – a 37 bp long section of MPRE17 (870 bp the characteristics of SINEs is the presence of tRNA coding long) matched once in chromosome 9 thousands of bp sequence in their 5 prime regions [23,24]. Only MPRE58 away from any annotated features. The comparison to the had a region similar to tRNA, and that was in the middle cow genome yielded a variety of results. Five of the 27 of its sequence. LINEs are best characterized by their two MPREs did not hit at all (MPREs 6, 22, 28, 50 and 60), and ORFs – one coding for a reverse transcriptase and the three others (MPREs 44, 49 and 61) had ten or fewer hits other for a protein with RNA binding activity [6]. All the (Table 1), despite the fact that the cow genome contains MPREs were translated to potential proteins and com- ten times more sequence than the collection of porcine Page 4 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 BAC-ends tested. Fourteen of the 27 MPREs appeared fre- resented by the sum of horizontal distances that one must quently in cow as well as pig, as indicated by having at travel along the tree to connect the two sequences. The least 1000 BLAST hits to the cow genome. leftmost part of that path represents a common ancestor. It is not surprising that the two sequences in question Not surprisingly, the bovine hits tend to be shorter than have individually diverged a significant amount from the the porcine hits because the MPREs were defined from pig original sequence of the common ancestor at that locus. sequence and as such would be expected to be more intact The more surprising result is that some of the pig and cow in porcine. What is interesting is that in both species the sequences are more similar to each other than the endpoints of the hits have a strong tendency to line up to sequences at the oldest loci. Coincidental convergence is particular spots in the MPRE, as shown in Figure 1 using an unlikely possibility. A more likely explanation is that MPRE12, 15, 17, 41, 51 and 58 as examples. Sometimes enough copies of the old sequence were created that some the common endpoints are the same in both species, of them experienced much less mutation than the sometimes not. This could be a result of the repeat ele- diverged sequences at the ancestral locus. The most recent ments being comprised of smaller repeat elements, not all common ancestors (MRCA) occurred in a narrow window of which have the same frequency of occurrence in either of time (evolutionary) relative to the full extent of the tree genome. The longer MPREs often had more than one sub- (< 1/5 of the distance from the root to the most peripheral region with multiple extra hits. This, too, could be evi- branch). The MRCA among the orthologous sites occurred dence of internal repeat structure. within the same time frame as the other MRCA. The tree clearly shows considerable radiation following speciation Figure 2 shows that MPRE55 occurs in both swine and cat- as evidenced by large genetic distances from MRCA to tle in orthologous loci. The pig BAC lies along the x-axis, peripheral tips. and the cow BAC lies along the y-axis. Also plotted are line A closer look at MPRE61 segments of high similarity between the two BACs. The preponderance of these segments demonstrates little Allelic differences or SNP can be identified from cases genomic rearrangement between species, which indicates where MPRE61 sites coincide with overlaps among CVSI that these are orthologous regions of likely common BACs. MPRE61 sites coincide with 3 pairs of overlapping ancestry between the two species. This region is highly BACs, 1 (AC145413 and AC144901), 2 (AC139879, similar to the human contig NT_005403.16 and the locus AC140099) and 3 (AC146932 and AC087424). In addi- of MPRE55 corresponds to the 3' UTR of the model gene tion, an MPRE61 site coincided with a group of 3 overlap- LOC643405, which codes for a protein similar to TGF- ping BACs, including AC138784, AC138788 and beta induced apoptosis protein 2. AC138786. Overlapping BAC pair 2 had two single base differences, and pair 3 had 3 single base differences and Because the collection of BACs spans only 1% of the one 43 bp insertion/deletion. No sequence differences whole pig or cow genome, we cannot rule out the possi- were observed within MPRE61 for pair 1 or the group of 3 bility that all of the MPREs have at least one orthologous overlapping BACs. location in both species. The fact that 12 MPREs did not have blast hits in any of the cow BACs makes it seem likely To put the apparent allelic diversity rates into context, we that those 12 are relatively recent evolutionary occur- examined the genetic sources of the DNA used to con- rences. Of the 10 MPREs that appear most frequently in struct the BAC library (RPCI-44). The source of DNA for the cow, only two, MPRE55 and MPRE59, were observed RPCI-44 was a pooled sample with equal contributions to appear in orthologous locations among the tested set of from 4 male crossbred pigs each comprised of 3/8 Lan- fully-sequenced BACs. drace, 3/8 Yorkshire and ¼ Meishan [28]. The probability of identifying SNP increases with the diversity of genomes A phylogenetic analysis was performed on the different sampled. For the cases of 2 overlapping BACs, the proba- integration sites of MPRE55 from both the cow and pig bility of sampling different genomes is 87.5%, different BAC libraries using ClustalX (see Additional file 2 for the breeds is 65.7%, and one BAC of western (Landrace or sequences), and the output (Additional file 3) was then Yorkshire) origin and the other of Meishan origin is input into R [27] to create Figure 3. The sequences that 37.5%. The probability of sampling diverse genomes is occurred at orthologous locations in swine and cattle are higher for the case of 3 overlapping BACs. The probability highlighted. As expected, the pig branches and cow of sampling more than one genome is 98.4%, more than branches tend to be separate. It is notable that the most one breed is 87.9%, and at least one BAC of western origin similar sequences that occur in both species do not come combined with one BAC of Meishan origin is 56.25%. The from orthologous locations, but seem to be found in loci fact that we didn't observe SNP in one of the three pairs of that originated after the cow and pig ancestral lines overlapping BACs is not that unusual given that the prob- diverged. The evolutionary distance between them is rep- ability of sampling identical genomes with at least one of Page 5 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 Distribution Figure 1 of BLAST hits to cow and pig DNA across selected MPREs Distribution of BLAST hits to cow and pig DNA across selected MPREs. BLAST hits plotted across MPREs 12, 15, 17, 42, 51 and 58. Along the abscissa lies each MPRE sequence and stacked above are the corresponding hits to the cow genome in blue and to pig BAC-ends in red. The hits are ordered from the top down by length. Page 6 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 the 3 pairs of overlapping BACs is 33% (1-.875 ). On the other hand, the fact that we did not observe SNP within the group of 3 overlapping BACs given the relatively high probabilities of diverse genomes being sampled is unex- pected. To bolster the relatively small number of distinct MPRE61 loci (7) identified in the CVSI BACs, we further investi- gated the prevalence and diversity of MPRE61 by cloning and sequencing PCR amplification products derived from 16 pigs sampled from 10 breeds (Berkshire, Chester White, Duroc, Hampshire, Landrace, Meishan, Pietrain, Poland China, Spot, and Yorkshire). We used primers designed to match the highly conserved parts of MPRE61 to amplify and clone (see Methods for details) multiple and variable loci for the RE that are differentiable by size as well as sequence. The different breeds showed indistin- 120 130 guishable smears on denaturing PAGE gels including many different sizes. Too many fragments and too many Swine Position, Kb sizes were present to identify allelic differences in sizes among animals. The PCR products were sequenced to yield 91 reads that were not bacterial or vector contamina- tion. The 91 sequences (listed as a fasta file in Additional file 4) were analyzed with Clustal X (creating a dendro- gram file, Additional file 5) and displayed in Figure 4 as a phylogenetic tree. The topology of the tree (number of diverse nodes) is consistent with the estimated copy number of 300 sites in the whole genome given in Table 1. We speculate that the more similar sequences repre- sented as tips close (with few sequence differences) to their common ancestor are probably allelic differences at the same locus. On the other hand, the more diverse tips and peripheral nodes probably represent different sites or loci. The amount of sequence diversity presented in Figure 4 supports the idea that individual integration sites (loci) and alleles of repetitive elements can be uniquely identi- 80 120 160 fied by high-throughput array based assays by hybridizing samples to short probes. This demonstrates that repetitive Swine Position, Kb DNA with similar properties to MPRE61 (i.e., prevalence and diversity) can be harnessed for genetic and physical mapping [29]. This dispels the long standing myth that (a a Figure 2 nd b) – MPRE55 in homologous positions in pig and cow (a and b) – MPRE55 in homologous positions in pig repetitive DNA should always be avoided because it is and cow. MPRE55 exists in homologous positions in pig and intractable. Our results indicate that some classes (low to cow. Along the horizontal axis lies the pig BAC with acces- intermediate copy number and highly diverse) of repeti- sion number AC147198. Along the vertical axis lies the cow tive DNA would be tractable with high-throughput tech- BAC with accession number AC138165. The numerous line nologies. segments are BLAST hits between the two BACS that have bit scores of at least 100. Dashed lines are drawn through MPRE61 size differences are not randomly distributed the positions on the BACs where MPRE55 is located. The throughout the phylogenetic tree. Different sizes cluster circle indicates the region containing MPRE55 that is on different branches of the tree; however, the clustering expanded and shown in Figure 2(b). is not complete. This indicates that insertions and dele- tions (evolutionary events that cause size differences) occurred throughout the evolution of MPRE61, and in some cases while the element was still replicating. The Page 7 of 12 (page number not for citation purposes) Cow Position, Kb Cow Position, Kb 040 80 45 50 55 BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 Swine Cow OMRCA ● ● MRCA ● ● ● ● Phylog Figure 3 eny of MPRE55 in pig and cow Phylogeny of MPRE55 in pig and cow. The phylogram displays the BLAST hits obtained from querying MPRE55 against the fully sequenced BAC libraries for pig and cow. The red dots indicate examples of MPRE55 from the cow and the black dots indicate pig examples. The orthologous sites depicted in Figure 2 are noted by the grey dashed lines and the word "ortho- logues." Also shown are the Most Recent Common Ancestors (MRCA) between species in green and, in blue, the MRCA for the 2 orthologues (OMRCA). In both cases the BACs covered about 1% of the total genome. The MRCA lie within a relatively narrow band of time consistent with a single speciation event and there appears to be considerable radiation among elements following speciation (i.e., time frame spanning MRCA). incomplete clustering of sizes indicates evolutionary plas- in the RepeatMasker library. Several other examples ticity and as a result recurrent insertions and deletions. existed of PRE1 next to a section of MPRE61, but the trace end occurred next to the PRE1, so that it may or may not MPRE61 was further characterized by plotting BLAST hits have had the continuing section of MPRE61 on its other of it to the 275,595 sequences in the trace archive of BAC- side. No other REs were found to be incorporated into ends submitted by Sanger. These were plotted along with MPRE61, suggesting that MPRE61 replicated relatively the repeat elements recognized by RepeatMasker. The recently. Another interesting observation was that the most interesting observations included the fact that three density of REs on the 3' side of MPRE61 was much higher times among the 140 hits a PRE1 was incorporated into than on the 5' side. To take a closer look at this, we col- MPRE61. PRE1 is a porcine specific SINE that is included lected the trace sequence 3' of the 62 hits that ended near Page 8 of 12 (page number not for citation purposes) Orthologues BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 Size Range, bp ● ● (418,492] (492,519] (519,594] (594,677] ● ● ● (677,704] Diver Figure 4 sity of MPRE61 across ten breeds of pig Diversity of MPRE61 across ten breeds of pig. This phylogram displays the variety of sequences obtained by amplifying MPRE61 in 16 DNA samples from ten breeds of pig. Size differences are highlighted using colored dots according to the legend. Size cut-offs were chosen to lie between modes of the size distribution which were well separated. (within 60 bp of) the 3' end of MPRE61 (length of 1059 side of MPRE61, particularly in the region closest to the bp). This flanking sequence, ranging in length from 12 to end of MPRE61. For the 22 LINEs that occur within 80 bp 1368 nucleotides, was analyzed for repeat content and of the end of MPRE61, 15 are oriented on the opposite distance of that content from the end of MPRE61 (Figure strand and 7 on the same strand. At this point, there is no 5). Running RepeatMasker on the entire collection way to know which strand of MPRE61 might be tran- (275,595 sequences) of Sanger BAC-ends shows that the scribed. We arbitrarily chose one of the strands and used number of SINE elements is 75% greater than the number it consistently. Because the LINEs have a particular inter- of LINE elements (203,206 vs. 116,107). The LINEs tend nal structure, the 5' and 3' ends are well defined. So to be longer than the SINEs, so the total percentage of another way of looking at the result would be to say that sequence occupied by the LINEs is actually larger (13.29% the LINEs occur on the 5' end of MPRE61 (or rather, its vs. 10.29%). The most obvious feature of Figure 5 is that reverse complement) with 15 on the same strand and 7 on LINEs are significantly over represented on the 3 prime the opposite strand. Either way, there is less strand conser- Page 9 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 Repeat elem Figure 5 ents that flank the 3' end of MPRE61 Repeat elements that flank the 3' end of MPRE61. The repeat content of 62 BAC-end sequences flanking the 3' end of MPRE61. The origin on the horizontal axis is the last position that matches the 3' end (minimum position within the repeat of 1000 out of the full 1059 bp length) of MPRE61. The 62 flanking sequences are ordered with the longest at the top and the shortest at the bottom. The horizontal position is the distance from the 3' end of the hit to MPRE61. Colored arrows are superimposed on the dotted outline of the flanking sequence to indicate the repeat elements that RepeatMasker found. Page 10 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 vation than would be expected if MPRE61 used LINEs as (positions 278–298 and 953-935 (5' to 3' on opposite a vehicle for either replication or integration. strand) of MPRE61, respectively). PCR was performed in a PTC-225 DNA engine (MJ Research Inc, Watertown, MA) using 0.25 U Hot Star Taq polymerase (Qiagen, Conclusion From our experience, it seems that although some availa- Valencia, CA, USA), 1X of supplied buffer, 1.5 mM MgCl2, ble programs may help with the process of identification 200 µM dNTPs, 0.8 µM each primer, and 100 ng of of REs, a level of judiciousness is also required. The BLAST genomic DNA in 25 µl reactions. The PCR mixture was and phylogenetic analyses are proven to be useful to held at 94°C for 15 min, and cycled 44 times at 94°C for improve the efficacy, particularly when comparisons are 20 sec, held at 57°C annealing temperature for 30 sec and made across species. Discovering the RE in one dataset extension at 72°C for 1.5 min, followed by a final exten- and characterizing their prevalence and diversity in sion at 72°C for 5 min. Five µl of the PCR reaction was another was crucial to our effort. electrophoresed in 1.5% agarose gels to determine quality of amplification and a portion (2–4 µl) was used for clon- Using an approach similar to previously published work ing in pCR4-TOPO vector (Invitrogen, Carlsbad, CA). but modified to fit our specific goals and data, several Plasmid DNA was prepared using standard alkaline lysis repetitive elements were identified in porcine and bovine and PTFE filter plates (Millipore, Bedford, MA) and was genomes that do not exist in mouse or human. These ele- sequenced with T7 primer. ments do not contain signatures of previously identified retrotransposons, but seem to have undergone replication Authors' contributions and mutation. Because these elements are in a lower copy RW performed the bioinformatic analysis and drafted the number than most of the REs that make up mammalian manuscript. DN carried out the molecular genetics studies genomes, they could be exploited in mapping or whole- and helped draft the manuscript. JK designed and coordi- genome association studies. As the porcine genome nated the study and helped draft the manuscript. All sequencing effort progresses, we should know more about authors have read and approved the final manuscript. the distribution, history and possible contribution of these repeats to the genomic architecture in artiodactyls. Additional material The genuine challenge of genome sequencing and assem- Additional file 1 bly would be enhanced with an improved understanding Fasta file of novel porcine repetitive elements. Each definition line of repeat elements and their distributions, especially those includes an accession number along with the start and end positions for repeat elements that are species specific. that repetitive element. Click here for file [http://www.biomedcentral.com/content/supplementary/1471- Methods 2164-7-304-S1.fas] Bioinformatics Two hundred-twenty fully sequenced porcine BACs gener- Additional file 2 ated by the Comparative Vertebrate Sequencing Initiative Fasta file that provides the sequences used to create Figure 3. The def- [30,31] were downloaded from the RPCI-44 clone library, inition lines include the accession number with the start and end positions totaling 36.4 Mb. RepeatMasker [17] masked out 36% of of the sequence. this sequence. All unmasked fragments of sequence that Click here for file were at least 50 bp long were compared (BLAST) to the [http://www.biomedcentral.com/content/supplementary/1471- 2164-7-304-S2.fas] original data set. The BLAST parameters used were those recommended by Korf et al. (2003) for finding repeat ele- Additional file 3 ments, namely -r 1 -q -1 -G 2 -E 2 -W 9 -F "m D" -e 1 for Dendrogram file used to create Figure 3. The format is the standard NCBI-BLAST [32]. The output, which contained output of ClustalX and can be read by various tree viewing and tree mak- 1,334,953 hits, was analyzed using two similar methods. ing software, including R when using the packages sequinr and ape. Each One was to use the RECON software [10] downloaded vertex is labelled consistently with the corresponding fasta file. from its website [33] and the other used separate, original Click here for file [http://www.biomedcentral.com/content/supplementary/1471- PERL scripts that performed several of the same functions 2164-7-304-S3.dnd] included in the RECON package. Additional file 4 PCR and sequencing Fasta file of cloned MPRE61 sequences. The definition line refers to an Primer pairs for amplification of genomic DNA were arbitrary sequence ID generated at USMARC. designed from consensus MPRE61 sequences using Click here for file Primer3 [34]. Primer sequences were 5'-TTTTCCTGTGGT- [http://www.biomedcentral.com/content/supplementary/1471- 2164-7-304-S4.fas] GATTTGTGA-3' and 5'-GGGCGCTGGACTGCTCAAA-3' Page 11 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 17. Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0 [http:// www.repeatmasker.org]. 1996–2004 Additional file 5 18. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local Dendrogram file of cloned MPRE61 sequences used to construct Fig- alignment search tool. J Mol Biol 1990, 215:403-410. ure 4. The format is the standard output of ClustalX and can be read by 19. The Ensembl archive of swine (Sus scrofa) sequences [ftp:// ftp.ensembl.org/pub/traces/sus_scrofa/fasta/] various tree viewing and tree making software, including R when using 20. The Gene Index Project [http://compbio.dfci.harvard.edu/tgi/] the packages sequinr and ape. Each vertex is labelled consistently with the 21. Pevzner PA, Tang H, Tesler G: De novo repeat classification and corresponding fasta file. fragment assembly. Genome Research 2004, 14:1786-1796. Click here for file 22. Zhi D, Raphael BJ, Price AL, Tang H, Pevzner PA: Identifying repeat [http://www.biomedcentral.com/content/supplementary/1471- domains in large genomes. Genome Biology 2006, 7:R7. 2164-7-304-S5.dnd] 23. Shedlock AM, Okada N: SINE insertions: powerful tools for molecular systematics. BioEssays 2000, 22:148-160. 24. Shimamura M, Abe H, Nikaido M, Ohshima K, Okada N: Geneology of families of SINEs in Cetaceans and Artiodactyls: The pres- GLU ence of a huge superfamily of tRNA -derived families of SINEs. Mol Biol Evol 1999, 16(8):1046-1060. Acknowledgements 25. NCBI BLAST cow sequences [http://www.ncbi.nlm.nih.gov/ The authors wish to thank Sue Hauver for expert technical assistance. Men- genome/seq/BlastGen/BlastGen.cgi?taxid=9913] tion of trade names or commercial products is solely for the purpose of 26. Bovine Genome Project [http://www.hgsc.bcm.tmc.edu/ providing information and does not imply recommendation, endorsement projects/bovine/] 27. The R Project for Statistical Computing [http://www.r- or exclusion of other suitable products by the U.S. Department of Agricul- project.org/] ture. This work was supported by USDA CRIS Project No. 5438-31000- 28. BAC PAC Resources, Children's Hospital Oakland Research 071-00D and 5438-31000-073-00D. Institute (CHORI) [http://bacpac.chori.org/mporcine44.htm] 29. Hafez EE, Ghany AGAA, Zaki EA: LTR- retrotransposons-based molecular markers in cultivated Egyptian cottons G. bar- References badense L. African Journal of Biotechnology 2006, 5:1200-1204. 1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, 30. Comparative Vertebrate Sequencing Initiative [http:// Devon K, Dewar K, Doyle M, FitzHugh W, International Human www.nisc.nih.gov/] Genome Sequencing Consortium, et al.: Initial sequencing and 31. Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing analysis of the human genome. Nature 2001, 409:860-921. Program, Green ED, Batzoglou S, Sidow A: Distribution and inten- 2. Han JS, Szak ST, Boeke JD: Transcriptional disruption by the L1 sity of constraint in mammalian genomic sequence. Genome retrotransposon and implications for mammalian transcrip- Res 2005, 15:901-13. tomes. Nature 2004, 429:268-274. 32. Korf I, Yandell M, Bedell J: BLAST. O'Reilly & Associates; 2003:143. 3. Fondon JW III, Garner HR: Molecular origins of rapid and con- 33. RECON software package [http://selab.janelia.org/recon.html] tinuous morphological evolution. PNAS 2004, 34. Rozen S, Skaletsky HJ: Primer3 on the WWW for general users 101(52):18058-18063. and for biologist programmers. Bioinformatics Methods and Proto- 4. Singer MF: SINEs and LINEs: highly repeated short and long cols Methods in Molecular Biology 2000:365-386 [http:// interspersed sequences in mammalian genomes. Cell 1982, frodo.wi.mit.edu/primer3/primer3_code.html]. Humana Press, 28:433-434. Totowa, NJ 5. Singer M, Berg P: Genes and Genomes. University Science Books, Mill Valley, California; 1991. 6. Bennett EA, Coleman LE, Tsui C, Pittart WS, Devine SE: Natural genetic variation caused by transposable elements in humans. Genetics 2004, 168:933-951. 7. Nekrutenko A, Li W-H: Transposable elements are found in a large number of human protein-coding genes. Trends in Genet- ics 2001, 17(11):619-621. 8. Deininger PL, Batzer MA: Mammalian retroelements. Genome Research 2002, 12:1455-1465. 9. Holmes I: Transcendent elements: whole-genome transposon screens and open evolutionary questions. Genome Research 2002, 12:1152-1155. 10. Bao Z, Eddy SR: Automated de novo identification of repeat sequence families in sequenced genomes. Genome Research 2002, 12:1269-1276. 11. Campagna D, Romualdi C, Vitulo N, Del Favero M, Lexa M, Cannata N, Valle G: RAP: a new computer program for de novo identi- fication of repeated sequences in whole genomes. Bioinformat- ics 2005, 21(5):582-588. Publish with Bio Med Central and every 12. Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics 2005, 21(Suppl scientist can read your work free of charge 1):i351-i358. "BioMed Central will be the most significant development for 13. Edgar RC, Myers EW: PILER: identification and classification of disseminating the results of biomedical researc h in our lifetime." genomic repeats. Bioinformatics 2005, 21(Suppl 1):i152-i158. 14. Taneda A: Adplot: detection and visualization of repetitive Sir Paul Nurse, Cancer Research UK patterns in complete genomes. Bioinformatics 2004, 20(5):701-708. Your research papers will be: 15. Caspi A, Pachter L: Identification of transposable elements available free of charge to the entire biomedical community using multiple alignments of related genomes. Genome peer reviewed and published immediately upon acceptance Research 2006, 16:260-270. 16. Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ash- cited in PubMed and archived on PubMed Central burner M, Anxolabehere D: Combined evidence annotation of yours — you keep the copyright transposable elements in genome sequences. PLoS Comp Biol 2005, 1(2):e22. BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 12 of 12 (page number not for citation purposes) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Genomics Springer Journals http://www.deepdyve.com/lp/springer-journals/novel-porcine-repetitive-elements-jljemzM1zU

Loading next page...

References (46)

H. Quesneville, C. Bergman, O. Andrieu, D. Autard, D. Nouaud, M. Ashburner, D. Anxolabéhère (2005)
Combined Evidence Annotation of Transposable Elements in Genome Sequences
PLoS Computational Biology, 1
E. Lander, L. Linton, B. Birren, C. Nusbaum, M. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. Fitzhugh, R. Funke, D. Gage, K. Harris, A. Heaford, J. Howland, L. Kann, J. Lehoczky, R. Levine, P. McEwan, K. McKernan, J. Meldrim, J. Mesirov, C. Miranda, W. Morris, J. Naylor, C. Raymond, M. Rosetti, R. Santos, A. Sheridan, C. Sougnez, N. Stange-thomann, N. Stojanovic, A. Subramanian, D. Wyman, J. Rogers, J. Sulston, R. Ainscough, S. Beck, D. Bentley, J. Burton, C. Clee, N. Carter, A. Coulson, R. Deadman, P. Deloukas, A. Dunham, I. Dunham, R. Durbin, L. French, D. Grafham, S. Gregory, T. Hubbard, S. Humphray, A. Hunt, M. Jones, C. Lloyd, A. McMurray, L. Matthews, S. Mercer, S. Milne, J. Mullikin, A. Mungall, R. Plumb, M. Ross, R. Shownkeen, S. Sims, R. Waterston, R. Wilson, L. Hillier, J. McPherson, M. Marra, E. Mardis, L. Fulton, A. Chinwalla, K. Pepin, W. Gish, S. Chissoe, M. Wendl, K. Delehaunty, T. Miner, A. Delehaunty, J. Kramer, L. Cook, R. Fulton, D. Johnson, P. Minx, S. Clifton, T. Hawkins, E. Branscomb, P. Predki, P. Richardson, S. Wenning, T. Slezak, N. Doggett, Jan-Fang Cheng, A. Olsen, S. Lucas, C. Elkin, E. Uberbacher, M. Frazier, R. Gibbs, D. Muzny, S. Scherer, J. Bouck, E. Sodergren, K. Worley, C. Rives, J. Gorrell, M. Metzker, S. Naylor, R. Kucherlapati, D. Nelson, G. Weinstock, Y. Sakaki, A. Fujiyama, M. Hattori, T. Yada, A. Toyoda, T. Itoh, C. Kawagoe, Hidemi Watanabe, Y. Totoki, T. Taylor, J. Weissenbach, R. Heilig, W. Saurin, F. Artiguenave, P. Brottier, T. Brüls, É. Pelletier, C. Robert, P. Wincker, A. Rosenthal, M. Platzer, G. Nyakatura, S. Taudien, A. Rump, Douglas Smith, L. Doucette-Stamm, M. Rubenfield, K. Weinstock, M. Hong, J. Dubois, Huanming Yang, Jun Yu, Jian Wang, G. Huang, J. Gu, L. Hood, L. Rowen, A. Madan, S. Qin, Ronald Davis, N. Federspiel, A. Abola, M. Proctor, B. Roe, Feng Chen, H. Pan, J. Ramser, H. Lehrach, R. Reinhardt, W. McCombie, M. Bastide, N. Dedhia, H. Blöcker, K. Hornischer, G. Nordsiek, R. Agarwala, L. Aravind, J. Bailey, A. Bateman, S. Batzoglou, E. Birney, P. Bork, Daniel Brown, C. Burge, L. Cerutti, Hsiu-Chuan Chen, D. Church, M. Clamp, R. Copley, T. Doerks, S. Eddy, E. Eichler, T. Furey, J. Galagan, J. Gilbert, C. Harmon, Y. Hayashizaki, D. Haussler, H. Hermjakob, K. Hokamp, W. Jang, L. Johnson, T. Jones, S. Kasif, A. Kaspryzk, S. Kennedy, W. Kent, P. Kitts, E. Koonin, I. Korf, D. Kulp, D. Lancet, T. Lowe, A. McLysaght, T. Mikkelsen, J. Moran, N. Mulder, V. Pollara, C. Ponting, G. Schuler, J. Schultz, G. Slater, A. Smit, E. Stupka, J. Szustakowki, D. Thierry-Mieg, J. Thierry-Mieg, L. Wagner, J. Wallis, R. Wheeler, Alan Williams, Y. Wolf, K. Wolfe, Shiaw-Pyng Yang, R. Yeh, F. Collins, M. Guyer, Jane Peterson, A. Felsenfeld, K. Wetterstrand, R. Myers, J. Schmutz, M. Dickson, J. Grimwood, D. Cox, M. Olson, R. Kaul, C. Raymond, N. Shimizu, K. Kawasaki, S. Minoshima, G. Evans, M. Athanasiou, R. Schultz, A. Patrinos, M. Morgan, P. Jong, J. Catanese, K. Osoegawa, H. Shizuya, Sangdun Choi, Yu Chen (2001)
Erratum: Initial sequencing and analysis of the human genome: International Human Genome Sequencing Consortium (Nature (2001) 409 (860-921))
Nature, 412
A. Caspi, L. Pachter (2005)
Identification of transposable elements using multiple alignments of related genomes.
Genome research, 16 2
I. Holmes (2002)
Transcendent elements: whole-genome transposon screens and open evolutionary questions.
Genome research, 12 8
EE Hafez, AGAA Ghany, EA Zaki (2006)
LTR- retrotransposons-based molecular markers in cultivated Egyptian cottons G. barbadense L
African Journal of Biotechnology, 5
Comparative Vertebrate Sequencing Initiative
Robert Edgar, Eugene Myers (2005)
PILER: identification and classification of genomic repeats
Bioinformatics, 21 Suppl 1
The Ensembl archive of swine (Sus scrofa) sequences
I Korf, M Yandell, J Bedell (2003)
BLAST
D. Botstein (1999)
Of Genes and Genomes
Annals of the New York Academy of Sciences, 882
D Campagna, C Romualdi, N Vitulo, Favero Del, M Lexa, N Cannata, G Valle (2005)
RAP: a new computer program for de novo identification of repeated sequences in whole genomes
Bioinformatics, 21
BAC PAC Resources, Children's Hospital Oakland Research Institute (CHORI)
M. Singer (1982)
SINEs and LINEs: Highly repeated short and long interspersed sequences in mammalian genomes
Cell, 28
SF Altschul, W Gish, W Miller, EW Myers, DJ Lipman (1990)
Basic local alignment search tool
J Mol Biol, 215
A. Taneda (2004)
Adplot: detection and visualization of repetitive patterns in complete genomes
Bioinformatics, 20 5
Prescott Deininger, M. Batzer (2002)
Mammalian retroelements.
Genome research, 12 10
The Ensembl archive of swine (Sus scrofa) sequences [ftp:// ftp.ensembl.org/pub/traces/sus_scrofa/fasta
S Rozen, HJ Skaletsky (2000)
Bioinformatics Methods and Protocols Methods in Molecular Biology
Children's Hospital Oakland Research Institute (CHORI
Bovine Genome Project
Z. Bao, S. Eddy (2002)
Automated de novo identification of repeat sequence families in sequenced genomes.
Genome research, 12 8
G. Cooper, Eric Stone, G. Asimenos, E. Green, S. Batzoglou, A. Sidow (2005)
Distribution and intensity of constraint in mammalian genomic sequence.
Genome research, 15 7
J. Fondon, H. Garner (2004)
Molecular origins of rapid and continuous morphological evolution
Proceedings of the National Academy of Sciences of the United States of America, 101
A. Price, Neil Jones, P. Pevzner (2005)
De novo identification of repeat families in large genomes
Bioinformatics, 21 Suppl 1
The R Project for Statistical Computing
E. Bennett, Laura Coleman, Circe Tsui, W. Pittard, S. Devine (2004)
Natural Genetic Variation Caused by Transposable Elements in Humans
Genetics, 168
S. Altschul, W. Gish, W. Miller, E. Myers, D. Lipman (1990)
Basic local alignment search tool.
Journal of molecular biology, 215 3
S. Rozen, H. Skaletsky (2000)
Primer3 on the WWW for general users and for biologist programmers.
Methods in molecular biology, 132
RECON software package
Anton Nekrutenko, Wen-Hsiung Li (2001)
Transposable elements are found in a large number of human protein-coding genes.
Trends in genetics : TIG, 17 11
A. Shedlock, N. Okada (2000)
SINE insertions: powerful tools for molecular systematics.
BioEssays : news and reviews in molecular, cellular and developmental biology, 22 2
(2004)
BIOINFORMATICS ORIGINAL PAPER Sequence analysis RAP: a new computer program for de novo identification of repeated sequences in whole genomes
Degui Zhi, Benjamin Raphael, A. Price, Haixu Tang, P. Pevzner (2006)
Identifying repeat domains in large genomes
Genome Biology, 7
International Consortium (2001)
Initial sequencing and analysis of the human genome
Nature, 409
E. Hafez, A. Ghany, E. Zaki (2006)
Full Length Research Paper LTR-retrotransposons-based molecular markers in cultivated Egyptian cottons G. barbadense L.
African Journal of Biotechnology, 5
Paul Pevzner, Haixu Tang, G. Tesler (2004)
Glenn Tesler De Novo Repeat Classification and Fragment Assembly data
PA Pevzner, H Tang, G Tesler (2004)
De novo repeat classification and fragment assembly
Genome Research, 14
Jeffrey Han, Suzanne Szak, J. Boeke (2004)
Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes
Nature, 429
The Gene Index Project
S Rozen, HJ Skaletsky, S Krawetz (2000)
Primer3 on the WWW for general users and for biologist programmers
Bioinformatics Methods and Protocols Methods in Molecular Biology
M Singer, P Berg (1991)
Genes and Genomes
(2004)
RepeatMasker Open-3.0 [http:// www.repeatmasker.org
M. Shimamura, Hideaki Abe, M. Nikaido, K. Ohshima, Norihiro Okada (1999)
Genealogy of families of SINEs in cetaceans and artiodactyls: the presence of a huge superfamily of tRNA(Glu)-derived families of SINEs.
Molecular biology and evolution, 16 8
NCBI BLAST cow sequences
(1991)
Genes and Genomes. University Science Books
(2004)
RepeatMasker Open-3

Publisher: Springer Journals
Copyright: Copyright © 2006 by Wiedmann et al; licensee BioMed Central Ltd.
Subject: Life Sciences; Life Sciences, general; Microarrays; Proteomics; Animal Genetics and Genomics; Microbial Genetics and Genomics; Plant Genetics & Genomics
eISSN: 1471-2164
DOI: 10.1186/1471-2164-7-304
pmid: 17140439
Publisher site: See Article on Publisher Site

Abstract

Background: Repetitive elements comprise ~45% of mammalian genomes and are increasingly known to impact genomic function by contributing to the genomic architecture, by direct regulation of gene expression and by affecting genomic size, diversity and evolution. The ubiquity and increasingly understood importance of repetitive elements contribute to the need to identify and annotate them. We set out to identify previously uncharacterized repetitive DNA in the porcine genome. Once found, we characterized the prevalence of these repeats in other mammals. Results: We discovered 27 repetitive elements in 220 BACs covering 1% of the porcine genome (Comparative Vertebrate Sequencing Initiative; CVSI). These repeats varied in length from 55 to 1059 nucleotides. To estimate copy numbers, we went to an independent source of data, the BAC- end sequences (Wellcome Trust Sanger Institute), covering approximately 15% of the porcine genome. Copy numbers in BAC-ends were less than one hundred for 6 repeat elements, between 100 and 1000 for 16 and between 1,000 and 10,000 for 5. Several of the repeat elements were found in the bovine genome and we have identified two with orthologous sites, indicating that these elements were present in their common ancestor. None of the repeat elements were found in primate, rodent or dog genomes. We were unable to identify any of the replication machinery common to active transposable elements in these newly identified repeats. Conclusion: The presence of both orthologous and non-orthologous sites indicates that some sites existed prior to speciation and some were generated later. The identification of low to moderate copy number repetitive DNA that is specific to artiodactyls will be critical in the assembly of livestock genomes and studies of comparative genomics. tate REs [9]. In recent years, several attempts have been Background Repetitive elements comprise ~45% [1] of mammalian made to automate the process of de novo identification genomes and are increasingly known to impact genomic and characterization of REs [10-16]. The algorithms take function by contributing to the genomic architecture, by into account the likely evolutionary history of the REs – direct regulation of gene expression [2,3] and by affecting not only genetic drift, but also the processes that lead to genomic size, diversity and evolution [4-8]. The ubiquity the juxtaposition of REs [10]. Because knowing the evolu- and increasingly understood importance of repetitive ele- tionary history of each RE helps to define the type of RE, ments (REs) contribute to the need to identify and anno- these algorithms are valuable not only in identifying Page 1 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 repetitive sequence, but also in increasing our under- bers (across the 220 discovery BACs) ranged from 12 to standing of the evolutionary role of the identified RE. Our 1102. initial attempt was to identify novel repetitive DNA with a program called RECON [10], which produced 14,067 Steps 5 – 6 families of REs with 249 of those having count numbers The 60 original MPREs were consolidated into 31 because of 10 or more. We decided a different approach was of overlap or co-localization at multiple sites. Twenty- needed that would organize closely related elements in a nine MPREs were absorbed into 31; the 31 original MPRE parsimonious way. In this paper, we describe 27 novel identifiers of the longer sequences were kept to maintain porcine repetitive elements and estimate their prevalence provenance. In addition, there were three combinations in swine and other species. (MPRE20 and 57; MPRE15, 17, 19 and 26; MPRE44, 50 and 52) of repeats that frequently appeared together in the same order with some variation in their relative spacing. Results We identified repetitive elements using a procedure simi- The most consistent group contained two elements – lar to previously published methods [10,11]. First, we MPRE20 in reverse complement followed by a small gap, used RepeatMasker [17] on the BAC sequences to mask then MPRE57. All thirteen times that MPRE20 occurred, it out previously characterized repeat elements. Second, we occurred in this grouping. MPRE57 occurred 13 out of 14 identified all pair-wise alignments among masked times in this grouping. Naturally, we concluded that sequences using BLAST [18]. Third, we identified multiple MPRE20 (600 bp) and MPRE57 (204 bp) were two parts copy sequence segments with alignments to many sites (≥ of a longer RE that had a variable middle (100–250 bp 10). Fourth, we clustered sites linked by pair-wise align- range for all but one example). After examining the align- ments and constructed phylogenetic trees. Fifth, excessive ment in ClustalX, we could see that the middle was con- variation (2-fold) in copy number within a putative RE served except for an 84 bp deletion in one instance and a caused it to be divided; co-localization of RE among many 67 bp insertion in another. Further review of the BACs sites caused them to be merged. Sixth, we examined flank- showed that the 13 groups containing MPRE20 and ing sequences of putative RE for clues about replication MPRE57 sometimes occurred in overlapping regions machinery or to consolidate RE that should be merged. between pairs of clones in the BAC collection, meaning Seventh, we estimated the prevalence of RE in an inde- that we only had 7 unique loci plus one very unique locus pendent set of porcine sequences as well as in the that had a PRE1a (Porcine Repeat Element 1a, as identi- genomes of other species. fied by RepeatMasker) inserted into the gap. There was no pattern to the gap in the other instances. We include this Our method compared to RECON longer repeat element in our list of novel porcine repeat The bulk of the automated parts of our process, Steps 2 elements as MPRE61, which is more fully described in a through 4, were very similar to RECON [4]. RECON does later section. not appear to have analogues for Steps 1 (RepeatMasker), 5, 6, or 7. We utilized Step 1 to steer us away from previ- The final alteration to the list of MPREs was the removal ously characterized repetitive. We utilized the manually of MPRE48 due to its low copy number in the set of intensive Steps 5–7 to achieve a more parsimonious 275,595 porcine BAC-ends supplied by the Wellcome (smaller number of repeat families) than appeared to be Trust Sanger Institute (hereafter shortened to "Sanger") possible with RECON alone. In this sense, we envision [19]. Surprisingly, MPRE48 was found to appear less fre- that our method is a complement to RECON, not a quently, only six times, in BAC-ends (335.9 Mb) than in replacement. the much smaller portion of the genome spanned by the set of fully sequenced BACs (36.4 Mb) from which the Steps 1 – 4 MPREs were derived. That brings the final number of Thirty-six percent of the sequence was masked by Repeat- novel repeat elements reported here to 27, although we Masker. Comparing all unmasked sequence fragments (≥ decided against removing MPRE48 from the fasta file of 50 bp) produced 1,334,953 pair-wise alignments. One MPREs, see Additional file 1. thousand five hundred seventy-nine highly redundant sequences (totaling 1.07 Mb) were identified that had a Step 7 minimum of 10 hits for at least 50 contiguous bases. Sixty Table 1 lists the MPREs along with their observed count putative repeat element families resulted from clustering numbers in the TIGR (The Institute of Genomic Research, the 1579 highly redundant sequences. The repeat element Rockville, MD) Sus scrofa Gene Index [20] and the Sanger families were labeled MPRE1 – MPRE60 (for Meat Animal BAC-end sequences [19]. Noting that the data set of BAC- Research Center Porcine Repetitive Element). Their ends is 4.8 times larger than the TIGR Gene Index lengths ranged from 55 to 1059 bp and their copy num- (104,328 entries of expressed swine sequence totalling Page 2 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 70.0 Mb), we conclude that all the novel repeats occur less (BLAST) to the GenBank nr database and only one strong frequently in expressed sequence than in genomic DNA. hit was found. MPRE1 hit Sus scrofa interferon alpha-1 precursor with a bit score of 352, so it was eliminated The prevalence of these newly identified REs was com- from further consideration as a novel RE. For comparison, pared to that of known REs. Three of the newly discovered the highest bit score of MPREs reported here was less than porcine REs, MPRE11, 16 and 38, were more common 50. The repeats were also compared (BLAST) to vectors, than the LINE element L3 and one, MPRE42, was about as mitochondrial DNA, and tRNAs. The middle of MPRE58 common as L3 (Table 1). The other 23 MPREs have lower did have high similarity to tRNA-GLU; otherwise, there count numbers. In the Sanger archive of 275,595 BAC- were no substantial high-scoring pairs. ends, the number of elements for all SINEs was 203,206, for all LINES was 116,107 and for all LTRs (Long Terminal Discussion Repeats) was 25,066 based on RepeatMasker. Looking Certain difficulties arise when defining repeat elements. specifically at the LINEs, the most common by far was L1 One is that REs often are present as mosaics of smaller with 94,325, followed by L2 with 18,720 and L3 is third subsets of commonly occurring sequences [21,22]. with 2,358. Another is that REs can often sustain considerable muta- tions, including large truncations and insertions. Two These newly discovered repeat elements did not appear to extreme examples of this are the truncation of the 5' end be duplicated genes, LINE elements or expressed sequence during retrotransposition, and the insertion of one RE that was transposed by a LINE element. To address these into the middle of another. A third difficulty requiring res- questions, the MPREs were translated and compared olution is that segmental duplication will create very long Table 1: Count numbers for novel porcine repetitive elements Repeat name length GC content BLAST hits to SSGI BLAST hits to count number count number BLAST hits to 2 3 4 5 BAC-ends regular irregular Bovine Genome MPRE2 111 0.40 66 528 513 1000 MPRE3 411 0.51 25 392 324 57 MPRE6 255 0.55 15 392 342 0 MPRE11 76 0.33 888 8876 8040 1599 MPRE12 199 0.47 26 272 123 1051 MPRE14 234 0.57 29 292 306 1157 MPRE15 912 0.50 56 520 592 452 1475 MPRE16 276 0.46 379 4688 5051 4035 1260 MPRE17 870 0.29 5 89 75 1002 MPRE19 125 0.34 30 577 550 534 MPRE21 595 0.48 16 189 201 1604 MPRE22 166 0.46 6 83 81 0 MPRE26 324 0.50 75 475 479 1054 MPRE28 140 0.64 27 700 648 0 MPRE38 176 0.35 610 7567 7110 5806 1417 MPRE42 220 0.39 160 2350 2425 1050 MPRE44 55 0.40 22 560 551 6 MPRE49 221 0.50 3 52 50 6 MPRE50 136 0.35 40 907 871 0 MPRE51 71 0.49 17 140 112 110 MPRE52 341 0.30 28 457 703 362 643 MPRE54 326 0.46 17 121 125 62 MPRE55 161 0.52 39 247 244 1075 MPRE58 196 0.41 4 98 98 1034 MPRE59 123 0.27 207 1830 1723 1431 MPRE60 151 0.40 13 98 90 0 MPRE61 1059 0.37 2 31 41 10 The number of BLAST hits, at least half as long as the repeat element, found within the TIGR Sus scrofa Gene Index version 11, which contains 2 3,4 104,328 entries and 70.0 MB. The number of similar BLAST hits to the Sanger archive of BAC-ends that has 275,595 entries totaling 335.9 MB. The regular and irregular columns give the number of BLAST hits across the repeat element, again using the Sanger data. The regular values are the average of the middle 90% of the repeat element while the irregular values are the minimum value within the middle 80% of the repeat element. The number of BLAST hits, including those less than half the length of the repeat element, found within the whole Bovine genome (build AAFCO2). Page 3 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 repeated sequences that do not retro-transpose together, pared to a comprehensive database (NCBI BLASTX). None and therefore should be broken up into their retro-trans- of the results were similar to the possible translations of a posable component parts. RECON, the software for iden- LINE. tification of REs described by Bao and Eddy, handles all three of these difficulties [10]. Counting repeat elements is challenging Because of the degeneracy of repetitive elements it is diffi- Our approach was intentionally a bit more simplistic. We cult to arrive at an accurate count in the target genome. were able to create a much more parsimonious set of RE Another difficulty in the quantification of repeat elements than what we were able to generate with RECON. Whereas is that REs are often composed of smaller repeat units that RECON intends to recreate the full repeat elements in the occur more frequently than the larger unit [21,22]. way that will make for the best possible additions to the RepeatMasker database, as well as aid in the study of the To characterize the prevalence of MPREs, we went to an evolutionary history of the repeat elements, our goal was independent data set, the Sanger BAC-ends from the to mask out the most commonly repeated regions of the CHORI-242 library archived at Ensembl [19]. Table 1 lists porcine genome. The technique we found most useful in three different measures of prevalence of MPRE within refining the definitions of the MPREs was to plot the fre- these BAC-ends. The first measure (BLAST hits to BAC- quency of BLAST hits as a function of position within the ends) gives the number of hits that were at least half the sequence of the putative repeat elements. From the criteria length of the repeat element. An issue here is the typical used to define them, the number of hits was at least 10 size of the traces – an average of 1219 bp. The longer REs across the whole sequence – but many showed a much will tend to be under-counted due to edge effects in the higher hit frequency along part of their lengths. For pur- trace archive. The next two measures of count number poses of comparison, we applied RECON to our pair-wise were calculated by plotting the number of BLAST hits as a alignments from Step 2. RECON divided the 1,334,953 function of position on the RE. Some of the resulting plots BLAST hits into 29,631 potential repeat elements that were smooth and flat across most of the RE with an were then grouped into 14,067 families. Only 249 of expected drop-off near each end. For these "regular" plots these families had 10 or more elements. Note that it is the count number was the average value of the middle possible for a family containing only one element to cor- 90% of the plot amplitude. Other plots varied quite a bit respond to many BLAST hits. Rather than continue with in amplitude across the RE. This was likely due to sub- so many families, we found that our method yielded a repeats that hit in areas of the genome that the whole more parsimonious classification of moderately repetitive repeat did not. During this measure of count number elements. One difference between the two methods was there was no lower limit to the size of the hit other than that our method required a minimum copy number prior that needed to get the expectation value below 0.1. These to the formation of families of repeat elements. were considered irregular and the algorithm for determin- ing their count number was to take the smallest value on The MPREs have no clear connections to known proteins. the plot after ignoring the first and last 10% of the plot. A The NCBI BLASTX results for these sequences were typi- few plots were only mildly irregular, and for those both cally a combination of description-less accessions and the regular and irregular algorithms were used with both unrelated proteins in a variety of organisms. That numbers reported in Table 1. remained true when the dataset was compared to the TIGR gene index for Sus scrofa [20]. Comparing the novel repeat element content across genomes The novel repeat elements were compared to known types The sequences of novel porcine repetitive elements listed of repeats – SINEs, LINEs and LTRs – and did not fit the here were compared (BLAST [25]) to a recent build of the definitions for those classes of repeat elements. Because complete cow genome (AAFCO2 from [26]) as well as RepeatMasker would mask out low-complexity regions, against the mouse and human genomes. In the case of the methods used here would not initially find the tail mouse, there were no significant similarities found. The ends of LTRs. Each MPRE was tested for nearby low-com- comparison to the human genome yielded only one sig- plexity regions and none were consistently found. One of nificant hit – a 37 bp long section of MPRE17 (870 bp the characteristics of SINEs is the presence of tRNA coding long) matched once in chromosome 9 thousands of bp sequence in their 5 prime regions [23,24]. Only MPRE58 away from any annotated features. The comparison to the had a region similar to tRNA, and that was in the middle cow genome yielded a variety of results. Five of the 27 of its sequence. LINEs are best characterized by their two MPREs did not hit at all (MPREs 6, 22, 28, 50 and 60), and ORFs – one coding for a reverse transcriptase and the three others (MPREs 44, 49 and 61) had ten or fewer hits other for a protein with RNA binding activity [6]. All the (Table 1), despite the fact that the cow genome contains MPREs were translated to potential proteins and com- ten times more sequence than the collection of porcine Page 4 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 BAC-ends tested. Fourteen of the 27 MPREs appeared fre- resented by the sum of horizontal distances that one must quently in cow as well as pig, as indicated by having at travel along the tree to connect the two sequences. The least 1000 BLAST hits to the cow genome. leftmost part of that path represents a common ancestor. It is not surprising that the two sequences in question Not surprisingly, the bovine hits tend to be shorter than have individually diverged a significant amount from the the porcine hits because the MPREs were defined from pig original sequence of the common ancestor at that locus. sequence and as such would be expected to be more intact The more surprising result is that some of the pig and cow in porcine. What is interesting is that in both species the sequences are more similar to each other than the endpoints of the hits have a strong tendency to line up to sequences at the oldest loci. Coincidental convergence is particular spots in the MPRE, as shown in Figure 1 using an unlikely possibility. A more likely explanation is that MPRE12, 15, 17, 41, 51 and 58 as examples. Sometimes enough copies of the old sequence were created that some the common endpoints are the same in both species, of them experienced much less mutation than the sometimes not. This could be a result of the repeat ele- diverged sequences at the ancestral locus. The most recent ments being comprised of smaller repeat elements, not all common ancestors (MRCA) occurred in a narrow window of which have the same frequency of occurrence in either of time (evolutionary) relative to the full extent of the tree genome. The longer MPREs often had more than one sub- (< 1/5 of the distance from the root to the most peripheral region with multiple extra hits. This, too, could be evi- branch). The MRCA among the orthologous sites occurred dence of internal repeat structure. within the same time frame as the other MRCA. The tree clearly shows considerable radiation following speciation Figure 2 shows that MPRE55 occurs in both swine and cat- as evidenced by large genetic distances from MRCA to tle in orthologous loci. The pig BAC lies along the x-axis, peripheral tips. and the cow BAC lies along the y-axis. Also plotted are line A closer look at MPRE61 segments of high similarity between the two BACs. The preponderance of these segments demonstrates little Allelic differences or SNP can be identified from cases genomic rearrangement between species, which indicates where MPRE61 sites coincide with overlaps among CVSI that these are orthologous regions of likely common BACs. MPRE61 sites coincide with 3 pairs of overlapping ancestry between the two species. This region is highly BACs, 1 (AC145413 and AC144901), 2 (AC139879, similar to the human contig NT_005403.16 and the locus AC140099) and 3 (AC146932 and AC087424). In addi- of MPRE55 corresponds to the 3' UTR of the model gene tion, an MPRE61 site coincided with a group of 3 overlap- LOC643405, which codes for a protein similar to TGF- ping BACs, including AC138784, AC138788 and beta induced apoptosis protein 2. AC138786. Overlapping BAC pair 2 had two single base differences, and pair 3 had 3 single base differences and Because the collection of BACs spans only 1% of the one 43 bp insertion/deletion. No sequence differences whole pig or cow genome, we cannot rule out the possi- were observed within MPRE61 for pair 1 or the group of 3 bility that all of the MPREs have at least one orthologous overlapping BACs. location in both species. The fact that 12 MPREs did not have blast hits in any of the cow BACs makes it seem likely To put the apparent allelic diversity rates into context, we that those 12 are relatively recent evolutionary occur- examined the genetic sources of the DNA used to con- rences. Of the 10 MPREs that appear most frequently in struct the BAC library (RPCI-44). The source of DNA for the cow, only two, MPRE55 and MPRE59, were observed RPCI-44 was a pooled sample with equal contributions to appear in orthologous locations among the tested set of from 4 male crossbred pigs each comprised of 3/8 Lan- fully-sequenced BACs. drace, 3/8 Yorkshire and ¼ Meishan [28]. The probability of identifying SNP increases with the diversity of genomes A phylogenetic analysis was performed on the different sampled. For the cases of 2 overlapping BACs, the proba- integration sites of MPRE55 from both the cow and pig bility of sampling different genomes is 87.5%, different BAC libraries using ClustalX (see Additional file 2 for the breeds is 65.7%, and one BAC of western (Landrace or sequences), and the output (Additional file 3) was then Yorkshire) origin and the other of Meishan origin is input into R [27] to create Figure 3. The sequences that 37.5%. The probability of sampling diverse genomes is occurred at orthologous locations in swine and cattle are higher for the case of 3 overlapping BACs. The probability highlighted. As expected, the pig branches and cow of sampling more than one genome is 98.4%, more than branches tend to be separate. It is notable that the most one breed is 87.9%, and at least one BAC of western origin similar sequences that occur in both species do not come combined with one BAC of Meishan origin is 56.25%. The from orthologous locations, but seem to be found in loci fact that we didn't observe SNP in one of the three pairs of that originated after the cow and pig ancestral lines overlapping BACs is not that unusual given that the prob- diverged. The evolutionary distance between them is rep- ability of sampling identical genomes with at least one of Page 5 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 Distribution Figure 1 of BLAST hits to cow and pig DNA across selected MPREs Distribution of BLAST hits to cow and pig DNA across selected MPREs. BLAST hits plotted across MPREs 12, 15, 17, 42, 51 and 58. Along the abscissa lies each MPRE sequence and stacked above are the corresponding hits to the cow genome in blue and to pig BAC-ends in red. The hits are ordered from the top down by length. Page 6 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 the 3 pairs of overlapping BACs is 33% (1-.875 ). On the other hand, the fact that we did not observe SNP within the group of 3 overlapping BACs given the relatively high probabilities of diverse genomes being sampled is unex- pected. To bolster the relatively small number of distinct MPRE61 loci (7) identified in the CVSI BACs, we further investi- gated the prevalence and diversity of MPRE61 by cloning and sequencing PCR amplification products derived from 16 pigs sampled from 10 breeds (Berkshire, Chester White, Duroc, Hampshire, Landrace, Meishan, Pietrain, Poland China, Spot, and Yorkshire). We used primers designed to match the highly conserved parts of MPRE61 to amplify and clone (see Methods for details) multiple and variable loci for the RE that are differentiable by size as well as sequence. The different breeds showed indistin- 120 130 guishable smears on denaturing PAGE gels including many different sizes. Too many fragments and too many Swine Position, Kb sizes were present to identify allelic differences in sizes among animals. The PCR products were sequenced to yield 91 reads that were not bacterial or vector contamina- tion. The 91 sequences (listed as a fasta file in Additional file 4) were analyzed with Clustal X (creating a dendro- gram file, Additional file 5) and displayed in Figure 4 as a phylogenetic tree. The topology of the tree (number of diverse nodes) is consistent with the estimated copy number of 300 sites in the whole genome given in Table 1. We speculate that the more similar sequences repre- sented as tips close (with few sequence differences) to their common ancestor are probably allelic differences at the same locus. On the other hand, the more diverse tips and peripheral nodes probably represent different sites or loci. The amount of sequence diversity presented in Figure 4 supports the idea that individual integration sites (loci) and alleles of repetitive elements can be uniquely identi- 80 120 160 fied by high-throughput array based assays by hybridizing samples to short probes. This demonstrates that repetitive Swine Position, Kb DNA with similar properties to MPRE61 (i.e., prevalence and diversity) can be harnessed for genetic and physical mapping [29]. This dispels the long standing myth that (a a Figure 2 nd b) – MPRE55 in homologous positions in pig and cow (a and b) – MPRE55 in homologous positions in pig repetitive DNA should always be avoided because it is and cow. MPRE55 exists in homologous positions in pig and intractable. Our results indicate that some classes (low to cow. Along the horizontal axis lies the pig BAC with acces- intermediate copy number and highly diverse) of repeti- sion number AC147198. Along the vertical axis lies the cow tive DNA would be tractable with high-throughput tech- BAC with accession number AC138165. The numerous line nologies. segments are BLAST hits between the two BACS that have bit scores of at least 100. Dashed lines are drawn through MPRE61 size differences are not randomly distributed the positions on the BACs where MPRE55 is located. The throughout the phylogenetic tree. Different sizes cluster circle indicates the region containing MPRE55 that is on different branches of the tree; however, the clustering expanded and shown in Figure 2(b). is not complete. This indicates that insertions and dele- tions (evolutionary events that cause size differences) occurred throughout the evolution of MPRE61, and in some cases while the element was still replicating. The Page 7 of 12 (page number not for citation purposes) Cow Position, Kb Cow Position, Kb 040 80 45 50 55 BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 Swine Cow OMRCA ● ● MRCA ● ● ● ● Phylog Figure 3 eny of MPRE55 in pig and cow Phylogeny of MPRE55 in pig and cow. The phylogram displays the BLAST hits obtained from querying MPRE55 against the fully sequenced BAC libraries for pig and cow. The red dots indicate examples of MPRE55 from the cow and the black dots indicate pig examples. The orthologous sites depicted in Figure 2 are noted by the grey dashed lines and the word "ortho- logues." Also shown are the Most Recent Common Ancestors (MRCA) between species in green and, in blue, the MRCA for the 2 orthologues (OMRCA). In both cases the BACs covered about 1% of the total genome. The MRCA lie within a relatively narrow band of time consistent with a single speciation event and there appears to be considerable radiation among elements following speciation (i.e., time frame spanning MRCA). incomplete clustering of sizes indicates evolutionary plas- in the RepeatMasker library. Several other examples ticity and as a result recurrent insertions and deletions. existed of PRE1 next to a section of MPRE61, but the trace end occurred next to the PRE1, so that it may or may not MPRE61 was further characterized by plotting BLAST hits have had the continuing section of MPRE61 on its other of it to the 275,595 sequences in the trace archive of BAC- side. No other REs were found to be incorporated into ends submitted by Sanger. These were plotted along with MPRE61, suggesting that MPRE61 replicated relatively the repeat elements recognized by RepeatMasker. The recently. Another interesting observation was that the most interesting observations included the fact that three density of REs on the 3' side of MPRE61 was much higher times among the 140 hits a PRE1 was incorporated into than on the 5' side. To take a closer look at this, we col- MPRE61. PRE1 is a porcine specific SINE that is included lected the trace sequence 3' of the 62 hits that ended near Page 8 of 12 (page number not for citation purposes) Orthologues BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 Size Range, bp ● ● (418,492] (492,519] (519,594] (594,677] ● ● ● (677,704] Diver Figure 4 sity of MPRE61 across ten breeds of pig Diversity of MPRE61 across ten breeds of pig. This phylogram displays the variety of sequences obtained by amplifying MPRE61 in 16 DNA samples from ten breeds of pig. Size differences are highlighted using colored dots according to the legend. Size cut-offs were chosen to lie between modes of the size distribution which were well separated. (within 60 bp of) the 3' end of MPRE61 (length of 1059 side of MPRE61, particularly in the region closest to the bp). This flanking sequence, ranging in length from 12 to end of MPRE61. For the 22 LINEs that occur within 80 bp 1368 nucleotides, was analyzed for repeat content and of the end of MPRE61, 15 are oriented on the opposite distance of that content from the end of MPRE61 (Figure strand and 7 on the same strand. At this point, there is no 5). Running RepeatMasker on the entire collection way to know which strand of MPRE61 might be tran- (275,595 sequences) of Sanger BAC-ends shows that the scribed. We arbitrarily chose one of the strands and used number of SINE elements is 75% greater than the number it consistently. Because the LINEs have a particular inter- of LINE elements (203,206 vs. 116,107). The LINEs tend nal structure, the 5' and 3' ends are well defined. So to be longer than the SINEs, so the total percentage of another way of looking at the result would be to say that sequence occupied by the LINEs is actually larger (13.29% the LINEs occur on the 5' end of MPRE61 (or rather, its vs. 10.29%). The most obvious feature of Figure 5 is that reverse complement) with 15 on the same strand and 7 on LINEs are significantly over represented on the 3 prime the opposite strand. Either way, there is less strand conser- Page 9 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 Repeat elem Figure 5 ents that flank the 3' end of MPRE61 Repeat elements that flank the 3' end of MPRE61. The repeat content of 62 BAC-end sequences flanking the 3' end of MPRE61. The origin on the horizontal axis is the last position that matches the 3' end (minimum position within the repeat of 1000 out of the full 1059 bp length) of MPRE61. The 62 flanking sequences are ordered with the longest at the top and the shortest at the bottom. The horizontal position is the distance from the 3' end of the hit to MPRE61. Colored arrows are superimposed on the dotted outline of the flanking sequence to indicate the repeat elements that RepeatMasker found. Page 10 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 vation than would be expected if MPRE61 used LINEs as (positions 278–298 and 953-935 (5' to 3' on opposite a vehicle for either replication or integration. strand) of MPRE61, respectively). PCR was performed in a PTC-225 DNA engine (MJ Research Inc, Watertown, MA) using 0.25 U Hot Star Taq polymerase (Qiagen, Conclusion From our experience, it seems that although some availa- Valencia, CA, USA), 1X of supplied buffer, 1.5 mM MgCl2, ble programs may help with the process of identification 200 µM dNTPs, 0.8 µM each primer, and 100 ng of of REs, a level of judiciousness is also required. The BLAST genomic DNA in 25 µl reactions. The PCR mixture was and phylogenetic analyses are proven to be useful to held at 94°C for 15 min, and cycled 44 times at 94°C for improve the efficacy, particularly when comparisons are 20 sec, held at 57°C annealing temperature for 30 sec and made across species. Discovering the RE in one dataset extension at 72°C for 1.5 min, followed by a final exten- and characterizing their prevalence and diversity in sion at 72°C for 5 min. Five µl of the PCR reaction was another was crucial to our effort. electrophoresed in 1.5% agarose gels to determine quality of amplification and a portion (2–4 µl) was used for clon- Using an approach similar to previously published work ing in pCR4-TOPO vector (Invitrogen, Carlsbad, CA). but modified to fit our specific goals and data, several Plasmid DNA was prepared using standard alkaline lysis repetitive elements were identified in porcine and bovine and PTFE filter plates (Millipore, Bedford, MA) and was genomes that do not exist in mouse or human. These ele- sequenced with T7 primer. ments do not contain signatures of previously identified retrotransposons, but seem to have undergone replication Authors' contributions and mutation. Because these elements are in a lower copy RW performed the bioinformatic analysis and drafted the number than most of the REs that make up mammalian manuscript. DN carried out the molecular genetics studies genomes, they could be exploited in mapping or whole- and helped draft the manuscript. JK designed and coordi- genome association studies. As the porcine genome nated the study and helped draft the manuscript. All sequencing effort progresses, we should know more about authors have read and approved the final manuscript. the distribution, history and possible contribution of these repeats to the genomic architecture in artiodactyls. Additional material The genuine challenge of genome sequencing and assem- Additional file 1 bly would be enhanced with an improved understanding Fasta file of novel porcine repetitive elements. Each definition line of repeat elements and their distributions, especially those includes an accession number along with the start and end positions for repeat elements that are species specific. that repetitive element. Click here for file [http://www.biomedcentral.com/content/supplementary/1471- Methods 2164-7-304-S1.fas] Bioinformatics Two hundred-twenty fully sequenced porcine BACs gener- Additional file 2 ated by the Comparative Vertebrate Sequencing Initiative Fasta file that provides the sequences used to create Figure 3. The def- [30,31] were downloaded from the RPCI-44 clone library, inition lines include the accession number with the start and end positions totaling 36.4 Mb. RepeatMasker [17] masked out 36% of of the sequence. this sequence. All unmasked fragments of sequence that Click here for file were at least 50 bp long were compared (BLAST) to the [http://www.biomedcentral.com/content/supplementary/1471- 2164-7-304-S2.fas] original data set. The BLAST parameters used were those recommended by Korf et al. (2003) for finding repeat ele- Additional file 3 ments, namely -r 1 -q -1 -G 2 -E 2 -W 9 -F "m D" -e 1 for Dendrogram file used to create Figure 3. The format is the standard NCBI-BLAST [32]. The output, which contained output of ClustalX and can be read by various tree viewing and tree mak- 1,334,953 hits, was analyzed using two similar methods. ing software, including R when using the packages sequinr and ape. Each One was to use the RECON software [10] downloaded vertex is labelled consistently with the corresponding fasta file. from its website [33] and the other used separate, original Click here for file [http://www.biomedcentral.com/content/supplementary/1471- PERL scripts that performed several of the same functions 2164-7-304-S3.dnd] included in the RECON package. Additional file 4 PCR and sequencing Fasta file of cloned MPRE61 sequences. The definition line refers to an Primer pairs for amplification of genomic DNA were arbitrary sequence ID generated at USMARC. designed from consensus MPRE61 sequences using Click here for file Primer3 [34]. Primer sequences were 5'-TTTTCCTGTGGT- [http://www.biomedcentral.com/content/supplementary/1471- 2164-7-304-S4.fas] GATTTGTGA-3' and 5'-GGGCGCTGGACTGCTCAAA-3' Page 11 of 12 (page number not for citation purposes) BMC Genomics 2006, 7:304 http://www.biomedcentral.com/1471-2164/7/304 17. Smit AFA, Hubley R, Green P: RepeatMasker Open-3.0 [http:// www.repeatmasker.org]. 1996–2004 Additional file 5 18. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local Dendrogram file of cloned MPRE61 sequences used to construct Fig- alignment search tool. J Mol Biol 1990, 215:403-410. ure 4. The format is the standard output of ClustalX and can be read by 19. The Ensembl archive of swine (Sus scrofa) sequences [ftp:// ftp.ensembl.org/pub/traces/sus_scrofa/fasta/] various tree viewing and tree making software, including R when using 20. The Gene Index Project [http://compbio.dfci.harvard.edu/tgi/] the packages sequinr and ape. Each vertex is labelled consistently with the 21. Pevzner PA, Tang H, Tesler G: De novo repeat classification and corresponding fasta file. fragment assembly. Genome Research 2004, 14:1786-1796. Click here for file 22. Zhi D, Raphael BJ, Price AL, Tang H, Pevzner PA: Identifying repeat [http://www.biomedcentral.com/content/supplementary/1471- domains in large genomes. Genome Biology 2006, 7:R7. 2164-7-304-S5.dnd] 23. Shedlock AM, Okada N: SINE insertions: powerful tools for molecular systematics. BioEssays 2000, 22:148-160. 24. Shimamura M, Abe H, Nikaido M, Ohshima K, Okada N: Geneology of families of SINEs in Cetaceans and Artiodactyls: The pres- GLU ence of a huge superfamily of tRNA -derived families of SINEs. Mol Biol Evol 1999, 16(8):1046-1060. Acknowledgements 25. NCBI BLAST cow sequences [http://www.ncbi.nlm.nih.gov/ The authors wish to thank Sue Hauver for expert technical assistance. Men- genome/seq/BlastGen/BlastGen.cgi?taxid=9913] tion of trade names or commercial products is solely for the purpose of 26. Bovine Genome Project [http://www.hgsc.bcm.tmc.edu/ providing information and does not imply recommendation, endorsement projects/bovine/] 27. The R Project for Statistical Computing [http://www.r- or exclusion of other suitable products by the U.S. Department of Agricul- project.org/] ture. This work was supported by USDA CRIS Project No. 5438-31000- 28. BAC PAC Resources, Children's Hospital Oakland Research 071-00D and 5438-31000-073-00D. Institute (CHORI) [http://bacpac.chori.org/mporcine44.htm] 29. Hafez EE, Ghany AGAA, Zaki EA: LTR- retrotransposons-based molecular markers in cultivated Egyptian cottons G. bar- References badense L. African Journal of Biotechnology 2006, 5:1200-1204. 1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, 30. Comparative Vertebrate Sequencing Initiative [http:// Devon K, Dewar K, Doyle M, FitzHugh W, International Human www.nisc.nih.gov/] Genome Sequencing Consortium, et al.: Initial sequencing and 31. Cooper GM, Stone EA, Asimenos G, NISC Comparative Sequencing analysis of the human genome. Nature 2001, 409:860-921. Program, Green ED, Batzoglou S, Sidow A: Distribution and inten- 2. Han JS, Szak ST, Boeke JD: Transcriptional disruption by the L1 sity of constraint in mammalian genomic sequence. Genome retrotransposon and implications for mammalian transcrip- Res 2005, 15:901-13. tomes. Nature 2004, 429:268-274. 32. Korf I, Yandell M, Bedell J: BLAST. O'Reilly & Associates; 2003:143. 3. Fondon JW III, Garner HR: Molecular origins of rapid and con- 33. RECON software package [http://selab.janelia.org/recon.html] tinuous morphological evolution. PNAS 2004, 34. Rozen S, Skaletsky HJ: Primer3 on the WWW for general users 101(52):18058-18063. and for biologist programmers. Bioinformatics Methods and Proto- 4. Singer MF: SINEs and LINEs: highly repeated short and long cols Methods in Molecular Biology 2000:365-386 [http:// interspersed sequences in mammalian genomes. Cell 1982, frodo.wi.mit.edu/primer3/primer3_code.html]. Humana Press, 28:433-434. Totowa, NJ 5. Singer M, Berg P: Genes and Genomes. University Science Books, Mill Valley, California; 1991. 6. Bennett EA, Coleman LE, Tsui C, Pittart WS, Devine SE: Natural genetic variation caused by transposable elements in humans. Genetics 2004, 168:933-951. 7. Nekrutenko A, Li W-H: Transposable elements are found in a large number of human protein-coding genes. Trends in Genet- ics 2001, 17(11):619-621. 8. Deininger PL, Batzer MA: Mammalian retroelements. Genome Research 2002, 12:1455-1465. 9. Holmes I: Transcendent elements: whole-genome transposon screens and open evolutionary questions. Genome Research 2002, 12:1152-1155. 10. Bao Z, Eddy SR: Automated de novo identification of repeat sequence families in sequenced genomes. Genome Research 2002, 12:1269-1276. 11. Campagna D, Romualdi C, Vitulo N, Del Favero M, Lexa M, Cannata N, Valle G: RAP: a new computer program for de novo identi- fication of repeated sequences in whole genomes. Bioinformat- ics 2005, 21(5):582-588. Publish with Bio Med Central and every 12. Price AL, Jones NC, Pevzner PA: De novo identification of repeat families in large genomes. Bioinformatics 2005, 21(Suppl scientist can read your work free of charge 1):i351-i358. "BioMed Central will be the most significant development for 13. Edgar RC, Myers EW: PILER: identification and classification of disseminating the results of biomedical researc h in our lifetime." genomic repeats. Bioinformatics 2005, 21(Suppl 1):i152-i158. 14. Taneda A: Adplot: detection and visualization of repetitive Sir Paul Nurse, Cancer Research UK patterns in complete genomes. Bioinformatics 2004, 20(5):701-708. Your research papers will be: 15. Caspi A, Pachter L: Identification of transposable elements available free of charge to the entire biomedical community using multiple alignments of related genomes. Genome peer reviewed and published immediately upon acceptance Research 2006, 16:260-270. 16. Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ash- cited in PubMed and archived on PubMed Central burner M, Anxolabehere D: Combined evidence annotation of yours — you keep the copyright transposable elements in genome sequences. PLoS Comp Biol 2005, 1(2):e22. BioMedcentral Submit your manuscript here: http://www.biomedcentral.com/info/publishing_adv.asp Page 12 of 12 (page number not for citation purposes)

Journal

BMC Genomics – Springer Journals

Published: Dec 1, 2006

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Novel porcine repetitive elements

Novel porcine repetitive elements

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Novel porcine repetitive elements

Novel porcine repetitive elements

References (46)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies