Annotation, submission and screening of repetitive elements in Repbase: RepbaseSubmitter and Censor

BMC Bioinformatics 2006, 7:474 http://www.biomedcentral.com/1471-2105/7/474

Abstract

Background: Repbase is a reference database of eukaryotic repetitive DNA, which includes prototypic sequences of repeats and basic information described in annotations. Updating and maintenance of the database requires specialized tools, which we have created and made available for use with Repbase, and which may be useful as a template for other curated databases.

Results: We describe the software tools RepbaseSubmitter and Censor, which are designed to facilitate updating and screening the content of Repbase. RepbaseSubmitter is a Java-based interface for formatting and annotating Repbase entries. It eliminates many common formatting errors, and automates actions such as calculation of sequence lengths and composition, thus facilitating curation of Repbase sequences. In addition, it has several features for predicting protein coding regions in sequences; searching and including PubMed references in Repbase entries; and searching the NCBI taxonomy database for correct inclusion of species information and taxonomic position. Censor is a tool to rapidly identify repetitive elements by comparison to known repeats. It uses WU-BLAST for speed and sensitivity, and can conduct DNA-DNA, DNA-protein, or translated DNA-translated DNA searches of genomic sequence. Defragmented output includes a map of repeats present in the query sequence, with the options to report masked query sequence(s), repeat sequences found in the query, and alignments.

Conclusion: Censor and RepbaseSubmitter are available as both web-based services and downloadable versions. They can be found at http://www.girinst.org/repbase/submission.html (RepbaseSubmitter) and http://www.girinst.org/censor/index.php (Censor).

Background

Repbase is the most widely used database of transposable elements, with ~5800 entries as of October 2006, representing over 40 superfamilies of DNA transposons, LTR and non-LTR retrotransposons, and endogenous retroviruses [1]. The current version of Repbase is based on a flexible and extensible relational database schema implemented in MySQL.

Ongoing large-scale sequencing of eukaryotic genomes has resulted in a rapid increase in the rate at which new transposable elements are discovered. Rather than relying on error-prone automated processing, the philosophy behind Repbase has been to incorporate a significant amount of manual curation into the database. However, the increasing number of sequences to be annotated and entered led us to develop a standardized submission interface that external users can use to provide information on their sequences, with a minimum of subsequent reformatting being necessary.

Repbase is primarily used for screening and annotation of genomic DNA. Censor was the first program for Repbase-based repeat detection and masking, originally released in 1994 and later published [2]. Its major drawback was an inefficient implementation of the Smith-Waterman algorithm; as a result, the publicly accessible Censor server ran exclusively on specialized Paracel hardware. In the meantime, other programs, notably RepeatMasker [3,4] and blaster [5], became available. RepeatMasker uses a customized version of the Repbase library that can sometimes differ significantly from the original Repbase submission. Furthermore, Censor can be used to search DNA sequences against a library of proteins, or translated nucleotide sequences.

Manual curation of databases has both advantages and drawbacks compared to automated processing. Automatic annotation has the significant advantages of much higher potential throughput, freedom from user error, and elimination of unintended bias in the processing. On the other hand, it is hard to anticipate every contingency in, for example, correct reconstruction of consensus sequences. A particular problem with automated reconstruction of transposable elements is over-fragmentation, where algorithms do not correctly assemble related parts of an element into a complete consensus. The principal source of mistakes in manual curation is user error in entering data. All complex data, such as taxonomy, literature references and transposable element classifications, are potentially problematic, since simple misspellings can render a database entry unretrievable by exact string-based searches.

For these reasons, we have chosen to adopt a hybrid approach: keeping the positive aspects of manual curation, while attempting to eliminate the most common sources of user-supplied errors by automating the import and annotation of complex but well-defined information, including taxonomic information, referencing, etc. The purpose of RepbaseSubmitter is to provide an easy-to-use interface that permits flexibility in annotation, while at the same time reducing the scope for mistakes in the manual curation process.

Implementation

RepbaseSubmitter

RepbaseSubmitter is implemented in Java (requires Java Virtual Machine version 1.5 or above). The interface is structured around six data entry pages, together with an initialization page for creating a new entry, and a final submission page for performing checks and submitting to Repbase. New entries are not directly entered into Repbase, but are submitted to a review database for editorial approval and additional curation.

Censor

The new version of Censor described here uses an unaltered version of Repbase (as well as user-supplied libraries if desired), and is composed of Perl and C++ modules for identification of both interspersed and tandem repeats using similarity searches. Censor analyzes DNA/RNA sequences for repeats and provides a description of repetitive elements from their Repbase Update annotation [1]. Censor can use either WU-BLAST or NCBI BLAST as its search engine, and can perform direct NA-NA searches as well as any combination of protein-NA or translated NA searches using the appropriate BLAST modules.

The downloadable version of Censor can be installed on virtually any UNIX system (including Mac OS X) that has Perl, a C++ compiler, and WU-BLAST or NCBI BLAST installed. It can also utilize symmetric multiprocessor machines. Censor uses BLAST to detect similarity between repeat libraries and nucleic acid (NA) or protein sequences. For simple NA-NA searches, BLASTN is used (the default). For more sensitive detection of distantly related protein coding sequences, a six-frame search (using TBLASTX) or protein-NA search (TBLASTN) is available. Censor offers three sensitivity modes: normal, rough and sensitive, each offering a different balance of sensitivity and speed. The difference in performance is determined by the BLAST search parameters (see Additional file 1). Censor automatically determines the type (NA or protein) of input sequences by calculating base composition, and calls the appropriate BLAST program, although this behavior can be overridden as described below.

Censor relies on some standard UNIX system commands; for that reason a Unix/Linux operating system is required. Censor requires Perl, which is standard on most UNIX systems. WU-BLAST (recommended) or NCBI BLAST is required to perform searches. If the BLAST installation directory is on the user's path, the configuration script will automatically detect it and assign the corresponding variables. Otherwise, users must manually edit the header of Censor's main script to provide this information. GCC or another C++ compiler, and the "make" utility, are required to build the Censor distribution.

Results and discussion

RepbaseSubmitter

At all stages of data entry using the submission interface, required fields are indicated by boxes highlighted in red. Although the data entry forms can be accessed in any order, if required information is omitted, the program will not allow the user to proceed until it has been entered.

The entry forms of RepbaseSubmitter, and the main information that can be entered through them, are summarized in Table 1. The Initialization (Select) page allows the user to begin creation of a new Repbase sequence by loading data from a pre-existing file, or by starting with a completely blank template. After this initial selection, the Summary data entry page is displayed. The primary fields required for creation of a new entry include a Repbase accession number. The format of accession numbers is not fixed, and is user-defined; however, it must be unique. This Repbase identifier can be considered analogous to a HUGO gene name, rather than an abstract database entity such as a GenBank accession number. There is no currently accepted standard for naming transposable elements. However, this topic was the subject of a recent special working group (Asilomar Conference on "Genomic Impact of Transposable Elements", Asilomar, USA, Mar 31 to Apr 4, 2006). The Summary page also requires a description of the sequence being submitted. Ideally this is a succinct outline of the sequence type and nature, for example "L1-1_MD: a young L1 element from Monodelphis domestica, consensus sequence". A comments section is also available for a more detailed description of the sequence, and is not limited in scope. Examples of such information might include the number of copies of the sequence in a genome; the age distribution of transposable elements (e.g. the mean similarity of copies to the consensus sequence); the relationship of this sequence to other transposable elements that may be of interest; etc. Finally, it is possible to specify free-form keywords which provide pertinent information specific to this sequence. Repbase entries can be searched by keyword, so a user may wish to specify information such as characterization of protein coding domains present in the sequence (e.g. reverse transcriptase, endonuclease). The keyword field is also used internally by Repbase to indicate links to corresponding RepeatMasker library entries. The Summary page also notes the IP address of the computer submitting the data to Repbase; this is not user-editable.

Table 1: Data entry pages in RepbaseSubmitter.

Data Entry Form    Purpose
Select             Initialization page
Summary            Specification of entry Accession, Keywords, Definition, Comments
Sequence           Entry of sequence, calculation of DNA content and lengths
Organism           Source organism/taxonomy; classification based on current Repbase structure
Protein            Specification of coding regions: prediction of ORFs, annotation on DNA sequence, comments describing protein features/functions
References         Relevant references to primary literature or databases (Repbase or external such as GenBank, EMBL)
Release            Repbase release, relevant database accessions; consensus references
Submission         Display of final version prior to submission, perform final checks, submit to relational database for review

The seven main forms presented by RepbaseSubmitter are listed with their title, and a summary of the information which can be entered.

The Organism entry page ensures that correct taxonomy of entries is maintained, both at the level of species and for classes of repeat element. As the species name is typed, RepbaseSubmitter dynamically searches the NCBI taxonomy tree [6] and lists matching entries. The species can be selected from the list as soon as the correct one appears, or can be typed fully; the more of the species name that is typed, the narrower the list presented. Once a specific species has been selected, the interface pulls the correct taxonomic classification from the NCBI Taxonomy database, and enters this information in the relevant field. In addition, this section of the interface facilitates correct classification of transposable elements. The current classification scheme implemented in Repbase is given in Table 2; however, the scheme is transparently extensible as new superfamilies of transposable element are identified. The status of the sequence as an autonomous or non-autonomous element can also be specified at this point. If non-autonomous, the corresponding mobilizing element may be indicated.

Table 2: Current Repbase schema for transposable element classification.

Major Class                 Superfamilies
DNA transposons             Mariner, hAT, MuDR, EnSpm, piggyBac, P, Merlin, Harbinger, Transib, Novosib, Mirage, Helitron, Polinton, Rehavkus
LTR retrotransposons        Gypsy, Copia, DIRS, BEL
Endogenous retroviruses     ERV1, ERV2, ERV3
Non-LTR retrotransposons    LINE1 (L1), RTE-1, CRE, CR1 (LINE3), I, Jockey, NeSL, R2, R4, Rex1, RandI, Penelope
Caulimoviridae
Simple repeat               Satellites (SAT, MSAT)

Repbase currently recognizes over 40 superfamilies of transposable/repetitive element. The major classes and superfamilies are listed here. The underlying relational database structure of Repbase allows easy addition and modification of the classification scheme, based on currently accepted conventions.

The Sequence entry page is the simplest, and requires only the sequence data to be input. If a DNA or RNA sequence was loaded from file at the initial entry creation page, it will be displayed here. Otherwise, sequence data can be cut-and-pasted into the window. The base count and composition of the sequence are automatically updated and entered. Sequences can also be complemented, if it is determined that the other strand is more appropriate (for example, if it encodes proteins for autonomous elements).

Autonomous transposable elements encode proteins such as transposase, reverse transcriptase, endonucleases, etc. This information is often of interest to researchers using Repbase, and the Proteins interface (shown in Fig. 1) provides a convenient way to identify and annotate open reading frames (ORFs) in the sequence. Multiple proteins can be specified for the same Repbase entry, and therefore it is necessary to supply a unique Repbase protein identifier. One is generated automatically for each ORF added; users may choose to specify their own identifier, but it must be unique in Repbase, and will be checked at the final stage before upload to the review database. A comment field is associated with each protein entry on a sequence. Coordinates of coding regions can be entered manually, and the corresponding region will be translated and entered as the coding sequence. However, a useful feature of the Protein annotation page is the ability to predict ORFs. Upon selecting the "Predict" option on this page, the user is prompted to specify how many ORFs, N, are anticipated. The program will graphically display the N longest ORFs on all strands, along with their corresponding coordinates in the sequence. The user can select an ORF to add to the Repbase entry as a putative protein coding region; in addition, several fragments of ORFs can be merged together as one coding region if it is anticipated that they are part of the same protein. This is generally only recommended if the resulting gaps are small. Finally, an option is provided to truncate a specified coding region to the first occurring methionine.

Figure 1: Protein annotation entry form of RepbaseSubmitter. The protein prediction sub-window is also shown, showing how ORFs can be predicted and merged into a predicted protein for annotation on the nucleotide sequence. The bottom of the main window shows access buttons for each entry page of the program. RepbaseSubmitter is written in Java, and can run on any system with an installed Java Virtual Machine of version 1.5 or above.

An important feature of Repbase is the ability to supply references to appropriate scientific literature, or to other Repbase entries and other databases. The submission interface facilitates both types of referencing. References to scientific literature can be added manually, i.e. by supplying authors, title, journal etc. in the normal manner; however, in this case entries are not automatically verified in any way. As an alternative, RepbaseSubmitter provides an "Import" option on the References entry page. This allows users to specify partial information such as author names, article title and journal name, and then search the NCBI PubMed database [7]. A list of matching references is returned, and multiple selections can be made from this list and included in the Repbase entry. In this way, references to literature will correspond exactly to how they appear in PubMed, which can substantially reduce errors due to mistyping of reference information. In some cases, a particular reference may only apply to part of a sequence. This is often true if the sequence currently being entered is an extension of a previously existing partial Repbase entry, or if the element being annotated combines information that has been reported fragmentarily in multiple locations. A reference may also be to another database such as GenBank or EMBL, or to another Repbase entry. In this case, the user needs to supply the author information manually. If the creation of this Repbase entry represents new work, the user will generally want to supply a title, and submit it to Repbase Reports. Entries already described in another publication should be directed to Repbase Update. Repbase Reports provides a medium for publication of novel transposable elements in an online journal form, so that the work may be referred to in other publications. Finally, the Reference page provides an option for "Free Text" references, for those cases which do not correspond to traditional journal references, or links to those databases specifically recognized by Repbase.

The Release and Accessions page summarizes the information supplied on the References page, primarily to allow selection of a primary reference for sequences which are consensus sequences. Additionally, it is possible to specify a "creation date" for this Repbase entry (generally the current date), and a "last update", which will be the same as the creation date for a new sequence, but may be different if this is a refinement of a pre-existing Repbase element. This section is also the appropriate place to specify accession number(s) linking to other databases (GenBank etc.); one accession number will be the primary accession for the sequence.

The last screen of the submission interface is for actual submission to the Repbase review database. The database entry as it will appear in native Repbase (EMBL) format is displayed, and may be saved to a file. Upon selecting "submit", the entry is checked for correct formatting and for basic consistency, such as a unique Repbase accession and sequence information, and is then entered into the MySQL database for approval.
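The "Predict" option described for the Protein page (display the N longest ORFs on all strands, with coordinates) can be illustrated with a minimal sketch. This is not RepbaseSubmitter's code: the paper does not say how the program defines an ORF, so the sketch assumes an ORF is any stop-free stretch in one of the six reading frames, uses the standard stop codons, and the function names (`reverse_complement`, `open_frames`, `longest_orfs`) are hypothetical.

```python
# Assumption: standard genetic code stop codons; an "ORF" here is a stop-free
# stretch in one reading frame (the paper does not define ORF for the tool).
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def open_frames(seq, strand):
    """Yield (length, strand, frame, start, end) for every stop-free stretch.

    Coordinates are 0-based, half-open, and relative to the strand scanned.
    """
    for frame in range(3):
        start = frame
        for pos in range(frame, len(seq) - 2, 3):
            if seq[pos:pos + 3] in STOP_CODONS:
                if pos > start:
                    yield (pos - start, strand, frame, start, pos)
                start = pos + 3
        # Trailing stretch that runs up to the last complete codon.
        end = start + ((len(seq) - start) // 3) * 3
        if end > start:
            yield (end - start, strand, frame, start, end)


def longest_orfs(seq, n):
    """Return the n longest stop-free stretches over both strands."""
    seq = seq.upper()
    candidates = list(open_frames(seq, "+"))
    candidates += list(open_frames(reverse_complement(seq), "-"))
    return sorted(candidates, key=lambda orf: orf[0], reverse=True)[:n]


if __name__ == "__main__":
    for orf in longest_orfs("TAAATGGCCGCCGCCTAG", 3):
        print(orf)
```

Truncating a selected region to its first methionine, as the interface allows, would amount to moving `start` forward to the first in-frame ATG; merging fragments is left out because the paper gives no rule for it.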
Censor

Pre-processing of data

Before performing each search, input data is checked and formatted. Censor automatically chops long sequences into smaller fragments to reduce BLAST memory requirements and to facilitate splitting of jobs on multiple-processor machines. Base composition is calculated for query and database sequences, and based on the total percentage of ATCGN bases, Censor decides whether each sequence is nucleotide or protein. This information is used in automatic selection of the BLAST search program: BLASTN, BLASTP, BLASTX or TBLASTN. In order to run a translated versus translated search of nucleotide against nucleotide sequences, TBLASTX must be specified as the search program (otherwise BLASTN is used). By default, simple tandem repeats are masked using filter modules prior to similarity searching, to prevent false hits. Two approaches are available for dealing with simple repeats. The built-in BLAST filters, SEG and DUST, can be applied in initial sequence processing. However, this prevents identification of simple repeats in the Censor output. Another approach is to mask them by first BLASTing the query sequence against a library of simple repeats, which is included with the Censor distribution. In this case simple repeats will be reported in the program's output. Both filtering functions can be disabled if required, but this is not recommended, since it can lead to a significant proportion of false hits between the query sequence and simple repeats that are internal parts of repetitive elements curated in Repbase. However, disabling annotation of simple repeats can lead to a significant decrease in overall processing time.

Similarity searching

In the main search phase, Censor uses BLAST to compare the input sequence to annotated repetitive elements in Repbase, or a custom user-supplied library. There are two separately developed and maintained versions of BLAST available: WU-BLAST, copyrighted and maintained by Washington University [8], and a free version developed by NCBI [9]. Both versions have their advantages and disadvantages. WU-BLAST is faster than NCBI BLAST, and has more options, making it very flexible. However, WU-BLAST requires a license for both commercial and academic users (academic users can obtain one online), while NCBI BLAST is free for all users. As a result, we created two versions of standalone Censor, with parameters optimized for each version of BLAST. A web-based Censor server is also available, which uses WU-BLAST only. The default WU-BLAST parameters for Censor's "normal", "sensitive", and "rough" modes are described in the Supplementary Material (see Additional file 1). In addition, all BLAST parameters can be overridden on the command line of standalone Censor. The query sequence is scanned against each library of repeats specified using Censor's "-lib" option, in the order in which they are listed. After processing each library, detected repeats are masked out from the query sequence before comparison to the next library.

Post-processing and output

Censor performs post-processing of BLAST output by removing overlaps and defragmenting detected repeats. The program reports positions of repetitive elements in ".map" files. Figure 2 shows an example of a repeat map. Many methods for evaluating the similarity between two or more homologous sequences exist [10-12]. In the case of transposable elements, even a large indel (insertion or deletion), which corresponds to a single uninterrupted alignment gap, can reflect one event in evolution (transpositional insertion or excision) and should affect the similarity value in the same way regardless of its length. The similarity values output in maps are therefore calculated as follows:

Sim = match_count / (alignment_length - query_gap_length - subject_gap_length + gap_count)

where:
match_count = number of matching base positions in alignment;
alignment_length = length of alignment, i.e. number of matches + number of mismatches + length of gaps;
query_gap_length = total length of alignment gaps on submitted query sequence;
subject_gap_length = total length of alignment gaps on library sequence;
gap_count = number of uninterrupted alignment gaps of any length on either query or subject sequences.

In addition to this measure, the Censor output incorporates an alternative similarity measure, Pos, that is calculated on the basis of positive scores between aligned base pairs. This is typically higher than the previous similarity score, and may be more appropriate for protein alignments. Furthermore, Censor can produce pair-wise alignments of detected repeats using the SWAT algorithm [13]. For these, the similarity reported incorporates an affine gap penalty.

Maps include simple repeats unless the "-nosimple" option was specified. The web-based version of Censor provides a graphical representation of the map in SVG (Scalable Vector Graphics) format, with colour-coding of different repeat types. By default, Censor also produces a ".masked" file containing the original sequence with all detected repeats masked out, and a ".found" file containing the genomic sequence fragments that were detected as matching a known repeat. General information on the query sequence(s) and their repeat content is stored in ".tab" files.

Finally, optional tasks are performed, including classification of repeats into subfamilies based on maximum similarity to consensus sequences. Currently the Censor distribution supports only classification of human Alu subfamilies. However, other repeat families can be classified after an easy setup process that requires a list of consensus sequences and a hierarchy of subfamilies. A complete description of Censor parameters can be found in the program documentation. Details of BLAST parameters for the available sensitivity modes are given in the Supplementary Material (see Additional file 1).

Figure 2: Example of a repeat map, and graphical representation.

Name               From   To    Name             From   To     Dir   Sim      Pos    Score
OR_CBa0028O06.f    1      143   ENSPM2_OS        2893   3035   c     0.9930   0.99   1342
OR_CBa0028O06.f    181    429   RIRE3_LTR        1      250    d     0.8725   0.87   1649
OR_CBa0028O06.f    462    556   TRUNCATOR        2470   2557   c     0.8391   0.84   408
OR_CBa0028O06.f    702    783   TRUNCATOR2_LTR   1248   1323   d     0.8205   0.82   366

Name contains locus names of submitted query sequences (first column) and library sequences (fourth column). Repbase names are hyperlinked to their sequences in web-based Censor. From/To contains beginning/end positions of reported fragments on their corresponding sequence. Dir indicates orientation ('d' for direct, 'c' for complementary) of the repeat fragment. Column Sim contains the similarity between two aligned fragments, calculated as described in the text. Pos is roughly the ratio of positive matches (bases that produce positive scores in the alignment matrix) to alignment length. This ratio is calculated the same way as similarity (see main text), with positive_count instead of match_count. This information is particularly useful for estimating the quality of protein alignments. Score is the alignment score obtained from BLAST.

Conclusion

The resulting new package, RepbaseSubmitter, facilitates and automates many aspects of Repbase entry creation and maintenance. The program performs numerous checks on formatting of entries and consistent entry of certain data fields, and ensures that required data are provided.

Availability and system requirements

Project name: Censor
Project home page: http://www.girinst.org/censor/index.php
Operating system(s): Unix/Linux
Programming language: Perl, C++
License: GPL
Any restrictions to use by non-academics: None

Project name: RepbaseSubmitter
Project home page: http://www.girinst.org/repbase/submission.html
Operating system(s): Any, with Java Virtual Machine 1.5 or above
Programming language: Java
Other requirements: Java 1.5 or higher
License: GPL
Any restrictions to use by non-academics: None

Authors' contributions

OK wrote and developed software for Censor and RepbaseSubmitter. AG helped with debugging and feature addition of both programs, and wrote the manuscript. LH did the initial design and coding of RepbaseSubmitter. JJ directed development of both programs as Principal Investigator. All authors contributed to and approved the final manuscript.

Additional material

Additional file 1: Supplementary Material A. Parameters supplied to WU-BLAST by Censor. [http://www.biomedcentral.com/content/supplementary/1471-2105-7-474-S1.doc]

Acknowledgements

Development of Censor was supported by National Institutes of Health grant 5 P41 LM006252-08. We would like to thank Jolanta Walichiewicz for help with preparing the manuscript.

References

1. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110:462-7.
2. Jurka J, Klonowski P, Dagman V, Pelton P: CENSOR, a program for identification and elimination of repetitive elements from DNA sequences. Comput Chem 1996, 20:119-21.
3. Smit AFA, Hubley R, Green P: 1996-2006 RepeatMasker Open-3.0. 1996 [http://www.repeatmasker.org].
4. Bedell JA, Korf I, Gish W: MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 2000, 16:1040-1041.
5. Quesneville H, Nouaud D, Anxolabehere D: Detection of new transposable element families in Drosophila melanogaster and Anopheles gambiae genomes. J Mol Evol 2003, 57(Suppl 1):S50-9.
6. NCBI Taxonomy Browser 2003 [http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html].
7. NCBI PubMed 2003 [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed].
8. Washington University BLAST Archives 2003 [http://blast.wustl.edu].
9. Reese JT, Pearson WR: Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 2002, 18(11):1500-7.
10. Rivas E: Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinformatics 2005, 6:63.
11. Vingron M, Waterman MS: Sequence alignment and penalty choice. Review of concepts, case studies and implications. J Mol Biol 1994, 235(1):1-12.
12. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
13. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-198.
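The similarity measure Sim defined under "Post-processing and output" can be written out as a short function. The following is only a sketch of that formula, not Censor's implementation: it assumes the alignment is available as two equal-length gapped strings with "-" marking gap positions, and the names `gap_stats` and `censor_similarity` are hypothetical.

```python
def gap_stats(row):
    """Return (total gap length, number of uninterrupted gaps) for one aligned row."""
    total = 0
    runs = 0
    in_gap = False
    for ch in row:
        if ch == "-":
            total += 1
            if not in_gap:
                runs += 1  # a new uninterrupted gap starts here
            in_gap = True
        else:
            in_gap = False
    return total, runs


def censor_similarity(query_row, subject_row):
    """Sim = match_count / (alignment_length - query_gap_length
                            - subject_gap_length + gap_count)."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have equal length")
    alignment_length = len(query_row)
    match_count = sum(
        1 for q, s in zip(query_row, subject_row) if q == s and q != "-"
    )
    query_gap_length, query_gaps = gap_stats(query_row)
    subject_gap_length, subject_gaps = gap_stats(subject_row)
    gap_count = query_gaps + subject_gaps
    denominator = (
        alignment_length - query_gap_length - subject_gap_length + gap_count
    )
    return match_count / denominator


if __name__ == "__main__":
    # The same alignment with a 4-base gap and with a 1-base gap.
    print(censor_similarity("ACGT----ACGT", "ACGTTTTTACGA"))
    print(censor_similarity("ACGT-ACGT", "ACGTTACGA"))
```

Both example alignments contain seven matches, one mismatch and one uninterrupted gap, so they receive the same value; this is the property the text asks for, namely that an indel counts once however long it is.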

BMC Bioinformatics, Oct 25, 2006


Publisher: Unpaywall
ISSN: 1471-2105
DOI: 10.1186/1471-2105-7-474

Abstract

Background: Repbase is a reference database of eukaryotic repetitive DNA, which includes prototypic sequences of repeats and basic information described in annotations. Updating and maintenance of the database requires specialized tools, which we have created and made available for use with Repbase, and which may be useful as a template for other curated databases.

Results: We describe the software tools RepbaseSubmitter and Censor, which are designed to facilitate updating and screening the content of Repbase. RepbaseSubmitter is a Java-based interface for formatting and annotating Repbase entries. It eliminates many common formatting errors, and automates actions such as calculation of sequence lengths and composition, thus facilitating curation of Repbase sequences. In addition, it has several features for predicting protein coding regions in sequences; searching and including Pubmed references in Repbase entries; and searching the NCBI taxonomy database for correct inclusion of species information and taxonomic position. Censor is a tool to rapidly identify repetitive elements by comparison to known repeats. It uses WU-BLAST for speed and sensitivity, and can conduct DNA-DNA, DNA-protein, or translated DNA-translated DNA searches of genomic sequence. Defragmented output includes a map of repeats present in the query sequence, with the options to report masked query sequence(s), repeat sequences found in the query, and alignments.

Conclusion: Censor and RepbaseSubmitter are available as both web-based services and downloadable versions. They can be found at http://www.girinst.org/repbase/submission.html (RepbaseSubmitter) and http://www.girinst.org/censor/index.php (Censor).

BMC Bioinformatics 2006, 7:474 http://www.biomedcentral.com/1471-2105/7/474

Background

Repbase is the most widely used database of transposable elements, with ~5800 entries as of October 2006, representing over 40 superfamilies of DNA transposons, LTR and non-LTR retrotransposons, and endogenous retroviruses [1]. The current version of Repbase is based on a flexible and extensible relational database schema implemented in mySQL.

Ongoing large-scale sequencing of eukaryotic genomes has resulted in a rapid increase in the rate at which new transposable elements are discovered. Rather than relying on error-prone automated processing, the philosophy behind Repbase has been to incorporate a significant amount of manual curation into the database. However, the increasing number of sequences to be annotated and entered led us to develop a standardized submission interface that external users can use to provide information on their sequences, with a minimum of subsequent reformatting being necessary.

Repbase is primarily being used for screening and annotation of genomic DNA. Censor was the first program for Repbase-based repeat detection and masking, originally released in 1994 and later published [2]. Its major drawback was inefficient implementation of the Smith-Waterman algorithm and, therefore, the publicly accessible Censor server ran exclusively on specialized Paracel hardware. In the meantime, other programs, notably RepeatMasker [3,4], and blaster [5] became available. RepeatMasker uses a customized version of the Repbase library that can sometimes have significant differences from the original Repbase submission. Furthermore, Censor can be used to search DNA sequences against a library of proteins, or translated nucleotide sequences.

Manual curation of databases has both advantages and drawbacks compared to automated processing. Automatic annotation has the significant advantage of much higher potential throughput, freedom from user error, and elimination of unintended bias in the processing. On the other hand, it is hard to anticipate every contingency in, for example, correct reconstruction of consensus sequences. A particular problem with automated reconstruction of transposable elements is over-fragmentation, where algorithms do not correctly assemble related parts of an element into a complete consensus. The principal source of mistakes in manual curation is user error in entering data. All complex data such as taxonomy, literature references, transposable element classifications are potentially problematic, since simple misspellings can render a database entry unretrievable based on exact string-based searches.

For these reasons, we have chosen to adopt a hybrid approach: keeping the positive aspects of manual curation, while attempting to eliminate the most common sources of user-supplied errors, by automating the import and annotation of complex, but well-defined information including taxonomic information, referencing, etc. The purpose of RepbaseSubmitter is to provide an easy to use interface that permits flexibility in annotation, while at the same time reducing the scope for mistakes in the manual curation process.

Implementation

RepbaseSubmitter

RepbaseSubmitter is implemented in Java (requires Java Virtual Machine version 1.5 or above). The interface is structured around six data entry pages, together with an initialization page for creating a new entry, and a final submission page for performing checks and submitting to Repbase. New entries are not directly entered into Repbase, but are submitted to a review database for editorial approval and additional curation.

Censor

The new version of Censor described here uses an unaltered version of Repbase (as well as user-supplied libraries if desired), and is composed of Perl and C++ modules for identification of both interspersed and tandem repeats using similarity searches. Censor analyzes DNA/RNA sequences for repeats and provides a description of repetitive elements from their Repbase Update annotation [1]. Censor can optionally use either WU-BLAST or NCBI BLAST as its search engine, and can perform both direct NA-NA searches, as well as any combination of protein-NA or translated NA searches using the appropriate BLAST modules.

The downloadable version of Censor can be installed on virtually any UNIX system (including Mac OSX) with Perl and a C++ compiler, that has WU-BLAST or NCBI BLAST installed. It can also utilize symmetric multiprocessor machines. Censor uses BLAST to detect similarity between repeat libraries and nucleic acid (NA) or protein sequences. For simple NA-NA searches, BLASTN is used (the default). For more sensitive detection of distantly-related protein coding sequences, a six-frame search (using TBLASTX) or protein-NA search (TBLASTN) is available. Censor offers three sensitivity modes: normal, rough and sensitive, each offering a different balance of sensitivity and speed. The difference in performance is determined by the BLAST search parameters (see Additional file 1). Censor automatically determines the type (NA or protein) of input sequences by calculating base composition, and calls the appropriate BLAST program, although this behavior can be overridden as described below.

Censor relies on some standard UNIX system commands. For that reason a Unix/Linux operating system is required. Censor requires Perl to work, which is standard on most UNIX systems. WU-BLAST (recommended) or NCBI-BLAST is required to perform searches. If the BLAST installation directory is on the user's path, the configuration script will automatically detect it and assign corresponding variables. Otherwise, users must manually edit the header of Censor's main script to provide this information. GCC or another C++ compiler, and the "make" utility, are required to build the Censor distribution.

Results and discussion

RepbaseSubmitter

At all stages of data entry using the submission interface, required fields are indicated by boxes highlighted in red. Although the data entry forms can be accessed in any order, if required information is omitted, the program will not allow the user to proceed until it has been entered.

The entry forms of RepbaseSubmitter, and the main information that can be entered through them, are summarized in Table 1. The Initialization (Select) page allows the user to begin creation of a new Repbase sequence by loading data from a pre-existing file, or by starting with a completely blank template. After this initial selection, the Summary data entry page is displayed. The primary fields required for creation of a new entry include a Repbase accession number. The format of accession numbers is not fixed, and is user-defined; however, it must be unique. This Repbase identifier can be considered analogous to a HUGO gene name, rather than an abstract database entity such as a Genbank accession number. There is no currently accepted standard of assigning of names to transposable elements. However, this topic was the subject of a recent special working group (Asilomar Conference on "Genomic Impact of Transposable Elements", Asilomar, USA, Mar 31 – Apr 4, 2006).

The Summary page also requires a description of the sequence being submitted. Ideally this is a succinct outline of the sequence type and nature, for example "L1-1_MD: a young L1 element from Monodelphis domestica – consensus sequence". A comments section is also available for a more detailed description of the sequence, and is not limited in scope. Examples of such information might include number of copies of the sequence in a genome; age distribution of transposable elements (e.g. the mean similarity of copies to the consensus sequence); relationship of this sequence to other transposable elements that may be of interest; etc. Finally, it is possible to specify free-form keywords which provide pertinent information specific to this sequence. Repbase entries can be searched by keyword, so a user may wish to specify information such as characterization of protein coding domains present in the sequence (e.g. reverse transcriptase, endonuclease). The keyword field is also used internally by Repbase to indicate links to corresponding RepeatMasker library entries. The Summary page also notes the IP address of the computer submitting the data to Repbase – this is not user-editable.

Table 1: Data entry pages in RepbaseSubmitter.

Data Entry Form | Purpose
Select | Initialization page
Summary | Specification of entry Accession, Keywords, Definition, Comments
Sequence | Entry of sequence, calculation of DNA content and lengths
Organism | Source organism/taxonomy; classification based on current Repbase structure
Protein | Specification of coding regions: prediction of ORFs, annotation on DNA sequence, comments describing protein features/functions
References | Relevant references to primary literature or databases (Repbase or external such as Genbank, EMBL)
Release | Repbase release, relevant database accessions; consensus references
Submission | Display of final version prior to submission, perform final checks, submit to relational database for review

The seven main forms presented by RepbaseSubmitter are listed with their title, and a summary of the information which can be entered.

The Organism entry page ensures that correct taxonomy of entries is maintained; both at the level of species, and for classes of repeat element. As species name is typed, RepbaseSubmitter dynamically searches the NCBI taxonomy tree [6] and lists matching entries. The species can be selected from the list as soon as the correct one appears, or can be typed fully – the more of the species name that is typed, the narrower the list presented. Once a specific species has been selected, the interface pulls the correct taxonomic classification from the NCBI Taxonomy database, and enters this information in the relevant field. In addition, this section of the interface facilitates correct classification of transposable entries. The current classification scheme implemented in Repbase is given in Table 2, however the scheme is transparently extensible as new superfamilies of transposable element are identified. The status of the sequence as an autonomous or non-autonomous element can also be specified at this point. If non-autonomous, the corresponding mobilizing element may be indicated.

Table 2: Current Repbase schema for transposable element classification.

Major Class | Superfamilies
DNA transposons | Mariner, hAT, MuDR, EnSpm, piggyback, P, Merlin, Harbinger, Transib, Novosib, Mirage, Helitron, Polinton, Rehavkus
LTR retrotransposons | Gypsy, Copia, DIRS, BEL
Endogenous retroviruses | ERV1, ERV2, ERV3
Non-LTR retrotransposons | LINE1 (L1), RTE-1, CRE, CR1 (LINE3), I, Jockey, NeSL, R2, R4, Rex1, RandI, Penelope
Caulimoviridae |
Simple repeat | Satellites (SAT, MSAT)

Repbase currently recognizes over 40 superfamilies of transposable/repetitive element. The major classes and superfamilies are listed here. The underlying relational database structure of Repbase allows easy addition and modification of the classification scheme, based on currently accepted conventions.

The Sequence entry page is the simplest, and requires only the sequence data to be input. If a DNA or RNA sequence was loaded from file at the initial entry creation page, it will be displayed here. Otherwise, sequence data can be cut-and-pasted into the window. The base count and composition of the sequence is automatically updated and entered. Sequences can also be complemented, if it is determined that the other strand is more appropriate (for example, if it encodes proteins for autonomous elements).

Autonomous transposable elements encode proteins such as transposase, reverse transcriptase, endonucleases, etc. This information is often of interest to researchers using Repbase, and the Proteins interface (shown in Fig. 1) provides a convenient way for identification and annotation of open reading frames (ORFs) in the sequence. Multiple proteins can be specified for the same Repbase entry, and therefore it is necessary to supply a unique Repbase protein identifier. One is generated automatically for each ORF added – users may choose to specify their own identifier, but it must be unique in Repbase, and will be checked at the final stage before upload to the review database. A comment field is associated with each protein entry on a sequence. Coordinates of coding regions can be entered manually, and the corresponding region will be translated and entered as the coding sequence. However a useful feature of the Protein annotation page is the ability to predict ORFs. Upon selecting the "Predict" option on this page, the user is prompted to specify how many ORFs, N, are anticipated. The program will graphically display the N longest ORFs on all strands, along with their corresponding coordinates in the sequence. The user can select an ORF to add to the Repbase entry as a putative protein coding region; in addition, several fragments of ORFs can be merged together as one coding region if it is anticipated that they are part of the same protein. This is generally only recommended if resulting gaps are small. Finally, an option is provided to truncate a specified coding region to the first occurring Methionine.

Figure 1: Protein annotation entry form of RepbaseSubmitter. The protein prediction sub-window is also shown, showing how ORFs can be predicted and merged into a predicted protein for annotation on the nucleotide sequence. The bottom of the main window shows access buttons for each entry page of the program. RepbaseSubmitter is written in Java, and can run on any system with an installed Java Virtual Machine of version 1.5 or above.

An important feature of Repbase is the ability to supply references to appropriate scientific literature, or to other Repbase entries and other databases. The submission interface facilitates both types of referencing. References to scientific literature can be added manually i.e. by supplying authors, title, journal etc. in the normal manner; however, in this case entries are not automatically verified in any way. As an alternative, RepbaseSubmitter provides an "Import" option on the References entry page. This allows users to specify partial information such as author names, article title, journal name, and then search the NCBI Pubmed database [7]. A list of matching references is returned, and multiple selections can be made from this list and included in the Repbase entry. In this way, references to literature will correspond exactly to how they appear in Pubmed, which can substantially eliminate errors due to mistyping of reference information. In some cases, a particular reference may only apply to part of a sequence. This is often true if the sequence currently being entered is an extension of a previously-existing partial Repbase entry; or if the element being annotated combines information that has been reported fragmentarily in multiple locations. A reference may also be to another database such as Genbank or EMBL, or to another Repbase entry. In this case, the user needs to supply the author information manually. If the creation of this Repbase entry represents new work, the user will generally want to supply a title, and submit it to Repbase Reports. Entries already described in another publication should be directed to Repbase Update. Repbase Reports provides a medium for publication of novel transposable elements in an online journal form, so that the work may be referred to in other publications. Finally, the Reference page provides an option for "Free Text" references, for those cases which do not correspond to traditional journal references, or links to those databases specifically recognized by Repbase.

The Release and Accessions page summarizes the information supplied on the References page, primarily to allow selection of a primary reference for sequences which are consensi. Additionally, it is possible to specify a "creation date" for this Repbase entry (generally the current date); and a "last update" which will be the same as the creation date for a new sequence, but may be different if this is a refinement of a pre-existing Repbase element. This section is also the appropriate place to specify accession number(s) linking to other databases (Genbank etc.) – one accession number will be the primary accession for the sequence.

The last screen of the submission interface is for actual submission to the Repbase review database. The database entry as it will appear in native Repbase (EMBL) format is displayed, and may be saved to a file. Upon selecting "submit", the entry is checked for correct formatting, and basic consistency such as unique Repbase accession and sequence information; and is then entered into the mySQL database for approval.

Censor

Pre-processing of data

Before performing each search, input data is checked and formatted. Censor automatically chops long sequences into smaller fragments to reduce BLAST memory requirements and to facilitate splitting of jobs on multiple processor machines. Base composition is calculated for query and database sequences, and based on the total percentage of ATCGN bases, Censor decides whether each sequence is nucleotide or protein. This information is used in automatic selection of the BLAST search program – BLASTN, BLASTP, BLASTX or TBLASTN. In order to run a translated versus translated search of nucleotide against nucleotide sequences, TBLASTX must be specified as the search program (otherwise BLASTN is used).

By default, simple tandem repeats are masked using filter modules prior to similarity searching, to prevent false hits. Two approaches are available for dealing with simple repeats. The built-in BLAST filters, SEG and DUST, can be applied in initial sequence processing. However this prevents identification of simple repeats in the Censor output. Another approach is to mask them by first BLASTing the query sequence against a library of simple repeats, which is included with the Censor distribution. In this case simple repeats will be reported in the program's output. Both filtering functions can be disabled if required, but this is not recommended, since it can lead to a significant proportion of false hits between the query sequence and simple repeats that are internal parts of repetitive elements curated in Repbase. However, disabling annotation of simple repeats can lead to a significant decrease in overall processing time.

Similarity searching

In the main search phase, Censor uses BLAST to compare the input sequence to annotated repetitive elements in Repbase, or a custom user-supplied library. There are two separately developed and maintained versions of BLAST available: WU-BLAST, copyrighted and maintained by Washington University [8], and a free version developed by NCBI [9]. Both versions have their advantages and disadvantages. WU-BLAST is faster than NCBI BLAST, and has more options, making it very flexible. However WU-BLAST requires licensing from commercial companies and academic users (this can be done online for the latter), while NCBI BLAST is free for all users. As a result, we created two versions of standalone Censor, with parameters optimized for each version of BLAST. A web-based Censor server is also available, which uses WU-BLAST solely. The default WU-BLAST parameters for Censor's "normal", "sensitive", and "rough" modes are described in the Supplementary Material (see Additional file 1). In addition, all BLAST parameters can be overridden by specification on the command line of standalone Censor. The query sequence is scanned against each library of repeats specified using Censor's "-lib" option, in the order in which they are listed. After processing each library, detected repeats are masked out from the query sequence before comparison to the next library.

Post-processing and output

Censor performs post-processing of BLAST output by removing overlaps and defragmenting detected repeats. The program reports positions of repetitive elements in ".map" files. Figure 2 shows an example of a repeat map. Many methods for evaluating the similarity between two or more homologous sequences exist [10-12]. In the case of transposable elements, even a large indel (insertion or deletion), which corresponds to any uninterrupted alignment gap, can reflect one event in evolution (transpositional insertion or excision) and should impact the value of similarity the same way unrelated to its length. The similarity values output in maps are therefore calculated as follows:

Sim = match_count/(alignment_length - query_gap_length - subject_gap_length + gap_count)

where:

match_count = number of matching base positions in alignment;
alignment_length = length of alignment, i.e. number of matches + number of mismatches + length of gaps;
query_gap_length = total length of alignment gaps on submitted query sequence;
subject_gap_length = total length of alignment gaps on library sequence;
gap_count = number of uninterrupted alignment gaps of any length on either query or subject sequences.

In addition to this measure, the Censor output incorporates an alternative similarity measure Pos, that is calculated on the basis of positive scores between aligned base pairs. This is typically higher than the previous similarity score, and may be more appropriate for protein alignments. Furthermore, Censor can produce pair-wise alignments of detected repeats using the SWAT algorithm [13]. For these, the similarity reported incorporates an affine gap penalty.

Maps include simple repeats unless the "-nosimple" option was specified. The web-based version of Censor provides a graphical representation of the map in SVG (Scalable Vector Graphics) format, with colour-coding of different repeat types. By default, Censor also produces a ".masked" file containing the original sequence with all detected repeats masked out; and a ".found" file containing the genomic sequence fragments that were detected as matching a known repeat. General information on the query sequence(s) and their repeat content is stored in ".tab" files.

[Graphical map of query OR_CBa0028O06.f showing repeats ENSPM2_OS, RIRE3_LTR, TRUNCATOR and TRUNCATOR2_LTR at query positions 1-143, 181-429, 462-556 and 702-783]

Name             From  To   Name            From  To    Dir  Sim     Pos   Score
OR_CBa0028O06.f  1     143  ENSPM2_OS       2893  3035  c    0.9930  0.99  1342
OR_CBa0028O06.f  181   429  RIRE3_LTR       1     250   d    0.8725  0.87  1649
OR_CBa0028O06.f  462   556  TRUNCATOR       2470  2557  c    0.8391  0.84  408
OR_CBa0028O06.f  702   783  TRUNCATOR2_LTR  1248  1323  d    0.8205  0.82  366

Figure 2: Example of a repeat map, and graphical representation. Name contains locus names of submitted query sequences (first column) and library sequences (fourth column). Repbase names are hyperlinked to their sequences in web-based Censor. From/To contains beginning/end positions of reported fragments on their corresponding sequence. Dir indicates orientation ('d' for direct, 'c' for complementary) of repeat fragment. Column Sim contains the similarity between 2 aligned fragments, calculated as described in the text. Pos is roughly the ratio of positive matches (bases that produce positive scores in the alignment matrix) to alignment length. This ratio is calculated the same way as we calculate similarity (see main text), with positive_count instead of match_count. This information is particularly useful for estimating the quality of protein alignments. Score is the alignment score obtained from BLAST.

Finally optional tasks are performed, including classification of repeats into subfamilies based on maximum similarity to consensus sequences. Currently the Censor distribution supports only classification of human ALU subfamilies. However other repeat families can be classified after an easy setup process that requires a list of consensus sequences and a hierarchy of subfamilies. A complete description of Censor parameters can be found in the program documentation. Details of BLAST parameters for the available sensitivity modes are given in the Supplementary Material (see Additional file 1).

Conclusion

The resulting new package, RepbaseSubmitter, facilitates and automates many aspects of Repbase entry creation and maintenance. The program performs numerous checks on formatting of entries, and consistent entry of certain data fields; as well as ensuring that required data are provided.

Availability and system requirements

Project name: Censor
Project home page: http://www.girinst.org/censor/index.php
Operating system(s): Unix/Linux
Programming language: Perl, C++
License: GPL
Any restrictions to use by non-academics: None

Project name: RepbaseSubmitter
Project home page: http://www.girinst.org/repbase/submission.html
Operating system(s): Any, with Java Virtual Machine 1.5 or above
Programming language: Java
Other requirements: Java 1.5 or higher
License: GPL
Any restrictions to use by non-academics: None

Authors' contributions

OK wrote and developed software for Censor and RepbaseSubmitter. AG helped with debugging and feature addition of both programs, and wrote the manuscript. LH did the initial design and coding of RepbaseSubmitter. JJ directed development of both programs as Principal Investigator. All authors contributed to and approved the final manuscript.

Additional material

Additional file 1: Supplementary Material A. Parameters supplied to WU-BLAST by Censor. [http://www.biomedcentral.com/content/supplementary/1471-2105-7-474-S1.doc]

Acknowledgements

Development of Censor was supported by National Institutes of Health grant 5 P41 LM006252-08. We would like to thank Jolanta Walichiewicz for help with preparing the manuscript.

References

1. Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J: Repbase Update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 2005, 110:462-7.
2. Jurka J, Klonowski P, Dagman V, Pelton P: CENSOR – a program for identification and elimination of repetitive elements from DNA sequences. Comput Chem 1996, 20:119-21.
3. Smit AFA, Hubley R, Green P: 1996–2006 RepeatMasker Open-3.0. 1996 [http://www.repeatmasker.org].
4. Bedell JA, Korf I, Gish W: MaskerAid: a performance enhancement to RepeatMasker. Bioinformatics 2000, 16:1040-1041.
5. Quesneville H, Nouaud D, Anxolabehere D: Detection of new transposable element families in Drosophila melanogaster and Anopheles gambiae genomes. J Mol Evol 2003, 57(Suppl 1):S50-9.
6. NCBI Taxonomy Browser 2003 [http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html].
7. NCBI PubMed 2003 [http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?DB=pubmed].
8. Washington University BLAST Archives 2003 [http://blast.wustl.edu].
9. Reese JT, Pearson WR: Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 2002, 18(11):1500-7.
10. Rivas E: Evolutionary models for insertions and deletions in a probabilistic modeling framework. BMC Bioinformatics 2005, 6:63.
11. Vingron M, Waterman MS: Sequence alignment and penalty choice. Review of concepts, case studies and implications. J Mol Biol 1994, 235(1):1-12.
12. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
13. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-198.
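The similarity measure Sim defined in the text (each uninterrupted alignment gap counted as one position, whatever its length) can be illustrated with a short sketch. The Python function below is not Censor's code (Censor is written in Perl and C++); it is a minimal illustration under stated assumptions: the alignment is supplied as two equal-length gapped strings with "-" as the gap character, and adjacent gap runs on different sequences are counted as separate gaps. The name censor_similarity and its parameters are hypothetical.

```python
def censor_similarity(query_aln: str, subject_aln: str, gap: str = "-") -> float:
    """Sim = match_count / (alignment_length - query_gap_length
                            - subject_gap_length + gap_count)."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned strings must have the same length")

    match_count = 0
    query_gap_length = 0      # total gap characters on the query row
    subject_gap_length = 0    # total gap characters on the subject row
    gap_count = 0             # number of uninterrupted gap runs on either row
    previous_state = None     # None (no gap), "q" (query gap) or "s" (subject gap)

    for q, s in zip(query_aln, subject_aln):
        if q == gap and s == gap:
            raise ValueError("column with a gap in both sequences")
        if q == gap:
            state = "q"
            query_gap_length += 1
        elif s == gap:
            state = "s"
            subject_gap_length += 1
        else:
            state = None
            if q.upper() == s.upper():
                match_count += 1
        # A new gap run starts whenever we enter a gap state that differs
        # from the state of the previous column (assumption: a query gap
        # directly followed by a subject gap counts as two gaps).
        if state is not None and state != previous_state:
            gap_count += 1
        previous_state = state

    alignment_length = len(query_aln)
    denominator = (alignment_length - query_gap_length
                   - subject_gap_length + gap_count)
    if denominator == 0:
        raise ValueError("empty alignment")
    return match_count / denominator


if __name__ == "__main__":
    # One 4-base gap on the query is charged as a single position.
    print(round(censor_similarity("ACGT----ACGT", "ACGTTTTTACGA"), 4))
```

In this example the alignment has 12 columns, 7 matches, 1 mismatch and one 4-base gap, so Sim = 7/(12 - 4 - 0 + 1) = 7/9, whereas identity over the full alignment length would be 7/12. This is the intended effect of the formula: a long indel lowers the score no more than a single-base indel does.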

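The pre-processing step described in the text, in which Censor decides from the percentage of ATCGN symbols whether a sequence is nucleotide or protein and then picks a BLAST program, can be sketched as follows. This is an illustration only, not Censor's implementation: the article does not state the cutoff Censor uses, so the 0.9 threshold is an assumed value, and the function names are hypothetical. The mapping of sequence types to programs follows standard BLAST semantics (BLASTX for a nucleotide query against a protein library, TBLASTN for a protein query against a nucleotide library), with TBLASTX used only on explicit request, as the text describes.

```python
NUCLEOTIDE_SYMBOLS = frozenset("ATCGN")


def nucleotide_fraction(sequence: str) -> float:
    """Fraction of alphabetic symbols in the sequence that are A, T, C, G or N."""
    letters = [c for c in sequence.upper() if c.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    return sum(c in NUCLEOTIDE_SYMBOLS for c in letters) / len(letters)


def guess_sequence_type(sequence: str, threshold: float = 0.9) -> str:
    """Return "NA" or "protein". The threshold is an assumption, not Censor's value."""
    return "NA" if nucleotide_fraction(sequence) >= threshold else "protein"


def choose_blast_program(query_type: str, library_type: str,
                         translated: bool = False) -> str:
    """Pick a BLAST program from the query and library sequence types."""
    if translated:
        # Translated-versus-translated search must be requested explicitly
        # and only makes sense for nucleotide against nucleotide.
        if query_type == "NA" and library_type == "NA":
            return "tblastx"
        raise ValueError("tblastx requires nucleotide query and library")
    programs = {
        ("NA", "NA"): "blastn",
        ("protein", "protein"): "blastp",
        ("NA", "protein"): "blastx",
        ("protein", "NA"): "tblastn",
    }
    return programs[(query_type, library_type)]


if __name__ == "__main__":
    query = "ACGTNNACGTACGT"
    library_entry = "MKVLAAGIVGLLLAQ"
    print(choose_blast_program(guess_sequence_type(query),
                               guess_sequence_type(library_entry)))
```

A composition test of this kind is cheap but heuristic: a short protein rich in A, C, G and T residues could be misclassified, which is presumably why the article notes that the automatic choice can be overridden.
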
Journal

BMC Bioinformatics

Published: Oct 25, 2006
