DNA sequence quality trimming and vector removal

Hui-Hsien Chou; Michael H. Holmes

doi:10.1093/bioinformatics/17.12.1093

DNA sequence quality trimming and vector removal

Chou, Hui-Hsien; Holmes, Michael H. 2001-12-01 00:00:00 Vol. 17 no. 12 2001 BIOINFORMATICS Pages 1093–1104 DNA sequence quality trimming and vector removal 1 2 Hui-Hsien Chou and Michael H. Holmes Department of Zoology and Genetics, Department of Computer Science, Iowa State University, Ames, IA 50011, USA and Department of Bioinformatics, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA Received on March 8, 2001; revised on June 11, 2001; accepted on June 12, 2001 ABSTRACT DNA data coming directly from automatic sequencing Motivation: Most sequence comparison methods assume machines often violate that assumption. In fact, most raw that the data being compared are trustworthy, but this DNA sequences have base-call errors. Nevertheless, to is not the case with raw DNA sequences obtained from make good use of these unreliable raw sequences, we still automatic sequencing machines. Nevertheless, sequence need to compare them against some other data, such as comparisons need to be done on them in order to vector and contaminant sequences, in order to prepare remove vector splice sites and contaminants. This step is them for later stages of the genomic data processing necessary before other genomic data processing stages pipeline (e.g. fragment assembly or EST clustering). can be carried out, such as fragment assembly or EST This creates a dilemma. We have uncertain raw DNA clustering. A specialized tool is therefore needed to solve sequences, but we need to compare them against other this apparent dilemma. sequences in order to obtain usable data. This data Results: We have designed and implemented a program cleaning stage is critical for the success of later stages that speciﬁcally addresses the problem. This program, of the genomic data processing pipeline, as illustrated in called LUCY, has been in use since 1998 at The Institute Figure 1. Through this cleaning stage, unreliable raw data for Genomic Research (TIGR). During this period, many from sequencing machines can be enhanced to a more rounds of experience-driven modiﬁcations were made to reliable level, so that later stages of the processing can LUCY to improve its accuracy and its ability to deal with use them without concern about their base quality . Our extremely difﬁcult input cases. We believe we have ﬁnally solution at The Institute for Genomic Research (TIGR) for obtained a useful program which strikes a delicate balance this cleaning stage is a specialized tool that was designed among the many issues involved in the raw sequence to handle this stage of genomic data processing. After cleaning problem, and we wish to share it with the research more than two years of practical use in TIGR and several community. rounds of improvement, we believe it is mature enough Availability: LUCY is available directly from TIGR (http:// now to be introduced to the scientiﬁc community. We hope www.tigr.org/softlab). Academic users can download LUCY other researchers can also beneﬁt from using it. after accepting a free academic use license. Business users may need to pay a license fee to use LUCY for METHODS commercial purposes. This tool, called LUCY, was designed to take the base-call Contact: Questions regarding the quality assessment quality assessment of each base into consideration in the module of LUCY should be directed to Michael Holmes cleaning process, to make sure the processed sequences ([email protected]). Questions regarding other aspects have the best overall quality possible based on their of LUCY should be directed to Hui-Hsien Chou (hh- individual base quality values. Thanks to a few powerful [email protected]). chromatogram base-calling programs like phred (Ewing et al., 1998; Ewing and Green, 1998) and TraceTuner (Paracel, 2000), the quality value for each individual base INTRODUCTION of a raw sequence can be estimated. The individual base There have been a lot of studies on how to align DNA quality assessment is generally reliable, but a sequence sequences with each other. Most of these studies are based can have very different base qualities among its bases. on a fundamental assumption that the sequences being compared are trustworthy, i.e. their bases are correct. † However, other quality issues may still exist, such as the need to remove However, in large genome sequencing centers, the raw chimeric sequences at the assembly stage. c Oxford University Press 2001 1093 H.-H.Chou and M.H.Holmes library fragment base-calling program with the same chromatogram ﬁle, base calling gene finding publication construction assembly LUCY may use it to extend the good quality region by automatic sequence cDNA database aligning the second sequence with the ﬁrst one at places annotation sequencing cleaning clustering construction where they match with high ﬁdelity. The rationale is that if two independent base-calling algorithms agree on some Fig. 1. Major steps of a typical genomic data processing pipeline. bases, chances are these bases are correct even though LUCY is designed to handle the sequence cleaning stage between some of them may have lower quality values. This method the base calling and fragment assembly or clustering stages. works well in processing ABI 377 and 370 chromatograms where the second sequence comes from the ABI base caller and the ﬁrst sequence comes from phred. This sequence method does not work very well on ABI 3700 sequences, CLN CLN probably because the new base-caller on the ABI 3700 quality value cutoff uses an algorithm similar to that used by phred. Generally, Low quality region trimming the step to extend the good quality region can be skipped without sacriﬁcing too much useful data, but potentially it 1st sequence may provide that extra linking leg needed to connect two contigs together during fragment assembly. 2nd sequence After having determined a good quality region, LUCY CLZ CLZ Consensus sequence extension then searches for vector splice sites on the sequence, and makes sure that they are not included in the usable sequence CLV CLV region of the sequence. This step is especially difﬁcult Shotgun Insert vector because most of the vector fragments lie in the low quality region of a sequence, so their existence and true Splice site trimming boundary cannot be easily determined by straightforward sequence alignment methods. What LUCY does to solve V e ctor Insert this difﬁculty is to search for the splice sites ﬁrst at a vector ﬁxed region at the beginning of each sequence, taking into consideration the low quality of the base calls in Contaminant removal this region. If that attempt fails, LUCY will search again adaptively, taking into account the boundaries of the good sequence CLN CLN quality region determined earlier. When searching for CLZ CLZ CLV CLV splice sites, LUCY applies different search stringencies at output Bad region Good sequence region for assembly Bad region different areas of a sequence to cope with the variable CLR CLR Output generation quality of base calls. LUCY also considers many special cases that were found through experience at TIGR. Many incremental improvements have been made to this step Fig. 2. Illustrations of LUCY’s major processing steps. See the main since the early versions of LUCY due to the difﬁcult nature text for explanations. of this problem. We believe that LUCY now functions very well with most input data. Following the splice site trimming step, contaminant LUCY’s task is to identify the largest subsequence that is sequences can be quickly identiﬁed and removed from of sufﬁciently high quality and also free of contaminating the input data. Because all vector splice sites have vector sequence. been trimmed during the previous step, the contaminant LUCY does its job by ﬁrst determining the longest removal step becomes relatively easy. Note that we are continuous high quality region of a sequence based on its concerned only with the quick detection of contaminants, base quality values. Some bases within this good quality not the details of the contaminants. Finally, markers iden- region may be wrong, but the overall value must be greater tifying the usable region of each sequence are generated than a user-speciﬁed minimum value (97.5%) . If a second in the output. Illustrations of LUCY’s major processing base-call sequence can be obtained using a different steps are given in Figure 2. LUCY’s processing steps are generally independent of each other and they produce In this paper, we include the default value of a user-adjustable parameter in parentheses. independent trimming pairs using the CL? tags shown in Discard! DNA sequence quality trimming the ﬁgure . When we refer to the good region in this paper, we mean the current cumulative trimming result after each processing step, i.e. the narrowest region obtained by combining all trimming markers. At the end of LUCY processing, the useful region of a sequence becomes the CLR region as shown in the ﬁgure. Most of LUCY’s parameters can be set by the users using its command line arguments. In the following dis- cussion of each processing step, we represent parameters Fig. 3. The raw data in a chromatogram ﬁle can be viewed as four that inﬂuence a speciﬁc step in the typewriter font. sets of overlapping peaks, one each for the A, C, G and T sequencing We also enclose their default values in the parentheses reactions. after the parameter names. Generally, these default values were determined based on our past experience. Users (a) (b b) can change these settings if they need to adjust LUCY’s (c c) behavior in their own data environment, but it should (d) be done with caution. In the user document distributed with LUCY, we provided additional information to help users determine some parameters for their speciﬁc Fig. 4. LUCY’s quality trimming steps. (a) Low quality areas are environments. trimmed from each end, then (b) regions of poor quality within the Except for the quality region determination step, all sequence are identiﬁed and removed from the clean ranges. (c) The other LUCY steps described below can be skipped if users resulting candidate clean ranges are further trimmed to satisfy the overall average probability of error criterion and the criterion of do not supply the necessary input data for a step. the probability of error at terminal bases. (d) The largest remaining candidate is chosen as the ﬁnal clean range. Quality region determination The goal of this step of LUCY’s processing is to ﬁnd the longest region of each input sequence whose base calls estimated base call error are related by the following are of sufﬁciently high quality to warrant using them. We formula: will refer to this region as the clean range in this section. This step makes use of the data in the quality ﬁles that Q =−10 × log (probability of error). are provided as input. The quality ﬁles are created by the base-calling software before LUCY is run. The base-calling programs base their estimates of the The raw data ﬁles (chromatogram ﬁles) from automated probability of error on a number of factors, including peak shape, spacing between peaks, signal strength and DNA sequencing machines, such as the ABI 377 and 3700 sequencers, contain four sets of data traces, one background noise (Ewing et al., 1998; Ewing and Green, 1998). LUCY ﬁrst converts the data from qualities to each for the A, C, G and T sequencing reactions. The probabilities of error according to the formula above (e.g. temporal relationship among the four traces is known, Q = 0 is converted to probability of error = 1.0). All of because the four reactions are run through the sequencer the major steps in the quality trimming process involve at the same time, in a single capillary or a single lane of a calculating some average probabilities of error within polyacrylamide gel. The traces can be visualized as a set certain windows along the length of the sequence. These of peaks, as shown in Figure 3. steps are illustrated in Figure 4. It is the job of the base-calling software to examine Since the beginning and end of each sequence are the traces and determine the sequence of base calls. In typically of low quality, the ﬁrst step is to remove the addition to calling the bases, the base-calling programs lowest quality data from each end. This step is controlled phred and TraceTuner provide a quality value for each by the bracket parameter. Starting from the left end base called. The quality value is based on the estimated of the sequence, LUCY ﬁnds the ﬁrst window of size probability that the base call is in error. Quality ( Q) and window size (10) that has an average probability of error of max avg error (0.02) or less. Similarly, starting These are historical TIGR database ﬁeld names. Their meanings are CLN: clear of bad quality data, CLZ: clear of zgrasta (an alignment tool used to from the right end, LUCY ﬁnds the last window of align the two base-call sequences before LUCY was available), CLV: clear of size window size meeting the same criterion. These vector fragments, and CLR: clear of all bad things. The contaminant removal windows and the bases between them are then subjected step does not produce trimming tags but sets the trash tag of contaminated sequences, which are represented by CLR left and right being both 0. to the remaining steps, and the bases at the ends of the 1095 H.-H.Chou and M.H.Holmes sequence that failed this step are excluded from further In a small minority of cases, quality trimming may pro- consideration. duce more than one candidate clean range that satisﬁes all The next step is to identify regions of the sequence that of the speciﬁed criteria. Only the largest is kept, however. have unacceptably high error rates, and exclude those This point is discussed further in the Section Discussion. regions from the ﬁnal clean range. This step is controlled Consensus sequence extension by a set of window size and max avg err pairs given to the window parameter . By default, LUCY uses two After good quality region determination based on the window sizes in this step: a 50-base window size1 quality of individual bases of a sequence, an optional step allowing a max avg err1 probability of 0.08, and a is to align the sequence to a second sequence, if available, 10-base window size2 allowing a max avg err2 prob- that is obtained using a different base-calling algorithm ability of 0.3. The purpose of the larger window size1 is with the same chromatogram ﬁle. The rationale is that if to exclude large regions of poor quality, but it may fail to two different base-calling algorithms agree on a speciﬁc exclude smaller regions of very low quality. The smaller base call, then that base call is likely to be correct even if window size2 excludes those smaller regions. its quality conﬁdence value is low. This is an attempt to LUCY starts with the candidate clean range from extend the good quality region by using the result from a the ﬁrst quality trimming step, and subjects it to the second base-calling program to boost the conﬁdence level largest window from the window options. From the assigned by the ﬁrst program. An important requirement beginning of the candidate clean range, LUCY calculates when doing this step is that the second sequence must be the average probability of error in each window of size obtained from a different algorithm. Otherwise, the second window size1. Each such window that has an average sequence can be almost the same as the ﬁrst sequence and error of max avg err1 or less is added to a new candidate a false good region can be extended all the way to both clean range, which keeps growing until some window ends of a sequence. This certainly ruins the purpose of the fails this criterion, then the current candidate clean range previous quality trimming step. is terminated. LUCY continues with the next window, In TIGR, the ﬁrst sequence is obtained from phred and and begins a new candidate clean range with the ﬁrst the second sequence comes from the built-in base-calling subsequent window that passes the same error criterion. So this step starts with a single candidate clean range, but program with the ABI 377 sequencers. Normally, in good may result in several candidate ranges being produced. quality regions, the alignment is almost perfect between Each of the candidate clean ranges produced by consid- the two sequences. But in the lower quality areas, the ering the largest window size is then subjected to the next ABI base-caller tends to call the N base (unknown) while window size (window size2) in the same way, and so on, the phred program almost always attempts to guess one. until all window sizes have been considered. Each of these Therefore, this extension step works reasonably well due steps may produce further fragmentation of the candidate to the different characteristics of the base-callers. Recently clean ranges. However, any candidate clean range that is we have stopped doing this extension step at TIGR. The smaller than the length speciﬁed by the overall minimum reason for this is that we want to be able to state our parameter (100) is eliminated from further consideration, quality trimming criteria in terms of base calling error and in practice the number of sequences which produce probabilities. The sequence extension step makes this multiple candidates of sufﬁcient size is small. impossible. Another reason is that the new base-calling The ﬁnal quality trimming step is controlled software bundled with ABI 3700 seems to utilize similar by the error option, with max avg error and base-calling algorithms as in phred, thus the sequences it max error at ends parameters. The max avg error generates are very similar to the phred output and cause parameter speciﬁes the maximum overall average proba- this extension step to extend too much into the bad quality bility of error allowable in the ﬁnal clean range (0.025), region. However, we retain this sequence extension feature and the max error at ends parameter speciﬁes the in LUCY because it is still conceptually sound, and for maximum probability of error for each of the two bases those who are using ABI 377 seuqencers, this step can at the ends of the clean range (0.02). For each of the sometimes extend the usable sequence length. candidate clean ranges produced by the preceding step, Since both sequences should be very similar in LUCY ﬁnds the largest subsequence that satisﬁes both their good quality regions, instead of running a time- of these criteria, and the largest such subsequence found consuming dynamic programming alignment, LUCY among all the candidates becomes the ﬁnal clean range. attempts to quickly locate the band of alignment by doing Do not confuse these with similar options associated with the bracket the following. It converts the ﬁrst sequence into 16 bp parameter; they are independent of each other. ‡ tags, sorts the tags and removes duplicates from the set. Again, the max avg error here is independent to the other similar options associated with the other parameters. Each tag has an accompanying index linking back to its 1096 DNA sequence quality trimming (a) (b) (c) (d) Fig. 5. Quick afﬁnity region ﬁnding using the depth-ﬁrst search algorithm in LUCY. (a) A partial dot graph showing matched bases between two sequences at the right end of an alignment. This partial dot graph is chosen such that the alignment band is right on the diagonal. (b) The search space (dark grey areas) LUCY explored when conducting the extension effort. The dot graph has been lightened in this and subsequent sub-ﬁgures to facilitate reading. (c) The LUCY search space after a mismatch has been introduced at the third base position in the ﬁgure. (d) The LUCY search space after a deletion is made to the third base position on one sequence. the current search path such that the previous mismatch position in the sequence . Next, the second sequence is sequentially converted into 16 bp tags and searches are can be tolerated; the end of either sequence has been quickly conducted in the ﬁrst sequence tag set to ﬁnd the reached; there are over ﬁve mismatches along the current same tags. If some tags can be found on both sequences, search path; or over 100 bases from the ﬁrst mismatch their indices in the ﬁrst sequence are compared against has been scanned without ﬁnding sufﬁcient matched their indices in the second sequence, and their differences bases to compensate the mismatches. These parameters are recorded in a separate hit pool with associated hit come from the choice of an alignment stringency of 95%. counters. If a difference value already exists in the hit Thus for every 100 bases, there can be no more than ﬁve pool, its hit counter will simply be increased by one. At mismatched bases. the end of the tag comparison processing the difference Figure 5a depicts the dot graph around the right end of value with the highest hit count determines the relative the alignment extension between two base-call sequences offset of the two sequences in their best alignment regions. of a read . The dark line at the upper 1/3 of the diagonal Once the relative offset is known, their central align- marks the end of the good quality region LUCY reports. ment band (supposedly the good quality region) can be Figure 5b reveals the actual search space LUCY explored quickly located. Extension attempts are then carried out when doing the alignment extension. We can see that on both ends of this alignment band. Again, since the LUCY’s exploration is very conﬁned and almost linear two sequences should be mostly similar except in the low in the direction of the alignment diagonal. To test the quality regions, we only need to quickly connect the high search behavior of LUCY under different conditions, we afﬁnity regions; we are not concerned with an optimal intentionally create a mismatch pair at the third base of alignment between the two sequences in the low quality the portion of sequences covered in these dot graphs. regions. Therefore, this can be formulated as a search It does not change the search space at all, as shown in problem using the more efﬁcient depth-ﬁrst search algo- Figure 5c. Because there are more than 20 matches beyond rithm to locate the high afﬁnity regions extending from the this mismatch, it is tolerated by LUCY and the end of ends of the central alignment band. At every mismatched the good quality region (the dark line) is not changed. base between the two sequences on the current alignment Finally, we delete one base at the same test position from band, there are three choices to continue the extension ‡ the second sequence to create a gap in the alignment. effort: skip one base on the ﬁrst sequence, skip one base This deletion smears the LUCY search space a bit near the on the second sequence, or skip both. A stack is set up to interruption. However, LUCY quickly recovers the rest of record the choices being made at each mismatch position the high afﬁnity region and reports the same end of good so the algorithm can come back to it later and try another quality region as shown in Figure 5d. choice. The depth-ﬁrst search algorithm continues to ﬁnd The beneﬁt of the depth-ﬁrst search approach over more high afﬁnity regions and makes more choices at dynamic programming alignment is in its almost linear- each mismatch until one of the following stop conditions becomes true: there are sufﬁcient matched bases along Actual data from the right end sequencing of the clone ATIEP78 in the Arabidopsis project (Lin et al., 1999). § ‡ Duplicate tags at different locations may be lost, but this will not inﬂuence No gap will actually be created in the output since only the good region on the outcome. the ﬁrst sequence will be used. 1097 H.-H.Chou and M.H.Holmes time speed. The disadvantage of this method is that it programmed to ﬁnd and remove all vector fragments does not guarantee ﬁnding the longest alignment possible that satisfy the search criterion. The quality of the base between the two sequences under the same alignment calls is usually poor at the beginning of a sequence, conditions. Since the search range is limited, when a much but gradually improves when moving into the sequence. longer bad quality region separates two otherwise good Therefore, the search criterion has to be made adaptive alignments between the two sequences, this method will to the average quality of the bases, such that in the low not be able to ﬁnd the other alignment. Normally we quality region we allow shorter vector fragments to be will not use unconnected shorter good quality regions, so identiﬁed, but in the high quality region we require a more this limitation is not critical and is rare to occur. See the deﬁnite identiﬁcation to avoid cutting good sequences Section Discussion for more explanation. short. Although there can be many different quality areas, we have found that using just three with low, medium Vector splice site trimming and high-stringency search criteria serves the purpose well LUCY requires a vector sequence file and a (see Figure 6). splice site file to conduct the splice site trimming The default range for the three search areas are 40, 60 and contaminant removal steps. The vector sequence- and 100 bp, with three different minimum alignment file contains a single long sequence of the whole lengths of 8, 12 and 16 bp. A local optimal alignment vector, which is used only in the quick contaminant within each area must be equal to or longer than these removal step discussed next. The splice site file is minimum lengths before it is considered an identiﬁed what is actually used for the splice site trimming step. It vector fragment. This means that, by default, LUCY will contains two splice site template sequences upstream and search for splice sites in the ﬁrst 200 DNA bases. If the downstream from the insertion point on the vector (see splice site is not found in the ﬁrst 200 bases or there are Figure 6). The splice site sequences are usually 100–150 some special conditions that occur, additional searches bases in length, with some bases overlapping each other beyond the default areas will also be conducted. When around the insertion point. They should also include any comparing sequences to the upstream splice site template, short linker sequences used during cloning reactions. LUCY will ﬁrst ﬁnd the longest alignment with at least Their actual lengths are not critical, as long as the splice three matched bases for every one mismatch in it. This site is totally covered. does not mean the alignment will have 25% errors in With two splice site sequences in the splice site file, it because only the highest scoring local alignment is LUCY assumes all input sequences are read in the same used. Shorter alignments to the left of the found one can direction as the splice site sequences. Usually users know be ignored. However, if there are other qualiﬁed local the exact direction of their sequencing reactions, i.e. from alignments that can be found after the best alignment, which end of the clones they are reading the data, thus LUCY will continue the process until all potential vector trimming needs to be done just along that direction. fragments have been eliminated. After the end of the However, if that is not the case and the input may consist upstream splice site has been determined, the rest of the of sequences from both forward and reverse reads of sequence will be searched for the downstream splice site clones (probably due to laboratory or sequence tracking with the highest (16 bp) stringency. This is to guard against errors), then LUCY can be instructed to do bidirectional short inserts. trimming as well. If LUCY sees both forward and reverse The three initial search areas are ﬁxed at the beginning splice site template pairs in the splice site file (i.e. of each sequence, without regard to where the good a total of four splice site sequences), it assumes that quality region begins. There may be questions as why bidirectional trimming has been prescribed. In TIGR, we they are not made to stick with the end of the low quality always run bidirectional trimming despite the fact that region (i.e. the beginning of the good quality region) so only one of the trimming directions is actually needed. the alignment criteria are more in line with the actual Trimming in the unnecessary direction might cause a few quality composition of each input sequence. Actually, one sequences to be shortened a little bit, but it guarantees that previous version of LUCY was designed exactly that way, there can be no vector fragments in the good region even but that design was dropped later in favor of the ﬁxed when the assumed read direction of some input sequences search areas for the ﬁrst search attempt. There are several is wrong. reasons for this decision. First, the average position of Because vector splice sites are usually at the beginning the splice sites is related to the speciﬁc vector being of a sequence where the quality of bases is low, a simple used and is more or less ﬁxed at the beginning of each sequence comparison that looks only for the longest sequence, meaning that it depends much less on the alignment with the upstream splice site sequence does sequencing quality. Therefore, shifting the search areas not guarantee ﬁnding all vector fragments that may have to the beginning of the good quality region often causes been obscured by base-call errors. Instead, LUCY is vector fragments to be missed for sequences that have 1098 DNA sequence quality trimming upstream template direction of sequencing downstream template splice site fragment insert splice site area1 area2 area3 search area for end of insert 40 60 100 400 - 800 8 bp min. match 12 bp min. match 16 bp min. match Fig. 6. LUCY’s vector fragment searching areas on a typical sequence. number of sequences at each length or percentage value point average good region length at each length or percentage point (a) (b) (c) default cutoff 0 0 100 200 300 400 500 100 200 300 400 500 10 20 30 40 50 percentage of vector vector fragment length in vector fragment length in the whole sequence the good region only fragments in the good regions Fig. 7. Performance of the simple tag matching method in separating contaminants and good sequences when running against different regions of a sequence. (a) When running on the whole sequence, there is no clear separation between good and contaminated sequences. (b) When running only on the good region of each sequence determined during previous steps, a separation becomes possible since most sequences have far fewer contaminant matches in their good regions. (c) An even better separation can be achieved when taking the fraction of vector matches in the good region of each sequence. The default separation cutoff is set at 20%. a longer initial low quality region. Although it can be the sequence is carried out similarly. If users wish to tell argued that missing vector fragments in the initial low LUCY that they are processing EST sequences but they quality region is acceptable because this region will be also wish to keep the poly-A/T tags for their purposes, excluded off the good region anyway, we ﬁnd that it is they can issue the keep option in combination with the still worthwhile to positively identify the position of vector poly-A/T trimming option cdna. splice sites for purposes such as estimating clone lengths, Contaminant detection sequencing quality assessments, etc. Note that LUCY may Contaminants in the input sequences can come from many conduct additional vector fragment searches under various special conditions. These additional searches are indeed sources, such as Escherichia coli or human. However, adaptive to the good quality region. We will explain in most common ones are cloning vectors themselves. A the Section Discussion some of the special conditions that vector can potentially splice with another vector, forming a trigger additional searches. vector insert. A vector can also pick up a very short insert that causes much of the sequence read to be the vector Poly-A/T tail removal itself. These events do not happen frequently, roughly If the raw DNA sequences are obtained from an at the order of one in a few thousand sequences, but EST library, some users want their poly-A/T tags they do happen. The actual frequency depends on the to be removed before clustering. LUCY does this speciﬁc vector being used for cloning and the laboratory quickly after vector trimming by searching for the ﬁrst processing. Since these are rare events, it is not worthwhile min span (10) or longer poly-T fragment within the to spend a lot of time screening for a few contaminants. ﬁrst initial search range (50) bases inside the We want a method that can quickly identify potential vector-free good region, then attempts to extend from contaminants and throw them off. We do not need to know this initial poly-T seed toward the center of the sequence, where in the contaminated sequences the contaminants allowing no more than max error (3) mismatches match. between every min span (10) consecutive T bases in Our solution is to pre-construct a tag pool from the the scan. This is therefore a linear-time and linear-space full-length vector sequence file. The tag pool is operation. The poly-A tail trimming at the other end of made of every vector tag size (10 bp) fragment of the 1099 H.-H.Chou and M.H.Holmes contaminant sequence. Tags for the reverse strand of the such as E.coli, yeast, or human. Potentially, we can in- crease the tag size used and check the sequential ordering contaminant are automatically generated and put into the of matched tags to obtain more conclusive evidence of same pool as well. The pool is then sorted and duplicate contaminant matching. This can be done similarly to the tags are removed. This is similar to the good quality region secondary sequence extension step explained previously. extension step mentioned previously, but it needs to be However, this larger contaminant screening capability done only once for the entire running session of LUCY. has not been built into LUCY yet. In TIGR, we use During screening, each input sequence is converted into another tool, the dds.btab program derived from the tags and searched against the contaminant tag pool to see if AAT package (Huang et al., 1997), to screen for larger there is any match. We count the number of matches found contaminants after LUCY processing. on a sequence and use that to determine if the sequence is a contaminant. SUMMARY OF PARAMETERS AND OTHER This method may sound too simple to be effective. SYSTEM CONSIDERATIONS Indeed, if we try it on a test set of 5022 sequences from the Arabidopsis thaliana genome project (Lin et al., LUCY’s control parameters are summarized in Figure 8. 1999) as shown in Figure 7a, it does not seem to tell us Most of these parameters have been discussed before, exactly which hits are real contaminants. Most sequences except the following: (98%) have less than 30 bp total match to the contaminant • The three pass along values deﬁne the minimum, sequence. However, beyond 30 bp, the distribution seems maximum and medium clone lengths of a particular to go uniformly all the way up to 530 bp. Two percent of library. Usually, the preparer of the library knows these the test sequences have nontrivial good regions and should values. They are critical to the fragment assembly pro- not be simply discarded. Most of them are just shorter gram in bridging contigs and constructing scaffolds, insert sequences that still contain valuable data. We need but LUCY does not interpret these values. It just passes a better way to distinguish useless sequences from useful them along with the output sequences. ones. Thanks to the sophisticated vector splice site trimming • After all kinds of checking, comparing and trimming, step above, if we run this contaminant search only within the good region of a sequence must still be longer than the good region of each sequence, the distinction becomes the minimum good sequence length (100 bp) for it to more obvious. Basically, this rules out all vector splice be considered useful to the subsequent data processing sites in the comparison and therefore reduces matches stages. We do not want our assembly program to to the contaminant found within short inserts, as seen in receive many small, trashy fragments. Figure 7b. This revised search criterion sharply pushes • If users have multiple CPUs in their computers, they sequences to the left and reveals that most (99.5%) can dramatically increase LUCY’s speed by telling it to sequences have less than 30 bp matches to the contaminant run xtra execution threads concurrently. For example, in their good region. We can improve the separation even on a Quad-CPU computer, LUCY’s execution time can further by looking at the fraction of the good region be cut to a quarter of its time spent on a single CPU of each sequence involved in contaminant matches. The computer. By default, LUCY runs just one thread. result, as shown in Figure 7c, is a clear separation of the only vector insert, which has a 47% match to the • Users can tell LUCY to use a speciﬁc set of output ﬁle contaminant in its good region. This is about one in ﬁve names, to be quiet during processing and only report thousand input sequences. The other sequences all have errors, to inform them of names of sequences that are less than 10% match in their good regions. However, some dropped due to low quality concerns or are resurrected short inserts may still be dropped because they have less by secondary sequence extension, and to produce a than the minimum good sequence length. We will explain debug ﬁle listing the computed tag values (shown in this further in the following Section Discussion. The Figure 2) for each sequence. These tag values reveal vector cutoff (20%) is used to separate contaminants how LUCY arrives at a trimming decision for each sequence. from good sequences. Readers should note that although this simple All parameters of LUCY are explained in greater detail in tag-matching method is adequate to recognize vector the document that comes with its distribution. contaminants, a much more sophisticated algorithm is LUCY does not access (nor depend on) a database needed to screen successfully for larger contaminants server. It is a design decision to separate LUCY from † any site-speciﬁc infrastructure assumption. In TIGR, we Among the 5022 sequences, 310 have less than minimum good sequence length and are not included in the statistics. use a separate program, RICKY, to drive LUCY and 1100 DNA sequence quality trimming Parameters Default values Related operation steps pass_along min_value max_value med_value 0 0 0 pass to assembly program error max_avg_error max_error_at_ends 0.025 0.02 quality area determination window window_size max_avg_error … 50 0.08 10 0.3 quality area determination bracket window_size max_avg_error 10 0.02 quality area determination range area1 area2 area3 40 60 100 vector splice site trimming alignment area1 area2 area3 8 12 16 vector splice site trimming vector vector_sequence_file splice_site_file none vector splice site trimming cdna [min_span max_error initial_search_range] none or 10 3 50 poly-A/T trimming keep none poly-A/T trimming size vector_tag_size 10 contaminant removal threshold vector_cutoff 20 contaminant removal minimum good_sequence_length 100 overall quality control xtra cpu_threads 1 overall program control output, quiet, inform_me, debug none overall program control Fig. 8. Summary of LUCY parameters and their default values. provide input/output bridging between LUCY and our LUCY makes no assumption about the order of se- quences in the three input ﬁles: the ﬁrst sequence ﬁle, database server. Since programs like RICKY are built the companion quality value ﬁle, and the optional second around a speciﬁc infrastructure that varies from institution to institution, users need to design their own solutions sequence ﬁle. As long as all necessary information can be found, DNA and quality sequences can be in different to automate the process of running base-calling software, order in the ﬁrst two input ﬁles. LUCY’s output will getting input to LUCY, converting LUCY’s output to SQL always be in the order of the ﬁrst sequence ﬁle, so a update commands, and ﬁnally uploading LUCY’s output to trivial ‘feature’ of LUCY is to sort quality sequences their databases. RICKY will not be useful outside TIGR’s with the DNA sequences. A DNA sequence without environment. LUCY, on the other hand, does not depend its companion quality sequence or vice versa will be on any other program to run. Users can operate LUCY reported as an error. The second sequence ﬁle is allowed by manually providing its input data, as is often done in to have missing sequences that appear only in the ﬁrst smaller research laboratories. sequence ﬁle. Sequences that appear only in the second LUCY reads its input data from multi-FASTA text sequence ﬁle will be ignored by LUCY. The following is ﬁles, and writes its output also to multi-FASTA ﬁles. A a typical output sequence from LUCY, where the header multi-FASTA ﬁle is simply the concatenation of many line contains this information: the sequence name, the FASTA sequences (National Center for Biotechnology three pass along values of the minimum, maximum and Information, 2001), like those generated directly by phred. medium clone lengths of the library, and the left and right Each DNA sequence in a multi-FASTA ﬁle is delimited trimming positions determined by LUCY. by a header line that begins with the greater-than symbol ‘>’ followed by the sequence name, and may include >GCCAA03TF 1500 3000 2000 43 490 AGCCAAGTTTGCAGCCCTGCAGGTCGACTCTAGAGGATCCCCAGGATGATCAGCCACATT other information about the sequence. LUCY records GGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTGGGGAATCTTGCGCAATGG only the sequence name information which is terminated GCGAAAGCCTGACGCAGCCATGCCGCGTGAATGATGAAGGTCTTAGGATTGTAAAATTCT TTCACCGGGGACGATAATGACGGTACCCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGC by the ﬁrst space character that appears on the header ... line. The header line is followed by several more lines of data making up the actual DNA sequence, which At TIGR, we run LUCY on a Sun workstation running can be in either upper or lower case letters. LUCY the Solaris operating system (Version 5.5.1). LUCY has accepts the standard IUB/IUPAC DNA codes including also been compiled and run successfully under the Linux the ambiguous letters (e.g. N for unknown bases). For each operating system on a PC. Almost all ANSI C compilers base in a DNA sequence, there must be a corresponding should be able to compile LUCY source code since it is number denoting its quality value in the companion programmed using only standard C libraries and header quality sequence. Quality sequences share the same header ﬁles. LUCY has not been run on the MacOS or Windows line format but are made up of numbers separated by by us, but we believe porting it to these two platforms spaces. Blank lines are not allowed in the input ﬁles and should not be too difﬁcult since all source code are all lines must be shorter than 256 characters. Each input included in its distribution. It is very likely that LUCY can sequence is terminated by either the beginning of another run without any signiﬁcant modiﬁcation inside a Windows header line or by the end of the input ﬁle. command shell. 1101 H.-H.Chou and M.H.Holmes LUCY is a CPU-bound program, not a memory-bound software for genomic data processing. LUCY’s simple program. LUCY’s static memory requirement grows very interface requirements allows it to be easily incorporated slowly with the number of input sequences, at roughly into any existing pipeline. LUCY is written in ANSI C 60 bytes plus the sequence name storage for each input code so it may improve certain processing steps when sequence. This is because LUCY does not store any compared to Perl or script based solutions. Additionally, sequence data in memory after processing them. To save since LUCY is independent of any database infrastructure, memory, LUCY uses direct ﬁle addressing to access each it can be used automatically or manually, by both large and sequence when it is needed, and stores only those pointers small research laboratories. in the memory. Therefore, by all practical considerations LUCY can handle any number of input sequences. The DISCUSSION OF SPECIAL CASES dynamic memory requirement of LUCY is proportional to We have given separate descriptions of LUCY’s major the longest sequence in the input data, not the number processing steps. However, the actual processing within of input sequences, but its actual size varies from one LUCY is far more complex than what has been explained. processing step to the next. Often times it is the joint result from these processing steps The slowest step in LUCY is the vector trimming step that is used to make the ﬁnal decision about the fate of a which employs a quadratic time dynamic programming sequence. Trimming vector splice sites in the low quality algorithm to ﬁnd vector fragments in each sequence. To region of sequences proves to be a very difﬁcult problem. speed up LUCY’s processing, multi-threading capability Since version 1.03 of LUCY (now at 1.16), all new has been built into LUCY as introduced above. On a Linux improvements were made possible because people using PC with four CPUs running at 500 Mhz, it takes LUCY just LUCY discovered abnormal trimming results with their 1 min and 12 s to process 5022 raw input sequences. When eyes and provided them to us for inspection. We found that using one CPU, LUCY’s processing time will be roughly solutions for many of these special cases can contradict four times that. This performance is considered acceptable each other, meaning that making LUCY handle one special to most applications. Also related to vector trimming, it case may introduce new problems in sequences that had is important that users do not turn on phred’s quality been handled correctly before. We present some examples trimming function when doing base-calling. This often in this Section Discussion to illustrate the difﬁculty of cuts the resulting sequences short and prevents LUCY the problem. This presentation is not meant to include all from seeing the vector fragments. Leaving the trimming special cases that can happen during LUCY processing. to LUCY will give it the information it needs to do its job To start with, LUCY always looks at the initial search well. areas for any vector fragment (Figure 9a). Under normal We cannot provide any generally applicable statistics conditions, there are some low quality bases (20–50 bp) of the trimming results of LUCY because these certainly at the beginning of a sequence. Vector fragments are well depend a lot on input data quality, contaminant contents within the initial search areas and can be identiﬁed even and LUCY’s parameter settings. We can, however, provide within the low quality region. The rest of the sequence is some information speciﬁc to the 5022 test sequences then searched against the downstream splice site sequence we mentioned earlier that are obtained from TIGR’s to guard against short inserts. Usually the clone is much Arabidopsis genome project (Lin et al., 1999). With longer (2 kbp) than the sequencing machine can read default LUCY parameters, 310 of the input sequences at once (<900 bp), so LUCY will not ﬁnd any vector are discarded because their good regions are shorter than fragment at the other end of a sequence read. Most input the minimum length of 100 bp after trimming. Of the sequences (well over 90%) fall into this case. remaining 4712 useful sequences, their total length is Sometimes, a low quality region can separate two good 4148 326 bases, and their total good region length is regions, as seen in Figure 9b. This may be the result of 2072 533 bases. Therefore, about 50% of the raw data incidents such as power interruption caused by lightning are trimmed away by LUCY. The resulting good regions during sequencing runs . In this case, LUCY ﬁnds the have a base-call conﬁdence level of above 97.5%, and longest high quality region and uses that as the good re- contain no vector fragments or contaminants. Among the gion. Although the other good regions, as long as they are 5022 input sequences, only one is suspected to be a vector longer than minimum good sequence length, could still be insert. used separately, this would require LUCY to generate new Although many major genome sequencing centers have sequence names that could cause some database tracking already developed their own automatic data processing nightmares (e.g. what subsequences come from which pipelines to perform some of the tasks LUCY is designed original sequence, etc). The extremely low frequency of to do (Veklerov et al., 1996; Smith et al., 1997; Dear et al., 1998; Wendl et al., 1998), we believe LUCY can still † This is no joke; it actually happened a few times in TIGR during be a useful addition to the freely available bioinformatic thunderstorm seasons. 1102 DNA sequence quality trimming low quality region high quality region vector fragments (a) good region (b) good region (c) good region (d) good region (e) good region (f) i good region (g) i good region (h) good region (i) good region initial search areas secondary and end of insert search area 200 400 - 800 Fig. 9. Examples of special cases of sequence cleaning that LUCY has been designed to handle. Although we plot each example using the same length sequence for esthetic reasons, it should be understood that each sequence can be of variable length and can contain more than one problematic scenario. See the main text for explanation of each case. this occurrence, and the nature of the shotgun fragment sequence inside the high quality region, but no other assembly principle, make the effort to salvage the shorter vector fragments are found near it to conﬁrm the match? good regions in a sequence unnecessary. Should LUCY treat it as a positive identiﬁcation of a vector Vector fragments may not always be in the low quality fragment and trim the sequence accordingly, or should region. Sometimes, they run into the good quality region LUCY consider it a match by random chance that can as well. The other end of the vector splice site may also be ignored? There is no absolute correct answer to this. be found towards the end of a sequence, as shown in Therefore, LUCY employs a heuristic judgment: if the Figure 9c. If the sequence is a short insert, the length of distance between the vector match and the end of the the good region may be much shorter than the length of the search areas is shorter than the width of the vector match sequence, as exempliﬁed in Figure 9d. Normally, LUCY itself, i.e. i < j in Figure 9f, then LUCY declares it a will drop a sequence whose good region is shorter than vector fragment and trims it. Otherwise, LUCY rejects it minimum good sequence length to prevent useless data as a random match. Additionally, if the vector match is from clogging the subsequent data processing stages. equal or more than half the total length of the good region Sometimes, the low quality region runs longer than the ( j > 1/2 length of good region), then LUCY will trim it as initial search areas. In this case, as shown in Figure 9e, well. Since the search areas are 200 bp long, it basically LUCY has to make a second attempt to search for vector means that LUCY will trim all singular vector matches fragments beyond the initial search areas, in order to ﬁnd longer than the minimum alignment length in each area, any additional vector fragments inside the high quality but will ignore most short vector fragment matches in the region. The reason for doing this is because some of the middle of a long high quality region. low quality bases may mask the true identity of vector The heuristic above means that even if LUCY does throw fragments. In order to guarantee that no vector fragment is out some false-positive vector fragments, they will all be included in the reported good region, LUCY must search limited to the ﬁrst 200 bases of a sequence, and LUCY part of the high quality region beyond the initial search needs much more concrete evidence before it will accept areas as well. This is the second adaptive vector splice site a vector fragment match inside the high quality region. search we mentioned previously. In addition to these, the heuristic has to be augmented If LUCY does not ﬁnd any vector fragment within the for additional special cases where the low quality region initial search areas, it may be the case that the sequencing extends much into the sequence and goes beyond the PCR reaction started well before the splice site. To guard initial search areas, as shown in Figure 9g. In this new against this, LUCY needs to conduct a second search case, the boundary of heuristic judgement will be shifted attempt beyond the initial search areas as well. However, to the right end of the low quality region. The reason for this may create another problem, as shown in Figure 9f. this is similar to case 9e, to prevent any vector fragment What if LUCY ﬁnds a short match to the splice site from getting into the good region selected by LUCY. 1103 H.-H.Chou and M.H.Holmes Another special case is when LUCY ﬁnds a vector This allows the wisdom and experience of biologists to be fragment that extends right up to the end of the initial slowly translated into functional program code. search areas, as seen in Figure 9h. Here, the end of the Further improvements to LUCY may prove difﬁcult, as vector fragment may be the actual boundary of it, or it LUCY now strikes a delicate balance among the various may simply be an artiﬁcially created boundary because of strategies for dealing with real-world problems that have the fact that LUCY searches for vectors in the ﬁrst 200 been encountered at TIGR. If LUCY is able to do a better bases ﬁrst. To conﬁrm the actual ending of the vector job today than its previous versions, it is all because we fragment, LUCY has to conduct the second search as have a few hardworking biologists who actually talk to well . Finally, even when splice site trimming is done computer scientists. normally and there is a good region inside a sequence, if the good region length is less than the minimum ACKNOWLEDGEMENTS good sequence length after the length of matched vector We wish to thank Dr Granger Sutton and Dr Tony Kerlavage fragments are subtracted from it, LUCY will still reject the for leading us into this difﬁcult but intriguing problem. We sequence in fear of introducing contaminants into the data would also like to thank Ms Anna Glodek, Mr John Scott processing pipeline. Matched vector fragments, as shown and Mr Terrance Shea for providing us their valuable in Figure 9i, are found during the quick contaminant suggestions when we were developing LUCY. Finally, removal step. we thank Dr Steven Salzberg and Dr Claire Frasier for Although these special cases are also discussed sep- generously allowing us to make this program freely arately, it is not necessary that they occur separately. available to the academic community. More often for low quality sequences, they actually occur simultaneously. For example, a sequence can have a REFERENCES short insert, a long initial low quality region, another low Dear,S., Durbin,R., Hillier,L., Marth,G., Thierry-Mieg,J. and quality region separating two good quality regions, and a Mott,R. (1998) Sequence assembly with CAFTOOLS. Genome single vector fragment match inside its good region with Res., 8, 260–267. no other vector fragment found in the initial search areas Ewing,B. and Green,P. (1998) Base-calling of automated sequencer due to numerous base-calling errors. LUCY has been traces using phred. II. Error probabilities. Genome Res., 8, 186– made to handle most, if not all, possible combinations of special situations together. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assess- ment. Genome Res., 8, 175–185. CONCLUSION Huang,X., Adams,M., Zhou,H. and Kerlavage,A. (1997) A tool for Although there are many DNA sequence comparison analyzing and annotating genomic sequences. Genomics, 46,37– and analysis algorithms in the bioinformatics literature, our experience has been that to make them function Lin,X. et al. (1999) Sequence and analysis of chromosome 2 of the correctly inside a program that processes real-world data, plant Arabidopsis thaliana. Nature, 402, 761–768. the programmers must make an extra effort to consider National Center for Biotechnology Information (2001) FASTA ﬁle special cases reported by users. This is especially true format speciﬁcation. Available on the Internet. http://www.ncbi. nlm.nih.gov/blast/html/search.html when the amount of data is large and the quality of data Paracel (2000) TRACETUNER, capturing the most information can be low, as in the sequence cleaning stage. We believe from the latest DNA sequencing systems. Information on the that this is not just an isolated situation for LUCY.Itis, Internet. http://www.paracel.com/html/tracetuner.html indeed, a common phenomenon that arises during many Smith,T.M., Abajian,C. and Hood,L. (1997) HOPPER: software for bioinformatic software development cycles. Due to the automating data tracking and ﬂow in DNA sequencing. Comput. inherently unpredictable nature of biological data, there is Appl. Biosci. (CABIO), 13, 175–182. always some distance between the theorectical design of a Veklerov,E., Eeckman,F. and Martin,C. (1996) MTT: a software bioinformatic solution and the successful implementation tool for quality control in sequence assembly. Microb. Comp. of the solution in a working program that can handle Genomics, 1. real-world data reliably. The only way to shorten such Wendl,M., Dear,S., Hodgson,D. and Hillier,L. (1998) Automated distance to perfection, in our opinion, is to form a close sequence preprocessing in a large-scale sequencing environment. Genome Res., 8, 975–984. collaboration between computer scientists and biologists. This used to be a bug in previous versions of LUCY. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/dna-sequence-quality-trimming-and-vector-removal-g0as5CZdsD

Loading next page...

References (8)

Athanasios Theologis, Joseph Ecker, Curtis Palm, N. Federspiel, S. Kaul, O. White, Jose Alonso, Hootan Alta, R. Araujo, Cheryl Bowman, Shelise Brooks, Eugen Buehler, April Chan, Qimin Chao, Huaming Chen, Rosa Cheuk, Christina Chin, Mike Chung, Lane Conn, Aaron Conway, Andrew Conway, Todd Creasy, Ken Dewar, P. Dunn, Pelin Etgu, T. Feldblyum, J. Feng, B. Fong, C. Fujii, John Gill, Andrew Goldsmith, Brian Haas, Nancy Hansen, Beth Hughes, L. Huizar, J. Hunter, J. Jenkins, Chanda Johnson-Hopson, Shehnaz Khan, E. Khaykin, Christopher Kim, H. Koo, Irina Kremenetskaia, D. Kurtz, A. Kwan, Bao Lam, Stephanie Langin-Hooper, A. Lee, Jeong Lee, Catherine Lenz, Joycelyn. Li, YaPing Li, Xiaoying Lin, Shirley Liu, Zhaoying Liu, Jason Luros, R. Maiti, Andre Marziali, J. Militscher, Molly Miranda, Michelle Nguyen, W. Nierman, Brian Osborne, G. Pai, J. Peterson, Paul Pham, M. Rizzo, T. Rooney, D. Rowley, H. Sakano, Steven Salzberg, J. Schwartz, Paul Shinn, Audrey Southwick, Hui Sun, Luke Tallon, G. Tambunga, Mitsue Toriumi, Christopher Town, T. Utterback, S. Aken, M. Vaysberg, V. Vysotskaia, M. Walker, Dongying Wu, Guixia Yu, Claire Fraser, J. Venter, Ronald Davis (1999)
Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana
Nature, 402
Brent Ewing, Philip Green (1998)
Base-calling of automated sequencer traces using phred. II. Error probabilities.
Genome research, 8 3
Bland Ewing, L. Hillier, M. Wendl, Philip Green (1998)
Base-calling of automated sequencer traces using phred. I. Accuracy assessment.
Genome research, 8 3
Xiaoqiu Huang, M. Adams, Hao Zhou, A. Kerlavage (1997)
A tool for analyzing and annotating genomic sequences.
Genomics, 46 1
Todd Smith, C. Abajian, L. Hood (1997)
Hopper: software for automating data tracking and flow in DNA sequencing
Computer applications in the biosciences : CABIOS, 13 2
E. Veklerov, F. Eeckman, C. Martin (1996)
MTT: a software tool for quality control in sequence assembly.
Microbial & comparative genomics, 1 3
M. Wendl, Simon Dear, Dave Hodgson, L. Hillier (1998)
Automated sequence preprocessing in a large-scale sequencing environment.
Genome research, 8 9
Simon Dear, Richard Durbin, L. Hillier, Gabor Marth, Jean Thierry-Mieg, Richard Mott (1998)
Sequence assembly with CAFTOOLS.
Genome research, 8 3

Publisher: Oxford University Press
Copyright: © Oxford University Press 2001
ISSN: 1367-4803
eISSN: 1460-2059
DOI: 10.1093/bioinformatics/17.12.1093
Publisher site: See Article on Publisher Site

Abstract

Vol. 17 no. 12 2001 BIOINFORMATICS Pages 1093–1104 DNA sequence quality trimming and vector removal 1 2 Hui-Hsien Chou and Michael H. Holmes Department of Zoology and Genetics, Department of Computer Science, Iowa State University, Ames, IA 50011, USA and Department of Bioinformatics, The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA Received on March 8, 2001; revised on June 11, 2001; accepted on June 12, 2001 ABSTRACT DNA data coming directly from automatic sequencing Motivation: Most sequence comparison methods assume machines often violate that assumption. In fact, most raw that the data being compared are trustworthy, but this DNA sequences have base-call errors. Nevertheless, to is not the case with raw DNA sequences obtained from make good use of these unreliable raw sequences, we still automatic sequencing machines. Nevertheless, sequence need to compare them against some other data, such as comparisons need to be done on them in order to vector and contaminant sequences, in order to prepare remove vector splice sites and contaminants. This step is them for later stages of the genomic data processing necessary before other genomic data processing stages pipeline (e.g. fragment assembly or EST clustering). can be carried out, such as fragment assembly or EST This creates a dilemma. We have uncertain raw DNA clustering. A specialized tool is therefore needed to solve sequences, but we need to compare them against other this apparent dilemma. sequences in order to obtain usable data. This data Results: We have designed and implemented a program cleaning stage is critical for the success of later stages that speciﬁcally addresses the problem. This program, of the genomic data processing pipeline, as illustrated in called LUCY, has been in use since 1998 at The Institute Figure 1. Through this cleaning stage, unreliable raw data for Genomic Research (TIGR). During this period, many from sequencing machines can be enhanced to a more rounds of experience-driven modiﬁcations were made to reliable level, so that later stages of the processing can LUCY to improve its accuracy and its ability to deal with use them without concern about their base quality . Our extremely difﬁcult input cases. We believe we have ﬁnally solution at The Institute for Genomic Research (TIGR) for obtained a useful program which strikes a delicate balance this cleaning stage is a specialized tool that was designed among the many issues involved in the raw sequence to handle this stage of genomic data processing. After cleaning problem, and we wish to share it with the research more than two years of practical use in TIGR and several community. rounds of improvement, we believe it is mature enough Availability: LUCY is available directly from TIGR (http:// now to be introduced to the scientiﬁc community. We hope www.tigr.org/softlab). Academic users can download LUCY other researchers can also beneﬁt from using it. after accepting a free academic use license. Business users may need to pay a license fee to use LUCY for METHODS commercial purposes. This tool, called LUCY, was designed to take the base-call Contact: Questions regarding the quality assessment quality assessment of each base into consideration in the module of LUCY should be directed to Michael Holmes cleaning process, to make sure the processed sequences ([email protected]). Questions regarding other aspects have the best overall quality possible based on their of LUCY should be directed to Hui-Hsien Chou (hh- individual base quality values. Thanks to a few powerful [email protected]). chromatogram base-calling programs like phred (Ewing et al., 1998; Ewing and Green, 1998) and TraceTuner (Paracel, 2000), the quality value for each individual base INTRODUCTION of a raw sequence can be estimated. The individual base There have been a lot of studies on how to align DNA quality assessment is generally reliable, but a sequence sequences with each other. Most of these studies are based can have very different base qualities among its bases. on a fundamental assumption that the sequences being compared are trustworthy, i.e. their bases are correct. † However, other quality issues may still exist, such as the need to remove However, in large genome sequencing centers, the raw chimeric sequences at the assembly stage. c Oxford University Press 2001 1093 H.-H.Chou and M.H.Holmes library fragment base-calling program with the same chromatogram ﬁle, base calling gene finding publication construction assembly LUCY may use it to extend the good quality region by automatic sequence cDNA database aligning the second sequence with the ﬁrst one at places annotation sequencing cleaning clustering construction where they match with high ﬁdelity. The rationale is that if two independent base-calling algorithms agree on some Fig. 1. Major steps of a typical genomic data processing pipeline. bases, chances are these bases are correct even though LUCY is designed to handle the sequence cleaning stage between some of them may have lower quality values. This method the base calling and fragment assembly or clustering stages. works well in processing ABI 377 and 370 chromatograms where the second sequence comes from the ABI base caller and the ﬁrst sequence comes from phred. This sequence method does not work very well on ABI 3700 sequences, CLN CLN probably because the new base-caller on the ABI 3700 quality value cutoff uses an algorithm similar to that used by phred. Generally, Low quality region trimming the step to extend the good quality region can be skipped without sacriﬁcing too much useful data, but potentially it 1st sequence may provide that extra linking leg needed to connect two contigs together during fragment assembly. 2nd sequence After having determined a good quality region, LUCY CLZ CLZ Consensus sequence extension then searches for vector splice sites on the sequence, and makes sure that they are not included in the usable sequence CLV CLV region of the sequence. This step is especially difﬁcult Shotgun Insert vector because most of the vector fragments lie in the low quality region of a sequence, so their existence and true Splice site trimming boundary cannot be easily determined by straightforward sequence alignment methods. What LUCY does to solve V e ctor Insert this difﬁculty is to search for the splice sites ﬁrst at a vector ﬁxed region at the beginning of each sequence, taking into consideration the low quality of the base calls in Contaminant removal this region. If that attempt fails, LUCY will search again adaptively, taking into account the boundaries of the good sequence CLN CLN quality region determined earlier. When searching for CLZ CLZ CLV CLV splice sites, LUCY applies different search stringencies at output Bad region Good sequence region for assembly Bad region different areas of a sequence to cope with the variable CLR CLR Output generation quality of base calls. LUCY also considers many special cases that were found through experience at TIGR. Many incremental improvements have been made to this step Fig. 2. Illustrations of LUCY’s major processing steps. See the main since the early versions of LUCY due to the difﬁcult nature text for explanations. of this problem. We believe that LUCY now functions very well with most input data. Following the splice site trimming step, contaminant LUCY’s task is to identify the largest subsequence that is sequences can be quickly identiﬁed and removed from of sufﬁciently high quality and also free of contaminating the input data. Because all vector splice sites have vector sequence. been trimmed during the previous step, the contaminant LUCY does its job by ﬁrst determining the longest removal step becomes relatively easy. Note that we are continuous high quality region of a sequence based on its concerned only with the quick detection of contaminants, base quality values. Some bases within this good quality not the details of the contaminants. Finally, markers iden- region may be wrong, but the overall value must be greater tifying the usable region of each sequence are generated than a user-speciﬁed minimum value (97.5%) . If a second in the output. Illustrations of LUCY’s major processing base-call sequence can be obtained using a different steps are given in Figure 2. LUCY’s processing steps are generally independent of each other and they produce In this paper, we include the default value of a user-adjustable parameter in parentheses. independent trimming pairs using the CL? tags shown in Discard! DNA sequence quality trimming the ﬁgure . When we refer to the good region in this paper, we mean the current cumulative trimming result after each processing step, i.e. the narrowest region obtained by combining all trimming markers. At the end of LUCY processing, the useful region of a sequence becomes the CLR region as shown in the ﬁgure. Most of LUCY’s parameters can be set by the users using its command line arguments. In the following dis- cussion of each processing step, we represent parameters Fig. 3. The raw data in a chromatogram ﬁle can be viewed as four that inﬂuence a speciﬁc step in the typewriter font. sets of overlapping peaks, one each for the A, C, G and T sequencing We also enclose their default values in the parentheses reactions. after the parameter names. Generally, these default values were determined based on our past experience. Users (a) (b b) can change these settings if they need to adjust LUCY’s (c c) behavior in their own data environment, but it should (d) be done with caution. In the user document distributed with LUCY, we provided additional information to help users determine some parameters for their speciﬁc Fig. 4. LUCY’s quality trimming steps. (a) Low quality areas are environments. trimmed from each end, then (b) regions of poor quality within the Except for the quality region determination step, all sequence are identiﬁed and removed from the clean ranges. (c) The other LUCY steps described below can be skipped if users resulting candidate clean ranges are further trimmed to satisfy the overall average probability of error criterion and the criterion of do not supply the necessary input data for a step. the probability of error at terminal bases. (d) The largest remaining candidate is chosen as the ﬁnal clean range. Quality region determination The goal of this step of LUCY’s processing is to ﬁnd the longest region of each input sequence whose base calls estimated base call error are related by the following are of sufﬁciently high quality to warrant using them. We formula: will refer to this region as the clean range in this section. This step makes use of the data in the quality ﬁles that Q =−10 × log (probability of error). are provided as input. The quality ﬁles are created by the base-calling software before LUCY is run. The base-calling programs base their estimates of the The raw data ﬁles (chromatogram ﬁles) from automated probability of error on a number of factors, including peak shape, spacing between peaks, signal strength and DNA sequencing machines, such as the ABI 377 and 3700 sequencers, contain four sets of data traces, one background noise (Ewing et al., 1998; Ewing and Green, 1998). LUCY ﬁrst converts the data from qualities to each for the A, C, G and T sequencing reactions. The probabilities of error according to the formula above (e.g. temporal relationship among the four traces is known, Q = 0 is converted to probability of error = 1.0). All of because the four reactions are run through the sequencer the major steps in the quality trimming process involve at the same time, in a single capillary or a single lane of a calculating some average probabilities of error within polyacrylamide gel. The traces can be visualized as a set certain windows along the length of the sequence. These of peaks, as shown in Figure 3. steps are illustrated in Figure 4. It is the job of the base-calling software to examine Since the beginning and end of each sequence are the traces and determine the sequence of base calls. In typically of low quality, the ﬁrst step is to remove the addition to calling the bases, the base-calling programs lowest quality data from each end. This step is controlled phred and TraceTuner provide a quality value for each by the bracket parameter. Starting from the left end base called. The quality value is based on the estimated of the sequence, LUCY ﬁnds the ﬁrst window of size probability that the base call is in error. Quality ( Q) and window size (10) that has an average probability of error of max avg error (0.02) or less. Similarly, starting These are historical TIGR database ﬁeld names. Their meanings are CLN: clear of bad quality data, CLZ: clear of zgrasta (an alignment tool used to from the right end, LUCY ﬁnds the last window of align the two base-call sequences before LUCY was available), CLV: clear of size window size meeting the same criterion. These vector fragments, and CLR: clear of all bad things. The contaminant removal windows and the bases between them are then subjected step does not produce trimming tags but sets the trash tag of contaminated sequences, which are represented by CLR left and right being both 0. to the remaining steps, and the bases at the ends of the 1095 H.-H.Chou and M.H.Holmes sequence that failed this step are excluded from further In a small minority of cases, quality trimming may pro- consideration. duce more than one candidate clean range that satisﬁes all The next step is to identify regions of the sequence that of the speciﬁed criteria. Only the largest is kept, however. have unacceptably high error rates, and exclude those This point is discussed further in the Section Discussion. regions from the ﬁnal clean range. This step is controlled Consensus sequence extension by a set of window size and max avg err pairs given to the window parameter . By default, LUCY uses two After good quality region determination based on the window sizes in this step: a 50-base window size1 quality of individual bases of a sequence, an optional step allowing a max avg err1 probability of 0.08, and a is to align the sequence to a second sequence, if available, 10-base window size2 allowing a max avg err2 prob- that is obtained using a different base-calling algorithm ability of 0.3. The purpose of the larger window size1 is with the same chromatogram ﬁle. The rationale is that if to exclude large regions of poor quality, but it may fail to two different base-calling algorithms agree on a speciﬁc exclude smaller regions of very low quality. The smaller base call, then that base call is likely to be correct even if window size2 excludes those smaller regions. its quality conﬁdence value is low. This is an attempt to LUCY starts with the candidate clean range from extend the good quality region by using the result from a the ﬁrst quality trimming step, and subjects it to the second base-calling program to boost the conﬁdence level largest window from the window options. From the assigned by the ﬁrst program. An important requirement beginning of the candidate clean range, LUCY calculates when doing this step is that the second sequence must be the average probability of error in each window of size obtained from a different algorithm. Otherwise, the second window size1. Each such window that has an average sequence can be almost the same as the ﬁrst sequence and error of max avg err1 or less is added to a new candidate a false good region can be extended all the way to both clean range, which keeps growing until some window ends of a sequence. This certainly ruins the purpose of the fails this criterion, then the current candidate clean range previous quality trimming step. is terminated. LUCY continues with the next window, In TIGR, the ﬁrst sequence is obtained from phred and and begins a new candidate clean range with the ﬁrst the second sequence comes from the built-in base-calling subsequent window that passes the same error criterion. So this step starts with a single candidate clean range, but program with the ABI 377 sequencers. Normally, in good may result in several candidate ranges being produced. quality regions, the alignment is almost perfect between Each of the candidate clean ranges produced by consid- the two sequences. But in the lower quality areas, the ering the largest window size is then subjected to the next ABI base-caller tends to call the N base (unknown) while window size (window size2) in the same way, and so on, the phred program almost always attempts to guess one. until all window sizes have been considered. Each of these Therefore, this extension step works reasonably well due steps may produce further fragmentation of the candidate to the different characteristics of the base-callers. Recently clean ranges. However, any candidate clean range that is we have stopped doing this extension step at TIGR. The smaller than the length speciﬁed by the overall minimum reason for this is that we want to be able to state our parameter (100) is eliminated from further consideration, quality trimming criteria in terms of base calling error and in practice the number of sequences which produce probabilities. The sequence extension step makes this multiple candidates of sufﬁcient size is small. impossible. Another reason is that the new base-calling The ﬁnal quality trimming step is controlled software bundled with ABI 3700 seems to utilize similar by the error option, with max avg error and base-calling algorithms as in phred, thus the sequences it max error at ends parameters. The max avg error generates are very similar to the phred output and cause parameter speciﬁes the maximum overall average proba- this extension step to extend too much into the bad quality bility of error allowable in the ﬁnal clean range (0.025), region. However, we retain this sequence extension feature and the max error at ends parameter speciﬁes the in LUCY because it is still conceptually sound, and for maximum probability of error for each of the two bases those who are using ABI 377 seuqencers, this step can at the ends of the clean range (0.02). For each of the sometimes extend the usable sequence length. candidate clean ranges produced by the preceding step, Since both sequences should be very similar in LUCY ﬁnds the largest subsequence that satisﬁes both their good quality regions, instead of running a time- of these criteria, and the largest such subsequence found consuming dynamic programming alignment, LUCY among all the candidates becomes the ﬁnal clean range. attempts to quickly locate the band of alignment by doing Do not confuse these with similar options associated with the bracket the following. It converts the ﬁrst sequence into 16 bp parameter; they are independent of each other. ‡ tags, sorts the tags and removes duplicates from the set. Again, the max avg error here is independent to the other similar options associated with the other parameters. Each tag has an accompanying index linking back to its 1096 DNA sequence quality trimming (a) (b) (c) (d) Fig. 5. Quick afﬁnity region ﬁnding using the depth-ﬁrst search algorithm in LUCY. (a) A partial dot graph showing matched bases between two sequences at the right end of an alignment. This partial dot graph is chosen such that the alignment band is right on the diagonal. (b) The search space (dark grey areas) LUCY explored when conducting the extension effort. The dot graph has been lightened in this and subsequent sub-ﬁgures to facilitate reading. (c) The LUCY search space after a mismatch has been introduced at the third base position in the ﬁgure. (d) The LUCY search space after a deletion is made to the third base position on one sequence. the current search path such that the previous mismatch position in the sequence . Next, the second sequence is sequentially converted into 16 bp tags and searches are can be tolerated; the end of either sequence has been quickly conducted in the ﬁrst sequence tag set to ﬁnd the reached; there are over ﬁve mismatches along the current same tags. If some tags can be found on both sequences, search path; or over 100 bases from the ﬁrst mismatch their indices in the ﬁrst sequence are compared against has been scanned without ﬁnding sufﬁcient matched their indices in the second sequence, and their differences bases to compensate the mismatches. These parameters are recorded in a separate hit pool with associated hit come from the choice of an alignment stringency of 95%. counters. If a difference value already exists in the hit Thus for every 100 bases, there can be no more than ﬁve pool, its hit counter will simply be increased by one. At mismatched bases. the end of the tag comparison processing the difference Figure 5a depicts the dot graph around the right end of value with the highest hit count determines the relative the alignment extension between two base-call sequences offset of the two sequences in their best alignment regions. of a read . The dark line at the upper 1/3 of the diagonal Once the relative offset is known, their central align- marks the end of the good quality region LUCY reports. ment band (supposedly the good quality region) can be Figure 5b reveals the actual search space LUCY explored quickly located. Extension attempts are then carried out when doing the alignment extension. We can see that on both ends of this alignment band. Again, since the LUCY’s exploration is very conﬁned and almost linear two sequences should be mostly similar except in the low in the direction of the alignment diagonal. To test the quality regions, we only need to quickly connect the high search behavior of LUCY under different conditions, we afﬁnity regions; we are not concerned with an optimal intentionally create a mismatch pair at the third base of alignment between the two sequences in the low quality the portion of sequences covered in these dot graphs. regions. Therefore, this can be formulated as a search It does not change the search space at all, as shown in problem using the more efﬁcient depth-ﬁrst search algo- Figure 5c. Because there are more than 20 matches beyond rithm to locate the high afﬁnity regions extending from the this mismatch, it is tolerated by LUCY and the end of ends of the central alignment band. At every mismatched the good quality region (the dark line) is not changed. base between the two sequences on the current alignment Finally, we delete one base at the same test position from band, there are three choices to continue the extension ‡ the second sequence to create a gap in the alignment. effort: skip one base on the ﬁrst sequence, skip one base This deletion smears the LUCY search space a bit near the on the second sequence, or skip both. A stack is set up to interruption. However, LUCY quickly recovers the rest of record the choices being made at each mismatch position the high afﬁnity region and reports the same end of good so the algorithm can come back to it later and try another quality region as shown in Figure 5d. choice. The depth-ﬁrst search algorithm continues to ﬁnd The beneﬁt of the depth-ﬁrst search approach over more high afﬁnity regions and makes more choices at dynamic programming alignment is in its almost linear- each mismatch until one of the following stop conditions becomes true: there are sufﬁcient matched bases along Actual data from the right end sequencing of the clone ATIEP78 in the Arabidopsis project (Lin et al., 1999). § ‡ Duplicate tags at different locations may be lost, but this will not inﬂuence No gap will actually be created in the output since only the good region on the outcome. the ﬁrst sequence will be used. 1097 H.-H.Chou and M.H.Holmes time speed. The disadvantage of this method is that it programmed to ﬁnd and remove all vector fragments does not guarantee ﬁnding the longest alignment possible that satisfy the search criterion. The quality of the base between the two sequences under the same alignment calls is usually poor at the beginning of a sequence, conditions. Since the search range is limited, when a much but gradually improves when moving into the sequence. longer bad quality region separates two otherwise good Therefore, the search criterion has to be made adaptive alignments between the two sequences, this method will to the average quality of the bases, such that in the low not be able to ﬁnd the other alignment. Normally we quality region we allow shorter vector fragments to be will not use unconnected shorter good quality regions, so identiﬁed, but in the high quality region we require a more this limitation is not critical and is rare to occur. See the deﬁnite identiﬁcation to avoid cutting good sequences Section Discussion for more explanation. short. Although there can be many different quality areas, we have found that using just three with low, medium Vector splice site trimming and high-stringency search criteria serves the purpose well LUCY requires a vector sequence file and a (see Figure 6). splice site file to conduct the splice site trimming The default range for the three search areas are 40, 60 and contaminant removal steps. The vector sequence- and 100 bp, with three different minimum alignment file contains a single long sequence of the whole lengths of 8, 12 and 16 bp. A local optimal alignment vector, which is used only in the quick contaminant within each area must be equal to or longer than these removal step discussed next. The splice site file is minimum lengths before it is considered an identiﬁed what is actually used for the splice site trimming step. It vector fragment. This means that, by default, LUCY will contains two splice site template sequences upstream and search for splice sites in the ﬁrst 200 DNA bases. If the downstream from the insertion point on the vector (see splice site is not found in the ﬁrst 200 bases or there are Figure 6). The splice site sequences are usually 100–150 some special conditions that occur, additional searches bases in length, with some bases overlapping each other beyond the default areas will also be conducted. When around the insertion point. They should also include any comparing sequences to the upstream splice site template, short linker sequences used during cloning reactions. LUCY will ﬁrst ﬁnd the longest alignment with at least Their actual lengths are not critical, as long as the splice three matched bases for every one mismatch in it. This site is totally covered. does not mean the alignment will have 25% errors in With two splice site sequences in the splice site file, it because only the highest scoring local alignment is LUCY assumes all input sequences are read in the same used. Shorter alignments to the left of the found one can direction as the splice site sequences. Usually users know be ignored. However, if there are other qualiﬁed local the exact direction of their sequencing reactions, i.e. from alignments that can be found after the best alignment, which end of the clones they are reading the data, thus LUCY will continue the process until all potential vector trimming needs to be done just along that direction. fragments have been eliminated. After the end of the However, if that is not the case and the input may consist upstream splice site has been determined, the rest of the of sequences from both forward and reverse reads of sequence will be searched for the downstream splice site clones (probably due to laboratory or sequence tracking with the highest (16 bp) stringency. This is to guard against errors), then LUCY can be instructed to do bidirectional short inserts. trimming as well. If LUCY sees both forward and reverse The three initial search areas are ﬁxed at the beginning splice site template pairs in the splice site file (i.e. of each sequence, without regard to where the good a total of four splice site sequences), it assumes that quality region begins. There may be questions as why bidirectional trimming has been prescribed. In TIGR, we they are not made to stick with the end of the low quality always run bidirectional trimming despite the fact that region (i.e. the beginning of the good quality region) so only one of the trimming directions is actually needed. the alignment criteria are more in line with the actual Trimming in the unnecessary direction might cause a few quality composition of each input sequence. Actually, one sequences to be shortened a little bit, but it guarantees that previous version of LUCY was designed exactly that way, there can be no vector fragments in the good region even but that design was dropped later in favor of the ﬁxed when the assumed read direction of some input sequences search areas for the ﬁrst search attempt. There are several is wrong. reasons for this decision. First, the average position of Because vector splice sites are usually at the beginning the splice sites is related to the speciﬁc vector being of a sequence where the quality of bases is low, a simple used and is more or less ﬁxed at the beginning of each sequence comparison that looks only for the longest sequence, meaning that it depends much less on the alignment with the upstream splice site sequence does sequencing quality. Therefore, shifting the search areas not guarantee ﬁnding all vector fragments that may have to the beginning of the good quality region often causes been obscured by base-call errors. Instead, LUCY is vector fragments to be missed for sequences that have 1098 DNA sequence quality trimming upstream template direction of sequencing downstream template splice site fragment insert splice site area1 area2 area3 search area for end of insert 40 60 100 400 - 800 8 bp min. match 12 bp min. match 16 bp min. match Fig. 6. LUCY’s vector fragment searching areas on a typical sequence. number of sequences at each length or percentage value point average good region length at each length or percentage point (a) (b) (c) default cutoff 0 0 100 200 300 400 500 100 200 300 400 500 10 20 30 40 50 percentage of vector vector fragment length in vector fragment length in the whole sequence the good region only fragments in the good regions Fig. 7. Performance of the simple tag matching method in separating contaminants and good sequences when running against different regions of a sequence. (a) When running on the whole sequence, there is no clear separation between good and contaminated sequences. (b) When running only on the good region of each sequence determined during previous steps, a separation becomes possible since most sequences have far fewer contaminant matches in their good regions. (c) An even better separation can be achieved when taking the fraction of vector matches in the good region of each sequence. The default separation cutoff is set at 20%. a longer initial low quality region. Although it can be the sequence is carried out similarly. If users wish to tell argued that missing vector fragments in the initial low LUCY that they are processing EST sequences but they quality region is acceptable because this region will be also wish to keep the poly-A/T tags for their purposes, excluded off the good region anyway, we ﬁnd that it is they can issue the keep option in combination with the still worthwhile to positively identify the position of vector poly-A/T trimming option cdna. splice sites for purposes such as estimating clone lengths, Contaminant detection sequencing quality assessments, etc. Note that LUCY may Contaminants in the input sequences can come from many conduct additional vector fragment searches under various special conditions. These additional searches are indeed sources, such as Escherichia coli or human. However, adaptive to the good quality region. We will explain in most common ones are cloning vectors themselves. A the Section Discussion some of the special conditions that vector can potentially splice with another vector, forming a trigger additional searches. vector insert. A vector can also pick up a very short insert that causes much of the sequence read to be the vector Poly-A/T tail removal itself. These events do not happen frequently, roughly If the raw DNA sequences are obtained from an at the order of one in a few thousand sequences, but EST library, some users want their poly-A/T tags they do happen. The actual frequency depends on the to be removed before clustering. LUCY does this speciﬁc vector being used for cloning and the laboratory quickly after vector trimming by searching for the ﬁrst processing. Since these are rare events, it is not worthwhile min span (10) or longer poly-T fragment within the to spend a lot of time screening for a few contaminants. ﬁrst initial search range (50) bases inside the We want a method that can quickly identify potential vector-free good region, then attempts to extend from contaminants and throw them off. We do not need to know this initial poly-T seed toward the center of the sequence, where in the contaminated sequences the contaminants allowing no more than max error (3) mismatches match. between every min span (10) consecutive T bases in Our solution is to pre-construct a tag pool from the the scan. This is therefore a linear-time and linear-space full-length vector sequence file. The tag pool is operation. The poly-A tail trimming at the other end of made of every vector tag size (10 bp) fragment of the 1099 H.-H.Chou and M.H.Holmes contaminant sequence. Tags for the reverse strand of the such as E.coli, yeast, or human. Potentially, we can in- crease the tag size used and check the sequential ordering contaminant are automatically generated and put into the of matched tags to obtain more conclusive evidence of same pool as well. The pool is then sorted and duplicate contaminant matching. This can be done similarly to the tags are removed. This is similar to the good quality region secondary sequence extension step explained previously. extension step mentioned previously, but it needs to be However, this larger contaminant screening capability done only once for the entire running session of LUCY. has not been built into LUCY yet. In TIGR, we use During screening, each input sequence is converted into another tool, the dds.btab program derived from the tags and searched against the contaminant tag pool to see if AAT package (Huang et al., 1997), to screen for larger there is any match. We count the number of matches found contaminants after LUCY processing. on a sequence and use that to determine if the sequence is a contaminant. SUMMARY OF PARAMETERS AND OTHER This method may sound too simple to be effective. SYSTEM CONSIDERATIONS Indeed, if we try it on a test set of 5022 sequences from the Arabidopsis thaliana genome project (Lin et al., LUCY’s control parameters are summarized in Figure 8. 1999) as shown in Figure 7a, it does not seem to tell us Most of these parameters have been discussed before, exactly which hits are real contaminants. Most sequences except the following: (98%) have less than 30 bp total match to the contaminant • The three pass along values deﬁne the minimum, sequence. However, beyond 30 bp, the distribution seems maximum and medium clone lengths of a particular to go uniformly all the way up to 530 bp. Two percent of library. Usually, the preparer of the library knows these the test sequences have nontrivial good regions and should values. They are critical to the fragment assembly pro- not be simply discarded. Most of them are just shorter gram in bridging contigs and constructing scaffolds, insert sequences that still contain valuable data. We need but LUCY does not interpret these values. It just passes a better way to distinguish useless sequences from useful them along with the output sequences. ones. Thanks to the sophisticated vector splice site trimming • After all kinds of checking, comparing and trimming, step above, if we run this contaminant search only within the good region of a sequence must still be longer than the good region of each sequence, the distinction becomes the minimum good sequence length (100 bp) for it to more obvious. Basically, this rules out all vector splice be considered useful to the subsequent data processing sites in the comparison and therefore reduces matches stages. We do not want our assembly program to to the contaminant found within short inserts, as seen in receive many small, trashy fragments. Figure 7b. This revised search criterion sharply pushes • If users have multiple CPUs in their computers, they sequences to the left and reveals that most (99.5%) can dramatically increase LUCY’s speed by telling it to sequences have less than 30 bp matches to the contaminant run xtra execution threads concurrently. For example, in their good region. We can improve the separation even on a Quad-CPU computer, LUCY’s execution time can further by looking at the fraction of the good region be cut to a quarter of its time spent on a single CPU of each sequence involved in contaminant matches. The computer. By default, LUCY runs just one thread. result, as shown in Figure 7c, is a clear separation of the only vector insert, which has a 47% match to the • Users can tell LUCY to use a speciﬁc set of output ﬁle contaminant in its good region. This is about one in ﬁve names, to be quiet during processing and only report thousand input sequences. The other sequences all have errors, to inform them of names of sequences that are less than 10% match in their good regions. However, some dropped due to low quality concerns or are resurrected short inserts may still be dropped because they have less by secondary sequence extension, and to produce a than the minimum good sequence length. We will explain debug ﬁle listing the computed tag values (shown in this further in the following Section Discussion. The Figure 2) for each sequence. These tag values reveal vector cutoff (20%) is used to separate contaminants how LUCY arrives at a trimming decision for each sequence. from good sequences. Readers should note that although this simple All parameters of LUCY are explained in greater detail in tag-matching method is adequate to recognize vector the document that comes with its distribution. contaminants, a much more sophisticated algorithm is LUCY does not access (nor depend on) a database needed to screen successfully for larger contaminants server. It is a design decision to separate LUCY from † any site-speciﬁc infrastructure assumption. In TIGR, we Among the 5022 sequences, 310 have less than minimum good sequence length and are not included in the statistics. use a separate program, RICKY, to drive LUCY and 1100 DNA sequence quality trimming Parameters Default values Related operation steps pass_along min_value max_value med_value 0 0 0 pass to assembly program error max_avg_error max_error_at_ends 0.025 0.02 quality area determination window window_size max_avg_error … 50 0.08 10 0.3 quality area determination bracket window_size max_avg_error 10 0.02 quality area determination range area1 area2 area3 40 60 100 vector splice site trimming alignment area1 area2 area3 8 12 16 vector splice site trimming vector vector_sequence_file splice_site_file none vector splice site trimming cdna [min_span max_error initial_search_range] none or 10 3 50 poly-A/T trimming keep none poly-A/T trimming size vector_tag_size 10 contaminant removal threshold vector_cutoff 20 contaminant removal minimum good_sequence_length 100 overall quality control xtra cpu_threads 1 overall program control output, quiet, inform_me, debug none overall program control Fig. 8. Summary of LUCY parameters and their default values. provide input/output bridging between LUCY and our LUCY makes no assumption about the order of se- quences in the three input ﬁles: the ﬁrst sequence ﬁle, database server. Since programs like RICKY are built the companion quality value ﬁle, and the optional second around a speciﬁc infrastructure that varies from institution to institution, users need to design their own solutions sequence ﬁle. As long as all necessary information can be found, DNA and quality sequences can be in different to automate the process of running base-calling software, order in the ﬁrst two input ﬁles. LUCY’s output will getting input to LUCY, converting LUCY’s output to SQL always be in the order of the ﬁrst sequence ﬁle, so a update commands, and ﬁnally uploading LUCY’s output to trivial ‘feature’ of LUCY is to sort quality sequences their databases. RICKY will not be useful outside TIGR’s with the DNA sequences. A DNA sequence without environment. LUCY, on the other hand, does not depend its companion quality sequence or vice versa will be on any other program to run. Users can operate LUCY reported as an error. The second sequence ﬁle is allowed by manually providing its input data, as is often done in to have missing sequences that appear only in the ﬁrst smaller research laboratories. sequence ﬁle. Sequences that appear only in the second LUCY reads its input data from multi-FASTA text sequence ﬁle will be ignored by LUCY. The following is ﬁles, and writes its output also to multi-FASTA ﬁles. A a typical output sequence from LUCY, where the header multi-FASTA ﬁle is simply the concatenation of many line contains this information: the sequence name, the FASTA sequences (National Center for Biotechnology three pass along values of the minimum, maximum and Information, 2001), like those generated directly by phred. medium clone lengths of the library, and the left and right Each DNA sequence in a multi-FASTA ﬁle is delimited trimming positions determined by LUCY. by a header line that begins with the greater-than symbol ‘>’ followed by the sequence name, and may include >GCCAA03TF 1500 3000 2000 43 490 AGCCAAGTTTGCAGCCCTGCAGGTCGACTCTAGAGGATCCCCAGGATGATCAGCCACATT other information about the sequence. LUCY records GGGACTGAGACACGGCCCAAACTCCTACGGGAGGCAGCAGTGGGGAATCTTGCGCAATGG only the sequence name information which is terminated GCGAAAGCCTGACGCAGCCATGCCGCGTGAATGATGAAGGTCTTAGGATTGTAAAATTCT TTCACCGGGGACGATAATGACGGTACCCGGAGAAGAAGCCCCGGCTAACTTCGTGCCAGC by the ﬁrst space character that appears on the header ... line. The header line is followed by several more lines of data making up the actual DNA sequence, which At TIGR, we run LUCY on a Sun workstation running can be in either upper or lower case letters. LUCY the Solaris operating system (Version 5.5.1). LUCY has accepts the standard IUB/IUPAC DNA codes including also been compiled and run successfully under the Linux the ambiguous letters (e.g. N for unknown bases). For each operating system on a PC. Almost all ANSI C compilers base in a DNA sequence, there must be a corresponding should be able to compile LUCY source code since it is number denoting its quality value in the companion programmed using only standard C libraries and header quality sequence. Quality sequences share the same header ﬁles. LUCY has not been run on the MacOS or Windows line format but are made up of numbers separated by by us, but we believe porting it to these two platforms spaces. Blank lines are not allowed in the input ﬁles and should not be too difﬁcult since all source code are all lines must be shorter than 256 characters. Each input included in its distribution. It is very likely that LUCY can sequence is terminated by either the beginning of another run without any signiﬁcant modiﬁcation inside a Windows header line or by the end of the input ﬁle. command shell. 1101 H.-H.Chou and M.H.Holmes LUCY is a CPU-bound program, not a memory-bound software for genomic data processing. LUCY’s simple program. LUCY’s static memory requirement grows very interface requirements allows it to be easily incorporated slowly with the number of input sequences, at roughly into any existing pipeline. LUCY is written in ANSI C 60 bytes plus the sequence name storage for each input code so it may improve certain processing steps when sequence. This is because LUCY does not store any compared to Perl or script based solutions. Additionally, sequence data in memory after processing them. To save since LUCY is independent of any database infrastructure, memory, LUCY uses direct ﬁle addressing to access each it can be used automatically or manually, by both large and sequence when it is needed, and stores only those pointers small research laboratories. in the memory. Therefore, by all practical considerations LUCY can handle any number of input sequences. The DISCUSSION OF SPECIAL CASES dynamic memory requirement of LUCY is proportional to We have given separate descriptions of LUCY’s major the longest sequence in the input data, not the number processing steps. However, the actual processing within of input sequences, but its actual size varies from one LUCY is far more complex than what has been explained. processing step to the next. Often times it is the joint result from these processing steps The slowest step in LUCY is the vector trimming step that is used to make the ﬁnal decision about the fate of a which employs a quadratic time dynamic programming sequence. Trimming vector splice sites in the low quality algorithm to ﬁnd vector fragments in each sequence. To region of sequences proves to be a very difﬁcult problem. speed up LUCY’s processing, multi-threading capability Since version 1.03 of LUCY (now at 1.16), all new has been built into LUCY as introduced above. On a Linux improvements were made possible because people using PC with four CPUs running at 500 Mhz, it takes LUCY just LUCY discovered abnormal trimming results with their 1 min and 12 s to process 5022 raw input sequences. When eyes and provided them to us for inspection. We found that using one CPU, LUCY’s processing time will be roughly solutions for many of these special cases can contradict four times that. This performance is considered acceptable each other, meaning that making LUCY handle one special to most applications. Also related to vector trimming, it case may introduce new problems in sequences that had is important that users do not turn on phred’s quality been handled correctly before. We present some examples trimming function when doing base-calling. This often in this Section Discussion to illustrate the difﬁculty of cuts the resulting sequences short and prevents LUCY the problem. This presentation is not meant to include all from seeing the vector fragments. Leaving the trimming special cases that can happen during LUCY processing. to LUCY will give it the information it needs to do its job To start with, LUCY always looks at the initial search well. areas for any vector fragment (Figure 9a). Under normal We cannot provide any generally applicable statistics conditions, there are some low quality bases (20–50 bp) of the trimming results of LUCY because these certainly at the beginning of a sequence. Vector fragments are well depend a lot on input data quality, contaminant contents within the initial search areas and can be identiﬁed even and LUCY’s parameter settings. We can, however, provide within the low quality region. The rest of the sequence is some information speciﬁc to the 5022 test sequences then searched against the downstream splice site sequence we mentioned earlier that are obtained from TIGR’s to guard against short inserts. Usually the clone is much Arabidopsis genome project (Lin et al., 1999). With longer (2 kbp) than the sequencing machine can read default LUCY parameters, 310 of the input sequences at once (<900 bp), so LUCY will not ﬁnd any vector are discarded because their good regions are shorter than fragment at the other end of a sequence read. Most input the minimum length of 100 bp after trimming. Of the sequences (well over 90%) fall into this case. remaining 4712 useful sequences, their total length is Sometimes, a low quality region can separate two good 4148 326 bases, and their total good region length is regions, as seen in Figure 9b. This may be the result of 2072 533 bases. Therefore, about 50% of the raw data incidents such as power interruption caused by lightning are trimmed away by LUCY. The resulting good regions during sequencing runs . In this case, LUCY ﬁnds the have a base-call conﬁdence level of above 97.5%, and longest high quality region and uses that as the good re- contain no vector fragments or contaminants. Among the gion. Although the other good regions, as long as they are 5022 input sequences, only one is suspected to be a vector longer than minimum good sequence length, could still be insert. used separately, this would require LUCY to generate new Although many major genome sequencing centers have sequence names that could cause some database tracking already developed their own automatic data processing nightmares (e.g. what subsequences come from which pipelines to perform some of the tasks LUCY is designed original sequence, etc). The extremely low frequency of to do (Veklerov et al., 1996; Smith et al., 1997; Dear et al., 1998; Wendl et al., 1998), we believe LUCY can still † This is no joke; it actually happened a few times in TIGR during be a useful addition to the freely available bioinformatic thunderstorm seasons. 1102 DNA sequence quality trimming low quality region high quality region vector fragments (a) good region (b) good region (c) good region (d) good region (e) good region (f) i good region (g) i good region (h) good region (i) good region initial search areas secondary and end of insert search area 200 400 - 800 Fig. 9. Examples of special cases of sequence cleaning that LUCY has been designed to handle. Although we plot each example using the same length sequence for esthetic reasons, it should be understood that each sequence can be of variable length and can contain more than one problematic scenario. See the main text for explanation of each case. this occurrence, and the nature of the shotgun fragment sequence inside the high quality region, but no other assembly principle, make the effort to salvage the shorter vector fragments are found near it to conﬁrm the match? good regions in a sequence unnecessary. Should LUCY treat it as a positive identiﬁcation of a vector Vector fragments may not always be in the low quality fragment and trim the sequence accordingly, or should region. Sometimes, they run into the good quality region LUCY consider it a match by random chance that can as well. The other end of the vector splice site may also be ignored? There is no absolute correct answer to this. be found towards the end of a sequence, as shown in Therefore, LUCY employs a heuristic judgment: if the Figure 9c. If the sequence is a short insert, the length of distance between the vector match and the end of the the good region may be much shorter than the length of the search areas is shorter than the width of the vector match sequence, as exempliﬁed in Figure 9d. Normally, LUCY itself, i.e. i < j in Figure 9f, then LUCY declares it a will drop a sequence whose good region is shorter than vector fragment and trims it. Otherwise, LUCY rejects it minimum good sequence length to prevent useless data as a random match. Additionally, if the vector match is from clogging the subsequent data processing stages. equal or more than half the total length of the good region Sometimes, the low quality region runs longer than the ( j > 1/2 length of good region), then LUCY will trim it as initial search areas. In this case, as shown in Figure 9e, well. Since the search areas are 200 bp long, it basically LUCY has to make a second attempt to search for vector means that LUCY will trim all singular vector matches fragments beyond the initial search areas, in order to ﬁnd longer than the minimum alignment length in each area, any additional vector fragments inside the high quality but will ignore most short vector fragment matches in the region. The reason for doing this is because some of the middle of a long high quality region. low quality bases may mask the true identity of vector The heuristic above means that even if LUCY does throw fragments. In order to guarantee that no vector fragment is out some false-positive vector fragments, they will all be included in the reported good region, LUCY must search limited to the ﬁrst 200 bases of a sequence, and LUCY part of the high quality region beyond the initial search needs much more concrete evidence before it will accept areas as well. This is the second adaptive vector splice site a vector fragment match inside the high quality region. search we mentioned previously. In addition to these, the heuristic has to be augmented If LUCY does not ﬁnd any vector fragment within the for additional special cases where the low quality region initial search areas, it may be the case that the sequencing extends much into the sequence and goes beyond the PCR reaction started well before the splice site. To guard initial search areas, as shown in Figure 9g. In this new against this, LUCY needs to conduct a second search case, the boundary of heuristic judgement will be shifted attempt beyond the initial search areas as well. However, to the right end of the low quality region. The reason for this may create another problem, as shown in Figure 9f. this is similar to case 9e, to prevent any vector fragment What if LUCY ﬁnds a short match to the splice site from getting into the good region selected by LUCY. 1103 H.-H.Chou and M.H.Holmes Another special case is when LUCY ﬁnds a vector This allows the wisdom and experience of biologists to be fragment that extends right up to the end of the initial slowly translated into functional program code. search areas, as seen in Figure 9h. Here, the end of the Further improvements to LUCY may prove difﬁcult, as vector fragment may be the actual boundary of it, or it LUCY now strikes a delicate balance among the various may simply be an artiﬁcially created boundary because of strategies for dealing with real-world problems that have the fact that LUCY searches for vectors in the ﬁrst 200 been encountered at TIGR. If LUCY is able to do a better bases ﬁrst. To conﬁrm the actual ending of the vector job today than its previous versions, it is all because we fragment, LUCY has to conduct the second search as have a few hardworking biologists who actually talk to well . Finally, even when splice site trimming is done computer scientists. normally and there is a good region inside a sequence, if the good region length is less than the minimum ACKNOWLEDGEMENTS good sequence length after the length of matched vector We wish to thank Dr Granger Sutton and Dr Tony Kerlavage fragments are subtracted from it, LUCY will still reject the for leading us into this difﬁcult but intriguing problem. We sequence in fear of introducing contaminants into the data would also like to thank Ms Anna Glodek, Mr John Scott processing pipeline. Matched vector fragments, as shown and Mr Terrance Shea for providing us their valuable in Figure 9i, are found during the quick contaminant suggestions when we were developing LUCY. Finally, removal step. we thank Dr Steven Salzberg and Dr Claire Frasier for Although these special cases are also discussed sep- generously allowing us to make this program freely arately, it is not necessary that they occur separately. available to the academic community. More often for low quality sequences, they actually occur simultaneously. For example, a sequence can have a REFERENCES short insert, a long initial low quality region, another low Dear,S., Durbin,R., Hillier,L., Marth,G., Thierry-Mieg,J. and quality region separating two good quality regions, and a Mott,R. (1998) Sequence assembly with CAFTOOLS. Genome single vector fragment match inside its good region with Res., 8, 260–267. no other vector fragment found in the initial search areas Ewing,B. and Green,P. (1998) Base-calling of automated sequencer due to numerous base-calling errors. LUCY has been traces using phred. II. Error probabilities. Genome Res., 8, 186– made to handle most, if not all, possible combinations of special situations together. Ewing,B., Hillier,L., Wendl,M.C. and Green,P. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assess- ment. Genome Res., 8, 175–185. CONCLUSION Huang,X., Adams,M., Zhou,H. and Kerlavage,A. (1997) A tool for Although there are many DNA sequence comparison analyzing and annotating genomic sequences. Genomics, 46,37– and analysis algorithms in the bioinformatics literature, our experience has been that to make them function Lin,X. et al. (1999) Sequence and analysis of chromosome 2 of the correctly inside a program that processes real-world data, plant Arabidopsis thaliana. Nature, 402, 761–768. the programmers must make an extra effort to consider National Center for Biotechnology Information (2001) FASTA ﬁle special cases reported by users. This is especially true format speciﬁcation. Available on the Internet. http://www.ncbi. nlm.nih.gov/blast/html/search.html when the amount of data is large and the quality of data Paracel (2000) TRACETUNER, capturing the most information can be low, as in the sequence cleaning stage. We believe from the latest DNA sequencing systems. Information on the that this is not just an isolated situation for LUCY.Itis, Internet. http://www.paracel.com/html/tracetuner.html indeed, a common phenomenon that arises during many Smith,T.M., Abajian,C. and Hood,L. (1997) HOPPER: software for bioinformatic software development cycles. Due to the automating data tracking and ﬂow in DNA sequencing. Comput. inherently unpredictable nature of biological data, there is Appl. Biosci. (CABIO), 13, 175–182. always some distance between the theorectical design of a Veklerov,E., Eeckman,F. and Martin,C. (1996) MTT: a software bioinformatic solution and the successful implementation tool for quality control in sequence assembly. Microb. Comp. of the solution in a working program that can handle Genomics, 1. real-world data reliably. The only way to shorten such Wendl,M., Dear,S., Hodgson,D. and Hillier,L. (1998) Automated distance to perfection, in our opinion, is to form a close sequence preprocessing in a large-scale sequencing environment. Genome Res., 8, 975–984. collaboration between computer scientists and biologists. This used to be a bug in previous versions of LUCY.

Journal

Bioinformatics – Oxford University Press

Published: Dec 1, 2001

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

DNA sequence quality trimming and vector removal

DNA sequence quality trimming and vector removal

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

DNA sequence quality trimming and vector removal

DNA sequence quality trimming and vector removal

References (8)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies