Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power

Yuhong Zhang; Sanchit Misra; Ankit Agrawal; Md Patwary; Wei-keng Liao; Zhiguang Qin; Alok Choudhary

doi:10.1186/1471-2105-13-S5-S3

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power

Zhang, Yuhong; Misra, Sanchit; Agrawal, Ankit; Patwary, Md; Liao, Wei-keng; Qin, Zhiguang; Choudhary, Alok 2012-04-12 00:00:00 Background: Pairwise statistical significance has been recognized to be able to accurately identify related sequences, which is a very important cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability. Results: We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm’s data access characteristics, we developed a tile-based scheme that can produce a contiguous data access in the GPU global memory and sustain a large number of threads to achieve a high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual-GPUs. We observe end-to-end speedups of nearly 250 (370) × using single-GPU Tesla C2050 GPU (dual-Tesla C2050) over the CPU implementation using Intel Core™i7 CPU 920 processor. Conclusions: Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment. Background applications have been developed based on pairwise Introduction sequence alignment, such as BLAST [3], PSI-BLAST The past decades have witnessed dramatically increasing [4-6], and FASTA [7]. PSA produces a score for an trends in the quantity and variety of publicly available alignment as a measure of the similarity between two genomic and proteomic sequence data. Dealing with the sequences. Generally, the higher the score, the more massivedataand making senseofthemarebigchal- related the sequences. However, the alignment score lenges in bioinformatics [1,2]. One of the most widely depends on various factors such as alignment methods, scoring schemes, sequence lengths, and sequence com- used procedures for extracting information from proteo- mic and genomic datais pairwisesequencealignment positions [8,9]. Judging the relationship between two (PSA). Given two sequences, PSA finds the extent of sequences solely based on the scores can often lead to similarity between them. Many bioinformatics wrong conclusion. Therefore, it is more appropriate to measure the quality of PSA using the statistical signifi- cance of the score rather than the score itself [10,11]. * Correspondence: [email protected] School of Computer Science and Engineering, University of Electronic Statistical significance of sequence alignment scores is Science and Technology of China, Chengdu, China very important to know whether an observed sequence Full list of author information is available at the end of the article © 2012 Zhang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 2 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 similarity could imply a functional or evolutionary link, strongly motivates the use of GPUs to accelerate the or is achanceevent [8,12].Accurateestimationofsta- PSSE. Acceleration of PSSE using MPI [35] and FPGA tistical significance of gapped sequence alignment has [36] has been explored earlier, and in this work, we attracted a lot of research in recent years [13-26]. design a GPU implementation for the same. Compared to [35,36], we consider the estimation of multi-pair PSS Pairwise statistical significance along with single-pair PSS using GPUs. Consider a pair of sequences s and s of lengths m and 1 2 n, respectively, the scoring scheme, SC (substitution Contributions matrix, gap opening penalty, gap extension penalty), and We present a GPU implementation to accelerate pair- the number of permutations N of s , pairwise statistical wise statistical significance estimation of local sequence significance (PSS) of the two sequences is calculated by alignment using standard substitution matrices. Our the following function [21], which is described below: parallel implementation makes use of CUDA (Compute Unified Device Architecture) parallel programming para- PSS(s , s , m, n, SC, N), 1 2 digm. Our design uses an efficient data reorganization method to produce coalesced global memory access, Through permuting s N times randomly, the function and a tiled-based memory scheme to increase the GPU generates N scores by aligning s1 against each of the N occupancy, a key measure for GPU efficiency [37]. permuted sequences and then fits these scores to an Through careful analysis of the data dependency and extreme value distribution (EVD) [8,27,28] using cen- access patterns of PSSE, we reorganize the data sored maximum likelihood [29]. The returned value is sequences into aligned arrays that coalesce the global the pairwise statistical significance of s and s .The 1 2 memory access pattern. Such data access contiguity EVD describes an approximate distribution of optimal keeps the GPU cores occupied with computation and scores of a gapless alignment [4]. Based on this distribu- allows the thread scheduler to overlap the global mem- tion, the probability (i.e., P-value) of observing an ory access with the computation. In addition to the abil- sequence pair with a score S greater than x,can be ity to calculate the optimal tile size for data to be given by: shipped to the GPU, our design can also issue a large −λx −E(x) enough number of threads to maximize the occupancy. (1) P(S > x) ≈ 1 − exp(−Kmne )= 1 − e We further extend the parallelization technique to where l and K are calculational constants and E(x), estimate pairwise statistical significance using position- also known as E-value, is the expected number of dis- specific substitution matrices, which has earlier demon- tinct local alignments with score values of at least x. strated significantly better sequence comparison accu- Note that the above distribution is for a gapless align- racy than using standard substitution matrices [11]. The ment. For the cases of gapped alignment, although no implementation is also extended to take advantage of asymptotic score distribution has yet been established dual-GPUs to accelerate those computations. As a analytically, computational experiments strongly indicate result, maximum performance could be obtained by har- these scores still roughly follow Gumbel law [8,13,30]. vesting the power of many-core GPUs. In addition to not needing a database to estimate the The performance evaluation was carried out on NVI- statistical significance of an alignment, pairwise statisti- DIA Telsa C2050 GPU. For multi-pair PSSE implemen- cal significance is shown to be more accurate than data- tation, we observe nearly 250 (370)× speedups using a base statistical significance reported by popular database single-GPU Tesla C2050 GPU (dual-Tesla C2050) over search programs like BLAST, PSI-BLAST, and the CPU implementation using an Intel Core ™i7 CPU SSEARCH [21]. However, it involves thousands of such 920 processor. The proposed optimizations and efficient permutations and alignments, which are enormously framework for PSSE, which have direct applicability to time consuming and can be impractical for estimating speedup homology detection, are also applicable to pairwise statistical significance of a large number of many other sequence comparison based applications, sequence pairs. Hence, use of high-performance com- such as DNA sequence mapping, phylogenetic tree con- puting (HPC) techniques (such as multi-cores CPU, struction and database search especially in the era of many-core graphics processing units (GPUs), FPGAs, next-generation sequencing. etc.) is highly conducive to accelerate the computation of PSSE. Moreover, large data sets demand more com- Methods puting power. In many demanding bioinformatics appli- In this section we present GPU implementations both cations, such as sequence alignment [31,32], and protein for single-pair PSSE and multi-pair PSSE. Along with sequence database search [33,34], many-core GPU has the methodological details, we also discuss several per- demonstrated its extreme computing power. This formance optimization techniques used in this paper, Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 3 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 such as, GPU memory access optimization, occupancy of s to s , in this layout the gap between the memory 2 1 maximization, and substitution matrix customization. accesses by the neighboring threads is at least the length These techniques have significantly speeded up PSSE. of the sequence. For example, in the intuitive layout, if Careful analysis of the data pipelines of PSSE shows thread T accesses the first residue (i.e., ‘R’), and thread that the computation of PSSE can be decomposed into T accesses the first residue (i.e., ‘E’), the gap between three computation kernels: Permutation, Alignment and the access data is n. This results in non-coalesced mem- Fitting. Permutation and Alignment comprise the over- ory reads (i.e., serialized reads), which significantly dete- whelming majority (a more than 99.8%) of the overall riorates the performance. execution time [35]. Therefore, efforts should be spent We therefore reorganize the layout of sequence data to optimize these two kernels to achieve high perfor- in memory to obtain coalesced reads. Now, to achieve mance. Also, we observe that permutation presents high coalesced access, we reorganize layout of sequences in degrees of data independency that are naturally suitable memory as aligned structure of arrays, as shown in Fig- for single-instruction, multiple-thread (SIMT) architec- ure 1 (b) In the optimized layout, the characters (in tures [38] and therefore, can be mapped very well to granularity of 4 bytes) that lie at the same index in dif- task parallelism models of GPU. Moreover, even though ferent permuted sequences stay at neighboring positions. the alignment task suffers from data dependency, we Then if the first uchar4 of the first permuted sequence show that with clever optimizations, it can be heavily (i.e. ‘REGN’) is requested by thread T ,the first uchar4 accelerated using GPUs. of the second permuted sequence (i.e. ‘ARNE’)is requested by T , and so on. This results in reading a Design consecutive memory (each thread reads 4 bytes) by a GPU memory access optimization warp of threads in a single access. Thus the global It is especially important to optimize global memory memory access is coalesced, and therefore high perfor- access as its bandwidth is low and its latency is hun- mance is achieved. dreds of clock cycles [38]. Moreover, global memory As the sequences remain unchanged during the align- coalescing is the most critical optimization for GPU ment, they can be thought of as read-only data, which programming [39]. Since the kernels of PSSE usually can be bound to texture memory. For read patterns, tex- work over large numbers of sequences that reside in the ture memory fetches is a better alternative to global global memory, the performance is highly dependent on memory reads because of texture memory cache, which hiding memory latency. When a GPU kernel is accessing can further improve the performance. global memory, all threads in groups of 32 (i.e. warp) Occupancy maximization access a bank of memory at one time. A batch of mem- Hiding global memory latency is very important to ory accesses is considered coalesced when the data achieve high performance on the GPU. This can be requested by a warp of threads are located in contiguous done by creating enough threads to keep the CUDA memory addresses. For example, if the data requested by cores always occupied while many other threads are threads within a warp are located in 32 consecutive waiting for global memory accesses [39]. GPU occu- th memory addresses (such that the i address is accessed pancy, as defined below, is a metric to determine how th by the i thread), the memory can be read in a single effectively the hardware is kept busy: access. Hence, this memory access operation runs 32 Occupancy =(B × T )/T (2) num max times faster. If the memory access is not coalesced, it is divided into multiple reads and hence serialized [37]. where T is maximum number of resident threads max After permutation, if the sequence s and its N per- that can be launched on a streaming multiprocessor muted copies were stored contiguously one after (SM) (which is a constant for a specific GPU), T is num another in the global memory, the intuitive memory lay- the number of active threads per block and B is the out would be as shown in Figure 1 (a) Note that, we number of active blocks per Streaming Multiprocessor need one byte (uchar) to store each amino acid residue. (SM). Moreover, GPU can read four-byte (packed as a CUDA B also depends upon the GPU physical limitations (e. built-in vector data type uchar4) of data from the global g. the amount of registers, shared memory and threads memory to registers in one instruction. To achieve high supported in each model). It can be given in the follow- parallelism of global memory access, uchar4 is used to ing way: store the permuted sequences. Dummy amino acid sym- bols are padded in the end to make the length of B = min(B , B , B , B ) (3) user reg shr hw sequences a multiple of 4. where B is the hardware limit (only 8 blocks are Considering inter-task parallelism, where each thread hw allowed per SM), and B , B , are the potential blocks works on the alignment of one of the permuted copies reg shr Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 4 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 Figure 1 Two GPU memory layout strategies. (a) The intuitive layout that one sequence is appended after another. (b) The reorganized layout such that the uchar4 at the same indices in different sequences stay at neighboring positions. determined by the available registers, shared memory, � To achieve 100% occupancy, (B × T )/T ≥ hw num max respectively and B is blocks set by the user. Note that 1. Hence, setting T such that T ≥ T /B is user num num max hw B ≤ B , therefore we obtain: preferable. hw Substitution matrix customization Occupancy ≤ (B × T )/T (4) hw num max The Smith-Waterman (SW) algorithm [40] looks up the substitution score matrix very frequently while comput- Based on the above analysis, higher occupancy can be ing the alignment scores. Lowering the number of look- pursued according to the following rules: ups obviously reduces the overall execution time. As suggested by [41], performance improvement could � To avoid wasting computation on under-populated furtherbeobtainedbycustomizing thematrix. Usually, warps, the number of threads per block should be the substitution score matrix is indexed by query- chosen as a multiple of the warp size (currently, 32). sequence symbol and the subject-sequence symbol as in � Total number of active threads per SM should be BLOSUM62. Another way is to index the substitution close to maximum number of resident threads per scorematrixbyquery-sequenceposition and the sub- multiprocessor (T ). In short, we should use all max ject-sequence symbol. We refer to this customized the available threads if possible. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 5 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 substitution matrix as local query profile. For example, L|. This increases the memory requirement compared to let us consider a query sequence Q of length L over a the original layout, but reduces the lookups of the sub- set of residue alphabet ω. For each residue, we store a stitution matrix significantly as explained below. substitution score for every query sequence position. For In traditional matrix layout, if the substitution scores example, in Figure 2 (a) the substitution scores for of a subject sequence residue (i.e., ‘A’)againstquery matching residue ‘A’ with each symbol of the query sequence residues (i.e., ‘A’,’K’,’L’,’G’) are required subse- sequence Q are stored in the first row, and the substitu- quently, GPU has to look up the matrix four times one tion scores for matching residue ‘C’ are stored in the by one, because the score S[A][A], S[A][K], S[A][L], S[A] [G] are usually stored far from each other. In contrast, if next row, and so on. Here the substitution score of the residue of subject sequence against the same symbol of we use CUDA built-in vector data type int4 (which query sequence at different position is always same. packs 4 integers together) to store the customized sub- By contrast, position-specific score matrix (PSSM) [4] stitution matrix in the texture memory, the same four offers a variation of this approach. We call this a posi- scores are stored at neighboring memory locations (as tion specific query profile in which scores are further shown in Figure 2 (b). By reading an int4,the GPUcan refined. In this case the same residue (e.g. ‘A’) appearing get these four scores S[A][i] in one instruction, where i at different position of query sequence has different is the position of residues in query sequence. This scores, as shown in Figure 2 (b). For a given query reduces the number of lookups by a factor of 4, there- sequence, PSSM can be pre-constructed by PSI-BLAST fore, a higher performance can be obtained. [3,4]. In essence, both coalesced memory layout and custo- Using the customized approach, in both cases, the mized substitution matrix optimization heighten perfor- dimension |ω|× |ω| of the substitution matrix is mance of memory access by improving data locality replaced by a query-specific matrix of dimension |ω|×| among threads in GPU programming. Figure 2 The customized substitution matrix. (a) Local query profile: the same symbol (e.g., ‘A’) at different position in query has same scores. (b) Position specific query profile: the same symbol (e.g., ‘A’) at different position in query has different scores. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 6 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 Single-pair PSSE implementation improving the performance byanalyzing thesingle-pair As mentioned previously, to simulate the required ran- PSSE implementation (which we do in a later section). dom sequences, a lot of random numbers are needed, Owing to a dramatic increase of data set for multiple- since each s copy has to be permuted thousands of pairs implementation, there are some significant differ- times. Although the most widely used pseudorandom ences relative to performance of GPU hardware between number generators such as linear congruential genera- the two implementations. Through analyzing our experi- tors (LCGs) can meet our requirements, the current ver- mental results of the single-pair PSSE, we observe that sion of CUDA does not support calls to the host inter-task parallelism performs better than intra-task random function. Hence, we develop an efficient ran- parallelism (results shown later). Hence, we consider dom number generator similar to lrand48() on GPUs. inter-task parallelism in computing multi-pair PSSE. The single-pair PSSE processes only one pair of query Based on the guidelines for optimizing memory and and subject sequences. The idea of computing single- occupancy described earlier, we compare three imple- pair PSSE is as follows. Given the query sequence s and mentation strategies. the subject sequence s ,tocompute PSSE,thousands Algorithm 1: Pseudo-code of single-pair PSSE (say N) of randomly permutations of s are needed. To Input:(s , s ) - Sequence-pair; M - Substitution 2 1 2 obtain these N random sequences of s ,first,aset of N matrix; G - Gap opening penalty; GE -Gap extension random numbers is generated in CPU as described in penalty; N - Number of permutes; previous section. These numbers are then transferred to Output: pss- Pairwise statistical significance GPU and considered as seeds by the threads of GPU. Each thread then generates new random numbers using 1. Initialization its own seed and swap the symbols of s accordingly to (a) Generate a number N of random numbers in obtain a permuted sequence. Thus the N random per- CPU; mutations of s are obtained in parallel. The algorithm (b) Copy LCG seeds to GPU global memory; then uses SW algorithm to compute alignment score of (c) Copy s and s to GPU global memory; 1 2 s and the N permuted copies of s in parallel on GPU. 2. Generate random numbers (RNs) using LCG in 1 2 The scores are then transferred to CPU for fitting. The GPU details of computing single-pair PSSE is outlined by 3. Permute sequence s N times using the RNs Algorithm1.NotethatStep4in Algorithm1usesthe 4. Reorganize sequences s and its N times permuted optimized memory layout, explained previously in detail, copies as shown in Figure 2 (b) if using inter-task for s and its N permuted copies. parallel Smith-Waterman(); Smith-Waterman is a dynamic programming algo- 5. Align s and its N permuted copies against s 2 1 rithm to identify the optimal local alignment between a using Smith-Waterman() (inter-task or intra-task); pair of sequences. In general, there are two different 6. Transfer N alignment scores from GPU into CPU; methods for parallelizing the alignment task [33]. The 7. Fitting in CPU first method is regarded as inter-task parallelism. In this (a) (K, l) ¬ EV DCensordMLFit(Scores); lx case, each thread performs alignment of one pair of (b) pss ¬ 1- exp(-Kmne - ); sequences. Hence, in a thread block, multiple alignment return pss tasks are performed in parallel [42]. The second one is Intuitive strategy intra-task parallelism. Here, alignment of each pair of Given Q query sequences and S subject sequences, the sequences is assigned to a block of threads, splitting the intuitivestrategyistosimplyperform thesame ‘single- whole task into a number of sub-tasks. Each thread in pair’ procedure Q × S times. In other words, in each the thread block then performs its own sub-tasks, coop- iteration, we send a single pair of query and subject erating to exploit the parallel characteristics of cells in sequences to the GPU. The GPU processes that pair the anti-diagonals of the local alignment matrix [43]. and returns the result to the CPU. The same procedure We use a similar alignment kernel as proposed in [34] is repeated for all query and subject sequence pairs. with some modifications for further improvement. It is However, this strategy suffers from low occupancy. We worthwhile to mention that Step 7 describes fitting, analyze the cause along with its performance results in which is implemented on the CPU. This is because it the next section. involves recursion that is not supported very well on Data reuse strategy GPU. In the first strategy, the subject sequences from the database are permuted for every query-subject sequence Multi-pair PSSE implementation pair. Hence, the same subject sequence is permuted The multi-pair PSSE processes multiple queries and every time it is sent to the GPU. A better strategy is to subject sequences. We can obtain some hints about create permutations of each subject sequence only once Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 7 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 and reuse them to align with all the queries. “One per- mutation, all queries” is the idea of the second strategy. Because of the reuse of permuted sequences, higher per- formance is expected than the first strategy. However, the occupancy of GPU, to be shown in the next section, is still not elevated. Adaptively tile-based strategy The low occupancy of the above two strategies is due to the underutilized computing power of GPU. In addition, these two strategies do not work well when the size of subject sequence database becomes too big to be fitted into GPU global memory. For instance, if the size of the original subject sequence database is 5 MB, it becomes 5000 MB when each of the sequences is permuted, say, 1000 times. This prohibits transfer of all the subject sequences to GPU at the same time. We therefore need an optimal number of subject sequences to be shipped Figure 3 Adaptive tile-based strategy. The optimal tile size T can to GPU keeping in mind that the subject sequences and be calculated according to the hardware configuration of GPU. After calculated, the T subject sequences together are transferred to their permuted copies fit in global memory. Moreover, GPU global memory. Permutations and alignments are done in the number of new generated sequences should be parallel in GPU. Then T × 1000 alignment scores are moved back to enough to keep all CUDA cores busy, i.e., keep a high CPU for T fittings. occupancy of GPU, which is very important to harness the GPU power. Herein we develop a memory tiling technique that is Results and discussion self-tuning based on the hardware configuration and can All our experiments have been are carried out using achieve a close-to-optimal performance. The idea behind © Intel Core™i7 CPU 920 processors running @2.67 the technique is as follows. In out-of-core fashion, the GHz. The system has 4 cores, 4 GB of memory, dual data in the main memory is divided into smaller chunks Tesla C2050 GPU (each with 448 CUDA cores) and is called tiles and transferred to the GPU global memory. In running a 64-bit Linux-based operating system. Our our case, the tiles are the number of subject sequences to program has been compiled using gcc 4.4.1 and CUDA be transferred to the GPU at a time. The tile size T can 4.0. The sequence data used in this work comprises of a be calculated using the following equation: non-redundant subset of the CATH 2.3 database [44]. This dataset consists of 2771 domain sequences as our SM × T num max (5) T = subject library and includes 86 CATH queries as our query set.WederivePSSMsfor the86testqueries where SM is the total number of SMs in GPU, num against the non-redundant protein database using PSI- T is the maximum number of resident threads per BLAST (provided along with the BLAST+ package [3]) max SM, and N is the number of permutations. over a maximum of five iterations and with other In Tesla C2050 used in our experiments, there are 14 default parameters. BLOSUM62 and PSSMs have been SMs and the maximum number of resident threads per used as the scoring matrices with affine gap penalty of SM is 1024. Let N = 1000, then, T = ⌊14 × 1024/1000⌋ 10 + 2k for a gap of length k. We permute the 2771 = 14, which means that there are 14 distinct subject subject sequences N = 1000 time as in several previous sequences to be transferred to the GPU’s global memory studies [21,35,36,45]. at a time. Based on the second strategy (i.e., data reuse), 14 subject sequences and their permuted copies are Single-pair PSSE result analysis aligned against one query sequence at a time, until all In single-pair PSSE implementation, we use both the the query sequences are processed. As a result, 14 × intra-task and inter-task parallelism methods. We 1000 alignment scores in total are obtained in each choose four pairs of query and subject sequences of round, which are subsequently transferred to CPU for length 200, 400, 800, and 1600 from CATH 2.3 data- fitting. The CPU takes the 1000 alignment scores for base. We compute PSSE for all these pairs using 64, each subject-query sequence pair and uses them to 128, 256, and 512 threads per block. The experimental compute the corresponding pss. The tile-based strategy results have been plotted in Figure 4. All speedups are has been described in Figure 3. computed over the corresponding the CPU Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 8 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 Figure 4 Intra-task and inter-task parallelism for single-pair PSSE. We choose four pairs of query and subject sequences of length 200, 400, 800, and 1600 from CATH database. implementation. We observe that the inter-task parallel last anti-diagonal. Only the cells of the alignment matrix implementation performs significantly better than the belonging to the same anti-diagonal can be computed in intra-task parallel implementation. Employing further parallel. In this case, most threads in a block have no optimizations and newer version of CUDA, both meth- work to do. As a result, the performance of the intra- ods show a higher speedup compared to our previous task parallel implementation is worse than that of the inter-task parallel implementation even though it has results [31]. For intra-task parallel implementation, the best speed- higher occupancy. ups for sequences of length 200, 400, and 800 are In contrast, inter-task parallelism would have a total of 24.57×, 35.09×, and 43.63×, respectively. All these are 1000 threads (one for each alignment task). If each obtained using 128 threads per block. But the best block contains 64 threads, then the total number of speedup for sequences of length 1600 is 46.34× and blocks is B = ⌈1000/64⌉ = 16. The assignment of 16 used 256 threads per block. These results tell us that it blocks to 14 available SMs will result in 12 SMs with is hard to find a general rule to parameterize the set- one active block and two SMs with two blocks. There- tings to achieve peak performance. A possible reason is fore,the occupancyfor thetwo SMs withtwo blocks is that many factors, such as the number of threads per (2 × 64)/1024 = 12.5%. For the 12 SMs with only one block, the available registers, and shared memory, may block its occupancy is (1 × 64)/1024 = 6.25%. Because contradict with each other. of the low occupancy, there are not sufficient threads to The intra-task parallel implementation creates enough keep CUDA cores busy when the global memory is thread blocks (one block for each alignment task) to accessed. As a result, the latency-hiding capabilities of keep the occupancy high. However, the SW alignment this method are limited. As the number of threads per matrix must be serially computed from the first to the block T increases, the total active blocks (B) num Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 9 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 decreases and some SMs even have no blocks assigned. pair PSSE implementation using the inter-task paralle- For example, if T =512,then B = ⌈1000/512⌉ =2. lism method. As expected, we observe poor performance num Hence only two SMs are working with one block each. for both strategies, as shown in Figure 5. The data reuse However, theseSMs haveahigheroccupancy(1× strategy produces higher performance than the intuitive 512)/1024 = 50%, which compensates for the decrease strategy, because the number of permutations is reduced in the number of working SMs. Consequently, even byafactorof Q, the number of query sequences. To though there is a reduction in speedups, it is not alleviate the low occupancy problem, our proposed tile- directly proportional to the reduction in the number of based strategy uses a carefully tuned tile size that effec- active SMs. When the number of threads per block is tively increase the occupancy. Recall that, as a result of 64,for thesequencelengthof200,400,800,and 1600, tiling, we send 14 subject sequences and one query the best speedups are 52.14×, 56.36×, 73.39×, and sequence to the GPU in each round. After the permuta- 73.16×, respectively. tion step, we have 14 × 1000 alignments to be per- In brief, getting performance out of a GPU is about formed, assuming each subject sequence is permuted keeping the CUDA cores busy. Both inter-task paralle- 1000 times. Consequently, each SM has 1000 alignments lism and intra-task parallelism, under the situations of to perform. Also, note that, the maximum number of small data set being processed, seem to fail in this blocks that can be launched simultaneously on an SM is regard, but not necessarily for the same reason. 8. Therefore, when the number of threads per block is 64, the number of blocks per SM is min(⌈1024/64⌉,8) = Multi-pair PSSE result analysis 8, resulting in an occupancy of (8 × 64)/1024 = 50%, In the intuitive and data reuse strategies, most of the based on Equation (2) and (3). This is a significant SMs suffer from the same low occupancy as the single- improvement over the intuitive and data-reuse Figure 5 Performance of three strategies for multi-pair PSSE. All experiments are run using 2771 subject sequences and 86 query sequences from the CATH 2.3 database. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 10 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 strategies. The maximum theoretical occupancy of an method and related programs in CUDA is available for SM for the three strategies is given in Table 1. free academic use at http://cucis.ece.northwestern.edu/ Due to a high occupancy, the tile-based strategy using projects/PSSE/. single (dual) GPU(s) achieves speedups of 240.31 (338.63)×, 243.51 (361.63)×, 240.71 (363.99)×, and Conclusions 243.84 (369.95)× using 64, 128, 256, and 512 threads In this paper, we present a high performance accelerator per block, respectively, as shown in Figure 5. to estimate the pairwise statistical significance of local The billion cell updates per second (GCUPS) value is sequence alignment, which supports standard substitu- another commonly used performance measure in tion matrix like BLOSUM62 as well as PSSMs. bioinformatics [33]. The tile-based strategy using single Our accelerator harvests the computation power of (dual) GPU(s) achieves performance results in the many-core GPUs by using CUDA, which results in high range of 16.55 (23.34) to 16.79 (25.50) GCUPS, as end-to-end speedups for PSSE. We also demonstrate a shown in Figure 5. Although a direct comparison comparative performance analysis of single-pair and across different GPU implementations and hardware in multi-pair implementations. The proposed optimizations not fair, just for the sake of completeness, we cited and efficient framework are applicable to a wide variety below the performance in GCUPS reported by some of next-generation sequencing comparison based appli- existing implementations of the Smith-Waterman cations, such as, DNA sequence mapping and database alignment task. It is worthwhile to mention here that search. As the size of biological sequence databases are [46] has a peak performance of 4.65 to 8.99 GCUPS increasing rapidly, even more powerful high perfor- for various query lengths on an NVIDIA 9800 and [42] mance computing accelerator platforms are expected to has a peak performance of 3.5 GCUPS on a worksta- be more and more common and imperative for tion running two GeForce 8800 GTX. Our implemen- sequence analysis, for which our work can serve as a tation show higher GCUPS. meaningful stepping stone. In summary, low occupancy is known to interfere with the ability to hide latency on memory-bound kernels, Abbreviations causing performance degradation. However, increasing BLAST: Basic Local Alignment Search Tool; BLOSUM: BLOcks of Amino Acid occupancy does not necessarily increase performance. In SUbstitution Matrix; CUDA: Compute Unified Device Architecture; EVD: Extreme Value Distribution; GCUPS: Billion Cell Updates Per Second; GPU: general, once a 50% occupancy is achieved, further opti- Graphics Processing Unit; PSSM: Position Specific Scoring Matrix; PSSE: mization to gain additional occupancy has little effect Pairwise Statistical Significance Estimation; PSI BLAST: Position Specific on performance [37]. Our experiments verify this claim. Iterative BLAST; PSA: Pairwise Sequence Alignment; PSS: Pairwise Statistical Significance; SIMT: Single: Instruction, Multiple: Thread; SSM: Standard Since the GPU implementation presented in this paper Substitution Matrix SM: Streaming Multiprocessor; SW: Smith-Waterman. uses the same algorithm for PSSE (specifically the Smith-Waterman algorithm for getting alignment scores, Acknowledgements This work is supported in part by NSF award numbers CCF-0621443, OCI- the same fitting routine to get statistical parameters K 0724599, CCF-0833131, CNS-0830927, IIS-0905205, OCI-0956311, CCF- and l, and the same algorithm parameters) as in [11], 0938000, CCF-1043085, CCF-1029166, and OCI-1144061, and in part by DOE the retrieval accuracy of the proposed implementation is grants DE-FC02-07ER25808, DE-FG02-08ER25848, DE-SC0001283, DE- SC0005309, and DE-SC0005340.This work is also supported in part by expected to be identical to [11]. According to [11], pair- National Nature Science Foundation of China (NO. 60973118, NO. 61133016), wise statistical significance with standard substitution Sichuan provincial science and technology agency (NO. matrices performs at least comparable or significantly M110106012010HH2027), Ministry of Education of China (NO. 708078), and China Scholarship Council. better than database statistical significance (using This article has been published as part of BMC Bioinformatics Volume 13 BLAST, PSI-BLAST, and SSEARCH). Moreover, pair- Supplement 5, 2012: Selected articles from the First IEEE International wise statistical significance with PSSMs performs signifi- Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are cantly better than using standard substitution matrices, available online at http://www.biomedcentral.com/bmcbioinformatics/ and also better than PSI-BLAST using pre-constructed supplements/13/S5. position-specific substitution matrices. More details can Author details be found in [11]. The implementation of the proposed School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, USA. Table 1 The maximum occupancy for three strategies Authors’ contributions Threads/blocks 64 128 256 512 YZ conceptualized the study, carried out the design and implementation of Intuitive occupancy 12.5% 6.25% 6.25% 6.25% the algorithm, analyzed the results and drafted the manuscript; SM, AA, MMAP, WL, ZQ, and AC participated analysis of the results and contributed Data-reuse occupancy 12.5% 6.25% 6.25% 6.25% to revise the manuscript. All authors read and approved the final Tile-based occupancy 50% 100% 100% 100% manuscript. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 11 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 22. Agrawal A, Choudhary A, Huang X: Derived distribution points heuristic Competing interests for fast pairwise statistical significance estimation. Proceedings of the First The authors declare that they have no competing interests. ACM International Conference on Bioinformatics and Computational Biology, ACM 2010, 312-321. Published: 12 April 2012 23. Agrawal A, Misra S, Honbo D, Choudhary AN: Parallel pairwise statistical significance estimation of local sequence alignment using Message References Passing Interface library. Concurrency and Computation: Practice and 1. Roos DS: COMPUTATIONAL BIOLOGY: Bioinformatics-trying to swim in a Experience 2011, 23(17):2269-2279. sea of data. Science 2001, 291:1260-1261. 24. Yu Y, Gertz E, Agarwala R, Schäffer A, Altschul S: Retrieval accuracy, 2. Yooseph S, Sutton G, Rusch D, Halpern A, Williamson S, Remington K, statistical significance and compositional similarity in protein sequence Eisen J, Heidelberg K, Manning G, Li W, et al: The sorcerer II global ocean database searches. Nucleic Acids Research 2006, 34(20):5966. sampling expedition: expanding the universe of protein families. PLoS 25. Zuyderduyn S: Statistical analysis and significance testing of serial Biology 2007, 5(3):e16. analysis of gene expression data using a Poisson mixture model. BMC 3. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos JS, Bealer K, bioinformatics 2007, 8:282. Madden TL: BLAST+: architecture and applications. BMC Bioinformatics 26. Aleksandar P: Island method for estimating the statistical significance of 2009, 10:421. profile-profile alignment scores. BMC Bioinformatics 2009, 10. 4. Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: 27. Karlin S, Altschul S: Methods for assessing the statistical significance of Gapped BLAST and PSI-BLAST: a new generation of protein database molecular sequence features by using general scoring schemes. search programs. Nucleic Acids Research 1997, 25(17):3389-3402. Proceedings of the National Academy of Sciences 1990, 87(6):2264. 5. Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, 28. O R, B R, H T: Rapid assessment of extremal statistics for gapped local Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein alignment. Proceedings of the Seventh International Conference on Intelligent database searches with composition-based statistics and other Systems for Molecular Biology 1999, 211-222. refinements. Nucleic Acids Research 2001, 29(14):2994-3005. 29. Eddy SR: Maximum likelihood fitting of extreme value distributions. 1997, 6. Yu Y, Altschul S: The construction of amino acid substitution matrices for [citeseer.ist.psu.edu/370503. html]. [Unpublished work]. the comparison of proteins with non-standard compositions. 30. Waterman M, Vingron M: Rapid and accurate estimates of statistical Bioinformatics 2005, 21(7):902-911. significance for sequence data base searches. Proceedings of the National 7. Pearson WR: Flexible sequence similarity searching with the FASTA3 Academy of Sciences 1994, 91(11):4625-4628. program package. Methods in molecular biology 2000, 132:185-219. 31. Zhang Y, Misra S, Honbo D, Agrawal A, keng Liao W, Choudhary A: Efficient 8. Mott R: Accurate formula for p-values of gapped local sequence and pairwise statistical significance estimation for local sequence alignment profile alignments. Journal of Molecular Biology 2000, 300:649-659. using GPU. IEEE 1st International Conference on Computational Advances in 9. Pearson W, Lipman D: Improved tools for biological sequence Bio and Medical Sciences (ICCABS) 2011, 226-231. comparison. Proceedings of the National Academy of Sciences 1988, 32. Samuel A, Malcolm M, Aaron G, Kevin G, Mahesh V, David R, Cesar A, 85(8):2444. James W, Owen W, Florian F: CloVR: A virtual machine for automated and 10. Pagni M, Jongeneel C: Making sense of score statistics for sequence portable sequence analysis from the desktop using cloud computing. alignments. Briefings in Bioinformatics 2001, 2:51-67. BMC Bioinformatics 2011, 12:1-15. 11. Agrawal A, Huang X: Pairwise statistical significance of local sequence 33. Liu Y, Maskell DL, Schmidt B: CUDASW++: optimizing Smith-Waterman alignment using sequence-specific and position-specific substitution sequence database searches for CUDA-enabled graphics processing matrices. IEEE/ACM Transactions on Computational Biology and units. BMC Research Notes 2009, 2:73. Bioinformatics 2011, 8:194-205. 34. Liu Y, Schmidt B, Maskell DL: CUDASW++2.0: enhanced Smith-Waterman 12. Mitrophanov A, Borodovsky M: Statistical significance in biological protein database search on CUDA-enabled GPUs based on SIMT and sequence analysis. Briefings in Bioinformatics 2006, 7:2-24. virtualized SIMD abstractions. BMC Research Notes 2010, 3:93. 13. Pearson W: Empirical statistical estimates for sequence similarity 35. Agrawal A, Misra S, Honbo D, Choudhary AN: MPIPairwiseStatSig: parallel searches. Journal of Molecular Biology 1998, 276:71-84. pairwise statistical significance estimation of local sequence alignment. 14. Altschul SF, Bundschuh R, Olsen R, Hwa T: The estimation of statistical Proceedings of the 19th ACM International Symposium on High Performance parameters for local alignment score distributions. Nucleic Acids Research Distributed Computing 2010, 470-476. 2001, 29(2):351-361. 36. Honbo D, Agrawal A, Choudhary AN: Efficient pairwise statistical 15. Agrawal A, Brendel V, Huang X: Pairwise statistical significance and significance estimation using FPGAs. Proceedings of BIOCOMP 2010, empirical determination of effective gap opening penalties for protein 2010:571-577. local sequence alignment. International Journal of Computational Biology 37. NVIDIA: NVIDIA CUDA C: Best Practices Guide 4.1 2011. and Drug Design 2008, 1(4):347-367. 38. NVIDIA: NVIDIA CUDA C: Programming Guide 4.1 2011. 16. Poleksic A, Danzer JF, Hambly K, Debe DA: Convergent island statistics: a 39. Ryoo S, Rodrigues C, Baghsorkhi S, Stone S, Kirk D, Hwu W: Optimization fast method for determining local alignment score significance. principles and application performance evaluation of a multithreaded Bioinformatics 2005, 21(12):2827-2831. GPU using CUDA. Proceedings of the 13th ACM SIGPLAN Symposium on 17. Agrawal A, Huang X: PSIBLAST PairwiseStatSig: reordering PSI-BLAST hits Principles and Practice of Parallel Programming, ACM 2008, 73-82. using pairwise statistical significance. Bioinformatics 2009, 25(8):1082-1083. 40. Smith T, Waterman M: Identification of common molecular 18. Agrawal A, Huang X: Pairwise statistical significance of local sequence subsequences. Journal of Molecular Biology 1981, 147:195-197. alignment using multiple parameter sets and empirical justification of 41. Rognes T, Seeberg E: Six-fold speed-up of Smith-Waterman sequence parameter set change penalty. BMC bioinformatics 2009, 10(Suppl 3):S1. database searches using parallel processing on common 19. Sierk ML, Smoot ME, Bass EJ, Pearson WR: Improving pairwise sequence microprocessors. Bioinformatics 2000, 16(8):699-706. alignment accuracy using near-optimal protein sequence alignments. 42. Manavski S, Valle G: CUDA compatible GPU cards as efficient hardware BMC Bioinformatics 2010, 11:146. accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics 20. Agrawal A, Choudhary A, Huang X: Sequence-specific sequence 2008, 9(Suppl 2):S10. comparison using pairwise statistical significance. Software Tools and 43. Liu W, Schmidt B, Voss G, Müller-Wittig W: Streaming algorithms for Algorithms for Biological Systems, Volume 696 of Advances in Experimental biological sequence alignment on GPUs. IEEE Transactions on Parallel and Medicine and Biology Springer New York; 2011, 297-306. Distributed Systems 2007, 18(9):1270-1281. 21. Agrawal A, Brendel V, Huang X: Pairwise statistical significance versus 44. Sierk ML, Pearson WR: Sensitivity and selectivity in protein structure database statistical significance for local alignment of protein comparison. Protein Science 2004, 13(3):773-785. sequences. Bioinformatics Research and Applications 2008, 50-61. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 12 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 45. Zhang Y, Patwary M, Misra S, Agrawal A, Liao W, Choudhary AN: Enhancing parallelism of pairwise statistical significance estimation for local sequence alignment. Workshop on Hybrid Multi-core Computing 2011, 1-8. 46. Ligowski L, Rudnicki W: An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. IEEE International Symposium on Parallel & Distributed Processing 2009, 1-8. doi:10.1186/1471-2105-13-S5-S3 Cite this article as: Zhang et al.: Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU’s power. BMC Bioinformatics 2012 13(Suppl 5):S3. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color ﬁgure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals http://www.deepdyve.com/lp/springer-journals/accelerating-pairwise-statistical-significance-estimation-for-local-yh1TWN5eVN

Loading next page...

References (56)

S Karlin, S Altschul (1990)
Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes
Proceedings of the National Academy of Sciences, 87
L Ligowski, W Rudnicki (2009)
IEEE International Symposium on Parallel & Distributed Processing
A Samuel, M Malcolm, G Aaron, G Kevin, V Mahesh, R David, A Cesar, W James, W Owen, F Florian (2011)
CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing
BMC Bioinformatics, 12
(2011)
NVIDIA CUDA C: Best Practices Guide 4.1
T Rognes, E Seeberg (2000)
Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors
Bioinformatics, 16
Y Zhang, M Patwary, S Misra, A Agrawal, W Liao, AN Choudhary (2011)
Workshop on Hybrid Multi-core Computing
A Agrawal, A Choudhary, X Huang (2011)
Software Tools and Algorithms for Biological Systems, Volume 696 of Advances in Experimental Medicine and Biology
(2010)
MPIPairwiseStatSig: parallel pairwise statistical significance estimation of local sequence alignment
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
AA Schäffer, L Aravind, TL Madden, S Shavirin, JL Spouge, YI Wolf, EV Koonin, SF Altschul (2001)
Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements
Nucleic Acids Research, 29
W Liu, B Schmidt, G Voss, W Müller-Wittig (2007)
Streaming algorithms for biological sequence alignment on GPUs
IEEE Transactions on Parallel and Distributed Systems, 18
Y Liu, B Schmidt, DL Maskell (2010)
CUDASW++2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions
BMC Research Notes, 3
DS Roos (2001)
COMPUTATIONAL BIOLOGY: Bioinformatics-trying to swim in a sea of data
Science, 291
A Mitrophanov, M Borodovsky (2006)
Statistical significance in biological sequence analysis
Briefings in Bioinformatics, 7
Y Yu, S Altschul (2005)
The construction of amino acid substitution matrices for the comparison of proteins with non-standard compositions
Bioinformatics, 21
M Waterman, M Vingron (1994)
Rapid and accurate estimates of statistical significance for sequence data base searches
Proceedings of the National Academy of Sciences, 91
(2010)
Derived distribution points heuristic for fast pairwise statistical significance estimation
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, ACM
ML Sierk, WR Pearson (2004)
Sensitivity and selectivity in protein structure comparison
Protein Science, 13
W Pearson (1998)
Empirical statistical estimates for sequence similarity searches
Journal of Molecular Biology, 276
S Ryoo, C Rodrigues, S Baghsorkhi, S Stone, D Kirk, W Hwu (2008)
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM
P Aleksandar (2009)
BMC Bioinformatics
A Agrawal, S Misra, D Honbo, AN Choudhary (2011)
Parallel pairwise statistical significance estimation of local sequence alignment using Message Passing Interface library
Concurrency and Computation: Practice and Experience, 23
S Zuyderduyn (2007)
Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model
BMC bioinformatics, 8
A Agrawal, V Brendel, X Huang (2008)
Bioinformatics Research and Applications
WR Pearson (2000)
Flexible sequence similarity searching with the FASTA3 program package
Methods in molecular biology, 132
ML Sierk, ME Smoot, EJ Bass, WR Pearson (2010)
Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments
BMC Bioinformatics, 11
R Mott (2000)
Accurate formula for p-values of gapped local sequence and profile alignments
Journal of Molecular Biology, 300
W Pearson, D Lipman (1988)
Improved tools for biological sequence comparison
Proceedings of the National Academy of Sciences, 85
SF Altschul, R Bundschuh, R Olsen, T Hwa (2001)
The estimation of statistical parameters for local alignment score distributions
Nucleic Acids Research, 29
(2008)
Pairwise statistical significance versus database statistical significance for local alignment of protein sequences
Bioinformatics Research and Applications
A Agrawal, X Huang (2009)
Pairwise statistical significance of local sequence alignment using multiple parameter sets and empirical justification of parameter set change penalty
BMC bioinformatics, 10
A Agrawal, S Misra, D Honbo, AN Choudhary (2010)
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing
S Yooseph, G Sutton, D Rusch, A Halpern, S Williamson, K Remington, J Eisen, K Heidelberg, G Manning, W Li (2007)
The sorcerer II global ocean sampling expedition: expanding the universe of protein families
PLoS Biology, 5
A Poleksic, JF Danzer, K Hambly, DA Debe (2005)
Convergent island statistics: a fast method for determining local alignment score significance
Bioinformatics, 21
(1999)
Rapid assessment of extremal statistics for gapped local alignment
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology
D Honbo, A Agrawal, AN Choudhary (2010)
Efficient pairwise statistical significance estimation using FPGAs
Proceedings of BIOCOMP, 2010
(2008)
Optimization principles and application performance evaluation of a multithreaded GPU using CUDA
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM
(2011)
Efficient pairwise statistical significance estimation for local sequence alignment using GPU
IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences (ICCABS)
SR Eddy (1997)
Maximum likelihood fitting of extreme value distributions
(2009)
An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases
IEEE International Symposium on Parallel & Distributed Processing
A Agrawal, A Choudhary, X Huang (2010)
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, ACM
Y Yu, E Gertz, R Agarwala, A Schäffer, S Altschul (2006)
Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches
Nucleic Acids Research, 34
(2011)
Enhancing parallelism of pairwise statistical significance estimation for local sequence alignment
Workshop on Hybrid Multi-core Computing
C Camacho, G Coulouris, V Avagyan, N Ma, JS Papadopoulos, K Bealer, TL Madden (2009)
BLAST+: architecture and applications
BMC Bioinformatics, 10
S Altschul, T Madden, A Schäffer, J Zhang, Z Zhang, W Miller, D Lipman (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs
Nucleic Acids Research, 25
M Pagni, C Jongeneel (2001)
Making sense of score statistics for sequence alignments
Briefings in Bioinformatics, 2
(2011)
Sequence-specific sequence comparison using pairwise statistical significance
Software Tools and Algorithms for Biological Systems, Volume 696 of Advances in Experimental Medicine and Biology
A Agrawal, V Brendel, X Huang (2008)
Pairwise statistical significance and empirical determination of effective gap opening penalties for protein local sequence alignment
International Journal of Computational Biology and Drug Design, 1
A Agrawal, X Huang (2009)
PSIBLAST PairwiseStatSig: reordering PSI-BLAST hits using pairwise statistical significance
Bioinformatics, 25
Y Liu, DL Maskell, B Schmidt (2009)
CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units
BMC Research Notes, 2
Y Zhang, S Misra, D Honbo, A Agrawal, W keng Liao, A Choudhary (2011)
IEEE 1st International Conference on Computational Advances in Bio and Medical Sciences (ICCABS)
(2011)
NVIDIA CUDA C: Programming Guide 4.1
T Smith, M Waterman (1981)
Identification of common molecular subsequences
Journal of Molecular Biology, 147
S Manavski, G Valle (2008)
CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment
BMC Bioinformatics, 9
(2009)
Island method for estimating the statistical significance of profile-profile alignment scores
BMC Bioinformatics, 10
A Agrawal, X Huang (2011)
Pairwise statistical significance of local sequence alignment using sequence-specific and position-specific substitution matrices
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8
R O, R B, T H (1999)
Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology

Publisher: Springer Journals
Copyright: Copyright © 2012 by Zhang et al.; licensee BioMed Central Ltd.
Subject: Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN: 1471-2105
DOI: 10.1186/1471-2105-13-S5-S3
pmid: 22537007
Publisher site: See Article on Publisher Site

Abstract

Background: Pairwise statistical significance has been recognized to be able to accurately identify related sequences, which is a very important cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability. Results: We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm’s data access characteristics, we developed a tile-based scheme that can produce a contiguous data access in the GPU global memory and sustain a large number of threads to achieve a high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual-GPUs. We observe end-to-end speedups of nearly 250 (370) × using single-GPU Tesla C2050 GPU (dual-Tesla C2050) over the CPU implementation using Intel Core™i7 CPU 920 processor. Conclusions: Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment. Background applications have been developed based on pairwise Introduction sequence alignment, such as BLAST [3], PSI-BLAST The past decades have witnessed dramatically increasing [4-6], and FASTA [7]. PSA produces a score for an trends in the quantity and variety of publicly available alignment as a measure of the similarity between two genomic and proteomic sequence data. Dealing with the sequences. Generally, the higher the score, the more massivedataand making senseofthemarebigchal- related the sequences. However, the alignment score lenges in bioinformatics [1,2]. One of the most widely depends on various factors such as alignment methods, scoring schemes, sequence lengths, and sequence com- used procedures for extracting information from proteo- mic and genomic datais pairwisesequencealignment positions [8,9]. Judging the relationship between two (PSA). Given two sequences, PSA finds the extent of sequences solely based on the scores can often lead to similarity between them. Many bioinformatics wrong conclusion. Therefore, it is more appropriate to measure the quality of PSA using the statistical signifi- cance of the score rather than the score itself [10,11]. * Correspondence: [email protected] School of Computer Science and Engineering, University of Electronic Statistical significance of sequence alignment scores is Science and Technology of China, Chengdu, China very important to know whether an observed sequence Full list of author information is available at the end of the article © 2012 Zhang et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 2 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 similarity could imply a functional or evolutionary link, strongly motivates the use of GPUs to accelerate the or is achanceevent [8,12].Accurateestimationofsta- PSSE. Acceleration of PSSE using MPI [35] and FPGA tistical significance of gapped sequence alignment has [36] has been explored earlier, and in this work, we attracted a lot of research in recent years [13-26]. design a GPU implementation for the same. Compared to [35,36], we consider the estimation of multi-pair PSS Pairwise statistical significance along with single-pair PSS using GPUs. Consider a pair of sequences s and s of lengths m and 1 2 n, respectively, the scoring scheme, SC (substitution Contributions matrix, gap opening penalty, gap extension penalty), and We present a GPU implementation to accelerate pair- the number of permutations N of s , pairwise statistical wise statistical significance estimation of local sequence significance (PSS) of the two sequences is calculated by alignment using standard substitution matrices. Our the following function [21], which is described below: parallel implementation makes use of CUDA (Compute Unified Device Architecture) parallel programming para- PSS(s , s , m, n, SC, N), 1 2 digm. Our design uses an efficient data reorganization method to produce coalesced global memory access, Through permuting s N times randomly, the function and a tiled-based memory scheme to increase the GPU generates N scores by aligning s1 against each of the N occupancy, a key measure for GPU efficiency [37]. permuted sequences and then fits these scores to an Through careful analysis of the data dependency and extreme value distribution (EVD) [8,27,28] using cen- access patterns of PSSE, we reorganize the data sored maximum likelihood [29]. The returned value is sequences into aligned arrays that coalesce the global the pairwise statistical significance of s and s .The 1 2 memory access pattern. Such data access contiguity EVD describes an approximate distribution of optimal keeps the GPU cores occupied with computation and scores of a gapless alignment [4]. Based on this distribu- allows the thread scheduler to overlap the global mem- tion, the probability (i.e., P-value) of observing an ory access with the computation. In addition to the abil- sequence pair with a score S greater than x,can be ity to calculate the optimal tile size for data to be given by: shipped to the GPU, our design can also issue a large −λx −E(x) enough number of threads to maximize the occupancy. (1) P(S > x) ≈ 1 − exp(−Kmne )= 1 − e We further extend the parallelization technique to where l and K are calculational constants and E(x), estimate pairwise statistical significance using position- also known as E-value, is the expected number of dis- specific substitution matrices, which has earlier demon- tinct local alignments with score values of at least x. strated significantly better sequence comparison accu- Note that the above distribution is for a gapless align- racy than using standard substitution matrices [11]. The ment. For the cases of gapped alignment, although no implementation is also extended to take advantage of asymptotic score distribution has yet been established dual-GPUs to accelerate those computations. As a analytically, computational experiments strongly indicate result, maximum performance could be obtained by har- these scores still roughly follow Gumbel law [8,13,30]. vesting the power of many-core GPUs. In addition to not needing a database to estimate the The performance evaluation was carried out on NVI- statistical significance of an alignment, pairwise statisti- DIA Telsa C2050 GPU. For multi-pair PSSE implemen- cal significance is shown to be more accurate than data- tation, we observe nearly 250 (370)× speedups using a base statistical significance reported by popular database single-GPU Tesla C2050 GPU (dual-Tesla C2050) over search programs like BLAST, PSI-BLAST, and the CPU implementation using an Intel Core ™i7 CPU SSEARCH [21]. However, it involves thousands of such 920 processor. The proposed optimizations and efficient permutations and alignments, which are enormously framework for PSSE, which have direct applicability to time consuming and can be impractical for estimating speedup homology detection, are also applicable to pairwise statistical significance of a large number of many other sequence comparison based applications, sequence pairs. Hence, use of high-performance com- such as DNA sequence mapping, phylogenetic tree con- puting (HPC) techniques (such as multi-cores CPU, struction and database search especially in the era of many-core graphics processing units (GPUs), FPGAs, next-generation sequencing. etc.) is highly conducive to accelerate the computation of PSSE. Moreover, large data sets demand more com- Methods puting power. In many demanding bioinformatics appli- In this section we present GPU implementations both cations, such as sequence alignment [31,32], and protein for single-pair PSSE and multi-pair PSSE. Along with sequence database search [33,34], many-core GPU has the methodological details, we also discuss several per- demonstrated its extreme computing power. This formance optimization techniques used in this paper, Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 3 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 such as, GPU memory access optimization, occupancy of s to s , in this layout the gap between the memory 2 1 maximization, and substitution matrix customization. accesses by the neighboring threads is at least the length These techniques have significantly speeded up PSSE. of the sequence. For example, in the intuitive layout, if Careful analysis of the data pipelines of PSSE shows thread T accesses the first residue (i.e., ‘R’), and thread that the computation of PSSE can be decomposed into T accesses the first residue (i.e., ‘E’), the gap between three computation kernels: Permutation, Alignment and the access data is n. This results in non-coalesced mem- Fitting. Permutation and Alignment comprise the over- ory reads (i.e., serialized reads), which significantly dete- whelming majority (a more than 99.8%) of the overall riorates the performance. execution time [35]. Therefore, efforts should be spent We therefore reorganize the layout of sequence data to optimize these two kernels to achieve high perfor- in memory to obtain coalesced reads. Now, to achieve mance. Also, we observe that permutation presents high coalesced access, we reorganize layout of sequences in degrees of data independency that are naturally suitable memory as aligned structure of arrays, as shown in Fig- for single-instruction, multiple-thread (SIMT) architec- ure 1 (b) In the optimized layout, the characters (in tures [38] and therefore, can be mapped very well to granularity of 4 bytes) that lie at the same index in dif- task parallelism models of GPU. Moreover, even though ferent permuted sequences stay at neighboring positions. the alignment task suffers from data dependency, we Then if the first uchar4 of the first permuted sequence show that with clever optimizations, it can be heavily (i.e. ‘REGN’) is requested by thread T ,the first uchar4 accelerated using GPUs. of the second permuted sequence (i.e. ‘ARNE’)is requested by T , and so on. This results in reading a Design consecutive memory (each thread reads 4 bytes) by a GPU memory access optimization warp of threads in a single access. Thus the global It is especially important to optimize global memory memory access is coalesced, and therefore high perfor- access as its bandwidth is low and its latency is hun- mance is achieved. dreds of clock cycles [38]. Moreover, global memory As the sequences remain unchanged during the align- coalescing is the most critical optimization for GPU ment, they can be thought of as read-only data, which programming [39]. Since the kernels of PSSE usually can be bound to texture memory. For read patterns, tex- work over large numbers of sequences that reside in the ture memory fetches is a better alternative to global global memory, the performance is highly dependent on memory reads because of texture memory cache, which hiding memory latency. When a GPU kernel is accessing can further improve the performance. global memory, all threads in groups of 32 (i.e. warp) Occupancy maximization access a bank of memory at one time. A batch of mem- Hiding global memory latency is very important to ory accesses is considered coalesced when the data achieve high performance on the GPU. This can be requested by a warp of threads are located in contiguous done by creating enough threads to keep the CUDA memory addresses. For example, if the data requested by cores always occupied while many other threads are threads within a warp are located in 32 consecutive waiting for global memory accesses [39]. GPU occu- th memory addresses (such that the i address is accessed pancy, as defined below, is a metric to determine how th by the i thread), the memory can be read in a single effectively the hardware is kept busy: access. Hence, this memory access operation runs 32 Occupancy =(B × T )/T (2) num max times faster. If the memory access is not coalesced, it is divided into multiple reads and hence serialized [37]. where T is maximum number of resident threads max After permutation, if the sequence s and its N per- that can be launched on a streaming multiprocessor muted copies were stored contiguously one after (SM) (which is a constant for a specific GPU), T is num another in the global memory, the intuitive memory lay- the number of active threads per block and B is the out would be as shown in Figure 1 (a) Note that, we number of active blocks per Streaming Multiprocessor need one byte (uchar) to store each amino acid residue. (SM). Moreover, GPU can read four-byte (packed as a CUDA B also depends upon the GPU physical limitations (e. built-in vector data type uchar4) of data from the global g. the amount of registers, shared memory and threads memory to registers in one instruction. To achieve high supported in each model). It can be given in the follow- parallelism of global memory access, uchar4 is used to ing way: store the permuted sequences. Dummy amino acid sym- bols are padded in the end to make the length of B = min(B , B , B , B ) (3) user reg shr hw sequences a multiple of 4. where B is the hardware limit (only 8 blocks are Considering inter-task parallelism, where each thread hw allowed per SM), and B , B , are the potential blocks works on the alignment of one of the permuted copies reg shr Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 4 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 Figure 1 Two GPU memory layout strategies. (a) The intuitive layout that one sequence is appended after another. (b) The reorganized layout such that the uchar4 at the same indices in different sequences stay at neighboring positions. determined by the available registers, shared memory, � To achieve 100% occupancy, (B × T )/T ≥ hw num max respectively and B is blocks set by the user. Note that 1. Hence, setting T such that T ≥ T /B is user num num max hw B ≤ B , therefore we obtain: preferable. hw Substitution matrix customization Occupancy ≤ (B × T )/T (4) hw num max The Smith-Waterman (SW) algorithm [40] looks up the substitution score matrix very frequently while comput- Based on the above analysis, higher occupancy can be ing the alignment scores. Lowering the number of look- pursued according to the following rules: ups obviously reduces the overall execution time. As suggested by [41], performance improvement could � To avoid wasting computation on under-populated furtherbeobtainedbycustomizing thematrix. Usually, warps, the number of threads per block should be the substitution score matrix is indexed by query- chosen as a multiple of the warp size (currently, 32). sequence symbol and the subject-sequence symbol as in � Total number of active threads per SM should be BLOSUM62. Another way is to index the substitution close to maximum number of resident threads per scorematrixbyquery-sequenceposition and the sub- multiprocessor (T ). In short, we should use all max ject-sequence symbol. We refer to this customized the available threads if possible. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 5 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 substitution matrix as local query profile. For example, L|. This increases the memory requirement compared to let us consider a query sequence Q of length L over a the original layout, but reduces the lookups of the sub- set of residue alphabet ω. For each residue, we store a stitution matrix significantly as explained below. substitution score for every query sequence position. For In traditional matrix layout, if the substitution scores example, in Figure 2 (a) the substitution scores for of a subject sequence residue (i.e., ‘A’)againstquery matching residue ‘A’ with each symbol of the query sequence residues (i.e., ‘A’,’K’,’L’,’G’) are required subse- sequence Q are stored in the first row, and the substitu- quently, GPU has to look up the matrix four times one tion scores for matching residue ‘C’ are stored in the by one, because the score S[A][A], S[A][K], S[A][L], S[A] [G] are usually stored far from each other. In contrast, if next row, and so on. Here the substitution score of the residue of subject sequence against the same symbol of we use CUDA built-in vector data type int4 (which query sequence at different position is always same. packs 4 integers together) to store the customized sub- By contrast, position-specific score matrix (PSSM) [4] stitution matrix in the texture memory, the same four offers a variation of this approach. We call this a posi- scores are stored at neighboring memory locations (as tion specific query profile in which scores are further shown in Figure 2 (b). By reading an int4,the GPUcan refined. In this case the same residue (e.g. ‘A’) appearing get these four scores S[A][i] in one instruction, where i at different position of query sequence has different is the position of residues in query sequence. This scores, as shown in Figure 2 (b). For a given query reduces the number of lookups by a factor of 4, there- sequence, PSSM can be pre-constructed by PSI-BLAST fore, a higher performance can be obtained. [3,4]. In essence, both coalesced memory layout and custo- Using the customized approach, in both cases, the mized substitution matrix optimization heighten perfor- dimension |ω|× |ω| of the substitution matrix is mance of memory access by improving data locality replaced by a query-specific matrix of dimension |ω|×| among threads in GPU programming. Figure 2 The customized substitution matrix. (a) Local query profile: the same symbol (e.g., ‘A’) at different position in query has same scores. (b) Position specific query profile: the same symbol (e.g., ‘A’) at different position in query has different scores. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 6 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 Single-pair PSSE implementation improving the performance byanalyzing thesingle-pair As mentioned previously, to simulate the required ran- PSSE implementation (which we do in a later section). dom sequences, a lot of random numbers are needed, Owing to a dramatic increase of data set for multiple- since each s copy has to be permuted thousands of pairs implementation, there are some significant differ- times. Although the most widely used pseudorandom ences relative to performance of GPU hardware between number generators such as linear congruential genera- the two implementations. Through analyzing our experi- tors (LCGs) can meet our requirements, the current ver- mental results of the single-pair PSSE, we observe that sion of CUDA does not support calls to the host inter-task parallelism performs better than intra-task random function. Hence, we develop an efficient ran- parallelism (results shown later). Hence, we consider dom number generator similar to lrand48() on GPUs. inter-task parallelism in computing multi-pair PSSE. The single-pair PSSE processes only one pair of query Based on the guidelines for optimizing memory and and subject sequences. The idea of computing single- occupancy described earlier, we compare three imple- pair PSSE is as follows. Given the query sequence s and mentation strategies. the subject sequence s ,tocompute PSSE,thousands Algorithm 1: Pseudo-code of single-pair PSSE (say N) of randomly permutations of s are needed. To Input:(s , s ) - Sequence-pair; M - Substitution 2 1 2 obtain these N random sequences of s ,first,aset of N matrix; G - Gap opening penalty; GE -Gap extension random numbers is generated in CPU as described in penalty; N - Number of permutes; previous section. These numbers are then transferred to Output: pss- Pairwise statistical significance GPU and considered as seeds by the threads of GPU. Each thread then generates new random numbers using 1. Initialization its own seed and swap the symbols of s accordingly to (a) Generate a number N of random numbers in obtain a permuted sequence. Thus the N random per- CPU; mutations of s are obtained in parallel. The algorithm (b) Copy LCG seeds to GPU global memory; then uses SW algorithm to compute alignment score of (c) Copy s and s to GPU global memory; 1 2 s and the N permuted copies of s in parallel on GPU. 2. Generate random numbers (RNs) using LCG in 1 2 The scores are then transferred to CPU for fitting. The GPU details of computing single-pair PSSE is outlined by 3. Permute sequence s N times using the RNs Algorithm1.NotethatStep4in Algorithm1usesthe 4. Reorganize sequences s and its N times permuted optimized memory layout, explained previously in detail, copies as shown in Figure 2 (b) if using inter-task for s and its N permuted copies. parallel Smith-Waterman(); Smith-Waterman is a dynamic programming algo- 5. Align s and its N permuted copies against s 2 1 rithm to identify the optimal local alignment between a using Smith-Waterman() (inter-task or intra-task); pair of sequences. In general, there are two different 6. Transfer N alignment scores from GPU into CPU; methods for parallelizing the alignment task [33]. The 7. Fitting in CPU first method is regarded as inter-task parallelism. In this (a) (K, l) ¬ EV DCensordMLFit(Scores); lx case, each thread performs alignment of one pair of (b) pss ¬ 1- exp(-Kmne - ); sequences. Hence, in a thread block, multiple alignment return pss tasks are performed in parallel [42]. The second one is Intuitive strategy intra-task parallelism. Here, alignment of each pair of Given Q query sequences and S subject sequences, the sequences is assigned to a block of threads, splitting the intuitivestrategyistosimplyperform thesame ‘single- whole task into a number of sub-tasks. Each thread in pair’ procedure Q × S times. In other words, in each the thread block then performs its own sub-tasks, coop- iteration, we send a single pair of query and subject erating to exploit the parallel characteristics of cells in sequences to the GPU. The GPU processes that pair the anti-diagonals of the local alignment matrix [43]. and returns the result to the CPU. The same procedure We use a similar alignment kernel as proposed in [34] is repeated for all query and subject sequence pairs. with some modifications for further improvement. It is However, this strategy suffers from low occupancy. We worthwhile to mention that Step 7 describes fitting, analyze the cause along with its performance results in which is implemented on the CPU. This is because it the next section. involves recursion that is not supported very well on Data reuse strategy GPU. In the first strategy, the subject sequences from the database are permuted for every query-subject sequence Multi-pair PSSE implementation pair. Hence, the same subject sequence is permuted The multi-pair PSSE processes multiple queries and every time it is sent to the GPU. A better strategy is to subject sequences. We can obtain some hints about create permutations of each subject sequence only once Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 7 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 and reuse them to align with all the queries. “One per- mutation, all queries” is the idea of the second strategy. Because of the reuse of permuted sequences, higher per- formance is expected than the first strategy. However, the occupancy of GPU, to be shown in the next section, is still not elevated. Adaptively tile-based strategy The low occupancy of the above two strategies is due to the underutilized computing power of GPU. In addition, these two strategies do not work well when the size of subject sequence database becomes too big to be fitted into GPU global memory. For instance, if the size of the original subject sequence database is 5 MB, it becomes 5000 MB when each of the sequences is permuted, say, 1000 times. This prohibits transfer of all the subject sequences to GPU at the same time. We therefore need an optimal number of subject sequences to be shipped Figure 3 Adaptive tile-based strategy. The optimal tile size T can to GPU keeping in mind that the subject sequences and be calculated according to the hardware configuration of GPU. After calculated, the T subject sequences together are transferred to their permuted copies fit in global memory. Moreover, GPU global memory. Permutations and alignments are done in the number of new generated sequences should be parallel in GPU. Then T × 1000 alignment scores are moved back to enough to keep all CUDA cores busy, i.e., keep a high CPU for T fittings. occupancy of GPU, which is very important to harness the GPU power. Herein we develop a memory tiling technique that is Results and discussion self-tuning based on the hardware configuration and can All our experiments have been are carried out using achieve a close-to-optimal performance. The idea behind © Intel Core™i7 CPU 920 processors running @2.67 the technique is as follows. In out-of-core fashion, the GHz. The system has 4 cores, 4 GB of memory, dual data in the main memory is divided into smaller chunks Tesla C2050 GPU (each with 448 CUDA cores) and is called tiles and transferred to the GPU global memory. In running a 64-bit Linux-based operating system. Our our case, the tiles are the number of subject sequences to program has been compiled using gcc 4.4.1 and CUDA be transferred to the GPU at a time. The tile size T can 4.0. The sequence data used in this work comprises of a be calculated using the following equation: non-redundant subset of the CATH 2.3 database [44]. This dataset consists of 2771 domain sequences as our SM × T num max (5) T = subject library and includes 86 CATH queries as our query set.WederivePSSMsfor the86testqueries where SM is the total number of SMs in GPU, num against the non-redundant protein database using PSI- T is the maximum number of resident threads per BLAST (provided along with the BLAST+ package [3]) max SM, and N is the number of permutations. over a maximum of five iterations and with other In Tesla C2050 used in our experiments, there are 14 default parameters. BLOSUM62 and PSSMs have been SMs and the maximum number of resident threads per used as the scoring matrices with affine gap penalty of SM is 1024. Let N = 1000, then, T = ⌊14 × 1024/1000⌋ 10 + 2k for a gap of length k. We permute the 2771 = 14, which means that there are 14 distinct subject subject sequences N = 1000 time as in several previous sequences to be transferred to the GPU’s global memory studies [21,35,36,45]. at a time. Based on the second strategy (i.e., data reuse), 14 subject sequences and their permuted copies are Single-pair PSSE result analysis aligned against one query sequence at a time, until all In single-pair PSSE implementation, we use both the the query sequences are processed. As a result, 14 × intra-task and inter-task parallelism methods. We 1000 alignment scores in total are obtained in each choose four pairs of query and subject sequences of round, which are subsequently transferred to CPU for length 200, 400, 800, and 1600 from CATH 2.3 data- fitting. The CPU takes the 1000 alignment scores for base. We compute PSSE for all these pairs using 64, each subject-query sequence pair and uses them to 128, 256, and 512 threads per block. The experimental compute the corresponding pss. The tile-based strategy results have been plotted in Figure 4. All speedups are has been described in Figure 3. computed over the corresponding the CPU Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 8 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 Figure 4 Intra-task and inter-task parallelism for single-pair PSSE. We choose four pairs of query and subject sequences of length 200, 400, 800, and 1600 from CATH database. implementation. We observe that the inter-task parallel last anti-diagonal. Only the cells of the alignment matrix implementation performs significantly better than the belonging to the same anti-diagonal can be computed in intra-task parallel implementation. Employing further parallel. In this case, most threads in a block have no optimizations and newer version of CUDA, both meth- work to do. As a result, the performance of the intra- ods show a higher speedup compared to our previous task parallel implementation is worse than that of the inter-task parallel implementation even though it has results [31]. For intra-task parallel implementation, the best speed- higher occupancy. ups for sequences of length 200, 400, and 800 are In contrast, inter-task parallelism would have a total of 24.57×, 35.09×, and 43.63×, respectively. All these are 1000 threads (one for each alignment task). If each obtained using 128 threads per block. But the best block contains 64 threads, then the total number of speedup for sequences of length 1600 is 46.34× and blocks is B = ⌈1000/64⌉ = 16. The assignment of 16 used 256 threads per block. These results tell us that it blocks to 14 available SMs will result in 12 SMs with is hard to find a general rule to parameterize the set- one active block and two SMs with two blocks. There- tings to achieve peak performance. A possible reason is fore,the occupancyfor thetwo SMs withtwo blocks is that many factors, such as the number of threads per (2 × 64)/1024 = 12.5%. For the 12 SMs with only one block, the available registers, and shared memory, may block its occupancy is (1 × 64)/1024 = 6.25%. Because contradict with each other. of the low occupancy, there are not sufficient threads to The intra-task parallel implementation creates enough keep CUDA cores busy when the global memory is thread blocks (one block for each alignment task) to accessed. As a result, the latency-hiding capabilities of keep the occupancy high. However, the SW alignment this method are limited. As the number of threads per matrix must be serially computed from the first to the block T increases, the total active blocks (B) num Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 9 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 decreases and some SMs even have no blocks assigned. pair PSSE implementation using the inter-task paralle- For example, if T =512,then B = ⌈1000/512⌉ =2. lism method. As expected, we observe poor performance num Hence only two SMs are working with one block each. for both strategies, as shown in Figure 5. The data reuse However, theseSMs haveahigheroccupancy(1× strategy produces higher performance than the intuitive 512)/1024 = 50%, which compensates for the decrease strategy, because the number of permutations is reduced in the number of working SMs. Consequently, even byafactorof Q, the number of query sequences. To though there is a reduction in speedups, it is not alleviate the low occupancy problem, our proposed tile- directly proportional to the reduction in the number of based strategy uses a carefully tuned tile size that effec- active SMs. When the number of threads per block is tively increase the occupancy. Recall that, as a result of 64,for thesequencelengthof200,400,800,and 1600, tiling, we send 14 subject sequences and one query the best speedups are 52.14×, 56.36×, 73.39×, and sequence to the GPU in each round. After the permuta- 73.16×, respectively. tion step, we have 14 × 1000 alignments to be per- In brief, getting performance out of a GPU is about formed, assuming each subject sequence is permuted keeping the CUDA cores busy. Both inter-task paralle- 1000 times. Consequently, each SM has 1000 alignments lism and intra-task parallelism, under the situations of to perform. Also, note that, the maximum number of small data set being processed, seem to fail in this blocks that can be launched simultaneously on an SM is regard, but not necessarily for the same reason. 8. Therefore, when the number of threads per block is 64, the number of blocks per SM is min(⌈1024/64⌉,8) = Multi-pair PSSE result analysis 8, resulting in an occupancy of (8 × 64)/1024 = 50%, In the intuitive and data reuse strategies, most of the based on Equation (2) and (3). This is a significant SMs suffer from the same low occupancy as the single- improvement over the intuitive and data-reuse Figure 5 Performance of three strategies for multi-pair PSSE. All experiments are run using 2771 subject sequences and 86 query sequences from the CATH 2.3 database. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 10 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 strategies. The maximum theoretical occupancy of an method and related programs in CUDA is available for SM for the three strategies is given in Table 1. free academic use at http://cucis.ece.northwestern.edu/ Due to a high occupancy, the tile-based strategy using projects/PSSE/. single (dual) GPU(s) achieves speedups of 240.31 (338.63)×, 243.51 (361.63)×, 240.71 (363.99)×, and Conclusions 243.84 (369.95)× using 64, 128, 256, and 512 threads In this paper, we present a high performance accelerator per block, respectively, as shown in Figure 5. to estimate the pairwise statistical significance of local The billion cell updates per second (GCUPS) value is sequence alignment, which supports standard substitu- another commonly used performance measure in tion matrix like BLOSUM62 as well as PSSMs. bioinformatics [33]. The tile-based strategy using single Our accelerator harvests the computation power of (dual) GPU(s) achieves performance results in the many-core GPUs by using CUDA, which results in high range of 16.55 (23.34) to 16.79 (25.50) GCUPS, as end-to-end speedups for PSSE. We also demonstrate a shown in Figure 5. Although a direct comparison comparative performance analysis of single-pair and across different GPU implementations and hardware in multi-pair implementations. The proposed optimizations not fair, just for the sake of completeness, we cited and efficient framework are applicable to a wide variety below the performance in GCUPS reported by some of next-generation sequencing comparison based appli- existing implementations of the Smith-Waterman cations, such as, DNA sequence mapping and database alignment task. It is worthwhile to mention here that search. As the size of biological sequence databases are [46] has a peak performance of 4.65 to 8.99 GCUPS increasing rapidly, even more powerful high perfor- for various query lengths on an NVIDIA 9800 and [42] mance computing accelerator platforms are expected to has a peak performance of 3.5 GCUPS on a worksta- be more and more common and imperative for tion running two GeForce 8800 GTX. Our implemen- sequence analysis, for which our work can serve as a tation show higher GCUPS. meaningful stepping stone. In summary, low occupancy is known to interfere with the ability to hide latency on memory-bound kernels, Abbreviations causing performance degradation. However, increasing BLAST: Basic Local Alignment Search Tool; BLOSUM: BLOcks of Amino Acid occupancy does not necessarily increase performance. In SUbstitution Matrix; CUDA: Compute Unified Device Architecture; EVD: Extreme Value Distribution; GCUPS: Billion Cell Updates Per Second; GPU: general, once a 50% occupancy is achieved, further opti- Graphics Processing Unit; PSSM: Position Specific Scoring Matrix; PSSE: mization to gain additional occupancy has little effect Pairwise Statistical Significance Estimation; PSI BLAST: Position Specific on performance [37]. Our experiments verify this claim. Iterative BLAST; PSA: Pairwise Sequence Alignment; PSS: Pairwise Statistical Significance; SIMT: Single: Instruction, Multiple: Thread; SSM: Standard Since the GPU implementation presented in this paper Substitution Matrix SM: Streaming Multiprocessor; SW: Smith-Waterman. uses the same algorithm for PSSE (specifically the Smith-Waterman algorithm for getting alignment scores, Acknowledgements This work is supported in part by NSF award numbers CCF-0621443, OCI- the same fitting routine to get statistical parameters K 0724599, CCF-0833131, CNS-0830927, IIS-0905205, OCI-0956311, CCF- and l, and the same algorithm parameters) as in [11], 0938000, CCF-1043085, CCF-1029166, and OCI-1144061, and in part by DOE the retrieval accuracy of the proposed implementation is grants DE-FC02-07ER25808, DE-FG02-08ER25848, DE-SC0001283, DE- SC0005309, and DE-SC0005340.This work is also supported in part by expected to be identical to [11]. According to [11], pair- National Nature Science Foundation of China (NO. 60973118, NO. 61133016), wise statistical significance with standard substitution Sichuan provincial science and technology agency (NO. matrices performs at least comparable or significantly M110106012010HH2027), Ministry of Education of China (NO. 708078), and China Scholarship Council. better than database statistical significance (using This article has been published as part of BMC Bioinformatics Volume 13 BLAST, PSI-BLAST, and SSEARCH). Moreover, pair- Supplement 5, 2012: Selected articles from the First IEEE International wise statistical significance with PSSMs performs signifi- Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are cantly better than using standard substitution matrices, available online at http://www.biomedcentral.com/bmcbioinformatics/ and also better than PSI-BLAST using pre-constructed supplements/13/S5. position-specific substitution matrices. More details can Author details be found in [11]. The implementation of the proposed School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China. Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, USA. Table 1 The maximum occupancy for three strategies Authors’ contributions Threads/blocks 64 128 256 512 YZ conceptualized the study, carried out the design and implementation of Intuitive occupancy 12.5% 6.25% 6.25% 6.25% the algorithm, analyzed the results and drafted the manuscript; SM, AA, MMAP, WL, ZQ, and AC participated analysis of the results and contributed Data-reuse occupancy 12.5% 6.25% 6.25% 6.25% to revise the manuscript. All authors read and approved the final Tile-based occupancy 50% 100% 100% 100% manuscript. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 11 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 22. Agrawal A, Choudhary A, Huang X: Derived distribution points heuristic Competing interests for fast pairwise statistical significance estimation. Proceedings of the First The authors declare that they have no competing interests. ACM International Conference on Bioinformatics and Computational Biology, ACM 2010, 312-321. Published: 12 April 2012 23. Agrawal A, Misra S, Honbo D, Choudhary AN: Parallel pairwise statistical significance estimation of local sequence alignment using Message References Passing Interface library. Concurrency and Computation: Practice and 1. Roos DS: COMPUTATIONAL BIOLOGY: Bioinformatics-trying to swim in a Experience 2011, 23(17):2269-2279. sea of data. Science 2001, 291:1260-1261. 24. Yu Y, Gertz E, Agarwala R, Schäffer A, Altschul S: Retrieval accuracy, 2. Yooseph S, Sutton G, Rusch D, Halpern A, Williamson S, Remington K, statistical significance and compositional similarity in protein sequence Eisen J, Heidelberg K, Manning G, Li W, et al: The sorcerer II global ocean database searches. Nucleic Acids Research 2006, 34(20):5966. sampling expedition: expanding the universe of protein families. PLoS 25. Zuyderduyn S: Statistical analysis and significance testing of serial Biology 2007, 5(3):e16. analysis of gene expression data using a Poisson mixture model. BMC 3. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos JS, Bealer K, bioinformatics 2007, 8:282. Madden TL: BLAST+: architecture and applications. BMC Bioinformatics 26. Aleksandar P: Island method for estimating the statistical significance of 2009, 10:421. profile-profile alignment scores. BMC Bioinformatics 2009, 10. 4. Altschul S, Madden T, Schäffer A, Zhang J, Zhang Z, Miller W, Lipman D: 27. Karlin S, Altschul S: Methods for assessing the statistical significance of Gapped BLAST and PSI-BLAST: a new generation of protein database molecular sequence features by using general scoring schemes. search programs. Nucleic Acids Research 1997, 25(17):3389-3402. Proceedings of the National Academy of Sciences 1990, 87(6):2264. 5. Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, 28. O R, B R, H T: Rapid assessment of extremal statistics for gapped local Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein alignment. Proceedings of the Seventh International Conference on Intelligent database searches with composition-based statistics and other Systems for Molecular Biology 1999, 211-222. refinements. Nucleic Acids Research 2001, 29(14):2994-3005. 29. Eddy SR: Maximum likelihood fitting of extreme value distributions. 1997, 6. Yu Y, Altschul S: The construction of amino acid substitution matrices for [citeseer.ist.psu.edu/370503. html]. [Unpublished work]. the comparison of proteins with non-standard compositions. 30. Waterman M, Vingron M: Rapid and accurate estimates of statistical Bioinformatics 2005, 21(7):902-911. significance for sequence data base searches. Proceedings of the National 7. Pearson WR: Flexible sequence similarity searching with the FASTA3 Academy of Sciences 1994, 91(11):4625-4628. program package. Methods in molecular biology 2000, 132:185-219. 31. Zhang Y, Misra S, Honbo D, Agrawal A, keng Liao W, Choudhary A: Efficient 8. Mott R: Accurate formula for p-values of gapped local sequence and pairwise statistical significance estimation for local sequence alignment profile alignments. Journal of Molecular Biology 2000, 300:649-659. using GPU. IEEE 1st International Conference on Computational Advances in 9. Pearson W, Lipman D: Improved tools for biological sequence Bio and Medical Sciences (ICCABS) 2011, 226-231. comparison. Proceedings of the National Academy of Sciences 1988, 32. Samuel A, Malcolm M, Aaron G, Kevin G, Mahesh V, David R, Cesar A, 85(8):2444. James W, Owen W, Florian F: CloVR: A virtual machine for automated and 10. Pagni M, Jongeneel C: Making sense of score statistics for sequence portable sequence analysis from the desktop using cloud computing. alignments. Briefings in Bioinformatics 2001, 2:51-67. BMC Bioinformatics 2011, 12:1-15. 11. Agrawal A, Huang X: Pairwise statistical significance of local sequence 33. Liu Y, Maskell DL, Schmidt B: CUDASW++: optimizing Smith-Waterman alignment using sequence-specific and position-specific substitution sequence database searches for CUDA-enabled graphics processing matrices. IEEE/ACM Transactions on Computational Biology and units. BMC Research Notes 2009, 2:73. Bioinformatics 2011, 8:194-205. 34. Liu Y, Schmidt B, Maskell DL: CUDASW++2.0: enhanced Smith-Waterman 12. Mitrophanov A, Borodovsky M: Statistical significance in biological protein database search on CUDA-enabled GPUs based on SIMT and sequence analysis. Briefings in Bioinformatics 2006, 7:2-24. virtualized SIMD abstractions. BMC Research Notes 2010, 3:93. 13. Pearson W: Empirical statistical estimates for sequence similarity 35. Agrawal A, Misra S, Honbo D, Choudhary AN: MPIPairwiseStatSig: parallel searches. Journal of Molecular Biology 1998, 276:71-84. pairwise statistical significance estimation of local sequence alignment. 14. Altschul SF, Bundschuh R, Olsen R, Hwa T: The estimation of statistical Proceedings of the 19th ACM International Symposium on High Performance parameters for local alignment score distributions. Nucleic Acids Research Distributed Computing 2010, 470-476. 2001, 29(2):351-361. 36. Honbo D, Agrawal A, Choudhary AN: Efficient pairwise statistical 15. Agrawal A, Brendel V, Huang X: Pairwise statistical significance and significance estimation using FPGAs. Proceedings of BIOCOMP 2010, empirical determination of effective gap opening penalties for protein 2010:571-577. local sequence alignment. International Journal of Computational Biology 37. NVIDIA: NVIDIA CUDA C: Best Practices Guide 4.1 2011. and Drug Design 2008, 1(4):347-367. 38. NVIDIA: NVIDIA CUDA C: Programming Guide 4.1 2011. 16. Poleksic A, Danzer JF, Hambly K, Debe DA: Convergent island statistics: a 39. Ryoo S, Rodrigues C, Baghsorkhi S, Stone S, Kirk D, Hwu W: Optimization fast method for determining local alignment score significance. principles and application performance evaluation of a multithreaded Bioinformatics 2005, 21(12):2827-2831. GPU using CUDA. Proceedings of the 13th ACM SIGPLAN Symposium on 17. Agrawal A, Huang X: PSIBLAST PairwiseStatSig: reordering PSI-BLAST hits Principles and Practice of Parallel Programming, ACM 2008, 73-82. using pairwise statistical significance. Bioinformatics 2009, 25(8):1082-1083. 40. Smith T, Waterman M: Identification of common molecular 18. Agrawal A, Huang X: Pairwise statistical significance of local sequence subsequences. Journal of Molecular Biology 1981, 147:195-197. alignment using multiple parameter sets and empirical justification of 41. Rognes T, Seeberg E: Six-fold speed-up of Smith-Waterman sequence parameter set change penalty. BMC bioinformatics 2009, 10(Suppl 3):S1. database searches using parallel processing on common 19. Sierk ML, Smoot ME, Bass EJ, Pearson WR: Improving pairwise sequence microprocessors. Bioinformatics 2000, 16(8):699-706. alignment accuracy using near-optimal protein sequence alignments. 42. Manavski S, Valle G: CUDA compatible GPU cards as efficient hardware BMC Bioinformatics 2010, 11:146. accelerators for Smith-Waterman sequence alignment. BMC Bioinformatics 20. Agrawal A, Choudhary A, Huang X: Sequence-specific sequence 2008, 9(Suppl 2):S10. comparison using pairwise statistical significance. Software Tools and 43. Liu W, Schmidt B, Voss G, Müller-Wittig W: Streaming algorithms for Algorithms for Biological Systems, Volume 696 of Advances in Experimental biological sequence alignment on GPUs. IEEE Transactions on Parallel and Medicine and Biology Springer New York; 2011, 297-306. Distributed Systems 2007, 18(9):1270-1281. 21. Agrawal A, Brendel V, Huang X: Pairwise statistical significance versus 44. Sierk ML, Pearson WR: Sensitivity and selectivity in protein structure database statistical significance for local alignment of protein comparison. Protein Science 2004, 13(3):773-785. sequences. Bioinformatics Research and Applications 2008, 50-61. Zhang et al. BMC Bioinformatics 2012, 13(Suppl 5):S3 Page 12 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S3 45. Zhang Y, Patwary M, Misra S, Agrawal A, Liao W, Choudhary AN: Enhancing parallelism of pairwise statistical significance estimation for local sequence alignment. Workshop on Hybrid Multi-core Computing 2011, 1-8. 46. Ligowski L, Rudnicki W: An efficient implementation of Smith Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. IEEE International Symposium on Parallel & Distributed Processing 2009, 1-8. doi:10.1186/1471-2105-13-S5-S3 Cite this article as: Zhang et al.: Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU’s power. BMC Bioinformatics 2012 13(Suppl 5):S3. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color ﬁgure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit

Journal

BMC Bioinformatics – Springer Journals

Published: Apr 12, 2012

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power

Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power

References (56)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies