Multi-dimensional characterization of electrostatic surface potential computation on graphics processors

Abstract

Background: Calculating the electrostatic surface potential (ESP) of a biomolecule is critical to understanding biomolecular function. Because of its quadratic computational complexity (as a function of the number of atoms in a molecule), there have been continual efforts to reduce its complexity, either by improving the algorithm or the underlying hardware on which the calculations are performed.

Results: We present the combined effect of (i) a multi-scale approximation algorithm, known as hierarchical charge partitioning (HCP), applied to the calculation of ESP and (ii) its mapping onto a graphics processing unit (GPU). To date, most molecular modeling algorithms perform an artificial partitioning of biomolecules into a grid/lattice on the GPU. In contrast, HCP takes advantage of the natural partitioning in biomolecules, which in turn better facilitates its mapping onto the GPU. Specifically, we characterize the effect of known GPU optimization techniques, such as the use of shared memory. In addition, we demonstrate how the cost of divergent branching on a GPU can be amortized across algorithms like HCP in order to deliver a massive performance boost.

Conclusions: We accelerated the calculation of ESP by 25-fold solely by parallelization on the GPU. Combining GPU and HCP resulted in a speedup of up to 1,860-fold for our largest molecular structure. The baseline for these speedups is a hand-tuned, SSE-optimized implementation parallelized across 16 cores on the CPU. The use of the GPU does not degrade the accuracy of our results.

* Correspondence: [email protected]; [email protected]
Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, USA. Full list of author information is available at the end of the article.
© 2012 Daga and Feng; licensee BioMed Central Ltd.

Background

Electrostatic interactions in a molecule are of utmost importance for analyzing its structure [1-3] as well as functional activities like ligand binding [4], complex formation [5] and proton transport [6]. The calculation of electrostatic interactions continues to be a computational bottleneck primarily because the potential is long-range by nature [7]. As a consequence, efficient approximation algorithms have been developed to reduce this computational complexity, e.g., the spherical cut-off method [8], the particle mesh Ewald (PME) method [9], the fast multipole method [10] and hierarchical charge partitioning (HCP) [11]. These approximation algorithms can be parallelized on increasingly ubiquitous multi- and many-core architectures to deliver even greater performance benefits.

Widespread adoption of general-purpose graphics processing units (GPUs) has made them popular as accelerators for parallel programs [12]. Their increased popularity has been assisted by (i) phenomenal computing power, (ii) a superior performance/dollar ratio, and (iii) a compelling performance/watt ratio. For example, an 8-GPU cluster, costing a few thousand dollars, can simulate 52 ns/day of the JAC benchmark, compared to 46 ns/day on the Kraken supercomputer, which is housed at Oak Ridge National Lab and costs millions of dollars [13]. The emergence of GPUs as an attractive high-performance computing platform is also evident from the fact that three of the top five fastest supercomputers on the Top500 list employ GPUs [14].
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Although the use of approximation algorithms can improve performance, they often increase the memory-boundedness of the application. Achieving optimum performance with a memory-bound application is challenging due to the 'memory wall' [15]. The effect of the memory wall is more severe on GPUs because of the extremely high latency of global memory accesses (on the order of 600-800 cycles). Furthermore, for maximum performance on the GPU, execution paths on each GPU computational unit need to be synchronized. However, an important class of approximation algorithms, multi-scale approximations, results in highly asynchronous execution paths due to the introduction of a large number of divergent branches, which depend upon the relative distances between interacting atoms.

To test these expectations, we present a hybrid approach wherein we implement the robust multi-scale HCP approximation algorithm in a molecular modeling application called GEM [7] and map it onto a GPU. We counteract the high memory-boundedness of HCP by explicitly managing data movement in a way that helps us achieve significantly improved performance. In addition, we employ standard GPU optimization techniques, such as coalesced memory accesses and the use of shared memory, and quantify the effectiveness of each optimization in our application. HCP delivers excellent performance on the GPU despite the introduction of divergent branches; this is attributed to the reduction in memory transactions, which compensates for the divergent branching.

Recently, several molecular modeling applications have used the GPU to speed up electrostatic computations. Rodrigues et al. [16] and Stone et al. [17] demonstrate that the estimation of electrostatic interactions can be accelerated by the use of the spherical cut-off method and the GPU. In [18], Hardy et al. used a multi-scale summation method on the GPU. Each of the aforementioned implementations artificially maps the n atoms of a molecule onto an m-point lattice and then applies its respective approximation algorithm. By doing so, they reduce the time complexity of the computation from O(n²) to O(nm). In contrast, we use HCP, which performs approximations based on the natural partitioning of biomolecules. The advantage of using the natural partitioning is that even with the movement of atoms during molecular dynamics simulations, the hierarchical nature is preserved, whereas with a lattice, atoms may move in and out of lattice cells during the simulation. Our implementation realizes a maximum of 1,860-fold speedup over a hand-tuned, SSE-optimized implementation on a modern 16-core CPU, without any loss in the accuracy of the results.

Methods

Electrostatics and the hierarchical charge partitioning approximation

We use the analytic linearized Poisson-Boltzmann (ALPB) model to perform electrostatic computations [19]. Equation (1) computes the electrostatic potential at a surface point (vertex) of the molecule due to a single point charge, q_i. The potential at each vertex is the summation of potentials due to all charges in the system, and if there are P vertices, the total surface potential is the summation of the potentials at all vertices.

    \phi_i^{outside} = \frac{q_i}{\epsilon_{in}\,\left(1 + \alpha\,\epsilon_{in}/\epsilon_{out}\right)} \left[ \frac{1+\alpha}{d_i} - \frac{\alpha\left(1 - \epsilon_{in}/\epsilon_{out}\right)}{r} \right]    (1)

Computing the potential at P vertices results in a time complexity of O(NP), where N is the number of atoms in the molecule. To reduce the time complexity, we apply an approximation algorithm called hierarchical charge partitioning (HCP), which reduces the upper bound of the computation to O(P log N).
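For reference, the following is a minimal sketch of how the direct O(NP) evaluation might be mapped onto the GPU: one thread computes the potential at one vertex by summing Equation (1) over all atoms. The kernel and helper names, the packed float4 layout (positions plus charge or the per-vertex quantity r in the fourth component), and the exact roles assigned to d_i and r are illustrative assumptions, not the authors' GEM kernel.

```cuda
// Sketch (not the authors' implementation): direct all-pairs ESP on the GPU.
__device__ float alpb(float q, float d, float r,
                      float eps_in, float eps_out, float alpha)
{
    // Contribution of one point charge per Equation (1).
    float ratio = eps_in / eps_out;
    return q / (eps_in * (1.0f + alpha * ratio)) *
           ((1.0f + alpha) / d - alpha * (1.0f - ratio) / r);
}

__global__ void esp_direct(const float4 *vertices,  // xyz = vertex, w = r (assumption)
                           const float4 *atoms,     // xyz = atom, w = charge q_i
                           float *phi, int n_vertices, int n_atoms,
                           float eps_in, float eps_out, float alpha)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per vertex
    if (v >= n_vertices) return;

    float4 vert = vertices[v];
    float sum = 0.0f;
    for (int a = 0; a < n_atoms; ++a) {              // O(N) work per vertex -> O(NP) total
        float4 atom = atoms[a];
        float dx = vert.x - atom.x, dy = vert.y - atom.y, dz = vert.z - atom.z;
        float d  = sqrtf(dx * dx + dy * dy + dz * dz);
        sum += alpb(atom.w, d, vert.w, eps_in, eps_out, alpha);
    }
    phi[v] = sum;   // per-vertex potential; the total ESP is reduced on the CPU
}
```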
HCP [11] exploits the natural partitioning of biomolecules into constituent structural components in order to speed up the computation of electrostatic interactions with limited and controllable impact on accuracy. Biomolecules can be systematically partitioned into multiple molecular complexes, which consist of multiple polymer chains or subunits, which in turn are made up of multiple amino acid or nucleotide groups, as illustrated in Figure 1. Atoms represent the lowest level in the hierarchy, while the highest level depends on the problem.

Figure 1. Illustration of the hierarchical charge partitioning (HCP) of biomolecular structures. A biomolecular structure is partitioned into multiple hierarchical levels of components based on the natural organization of biomolecules. The charge distribution of distant components is approximated by a small number of charges, while nearby atoms are treated exactly.

Briefly, HCP works as follows. The charge distribution of components, other than at the atomic level, is approximated by a small set of point charges. The electrostatic effect of distant components is calculated using the smaller set of point charges, while the full set of atomic charges is used for computing electrostatic interactions within nearby components. The distribution of charges used for each component varies with distance from the point in question: the farther away the component, the fewer charges are used to represent it. The actual speedup from using HCP depends on the specific hierarchical organization of the biomolecular structure, as that governs the number of memory accesses, computations and divergent branches on the GPU. Under conditions consistent with the hierarchical organization of realistic biomolecular structures, the top-down HCP algorithm (Figure 2) scales as O(N log N), where N is the number of atoms in the structure. For large structures, HCP can be several orders of magnitude faster than the exact O(N²) all-atom computation. A detailed description of the HCP algorithm can be found in Anandakrishnan et al. [11].

Figure 2. Illustration of the HCP multi-scale algorithm. Predefined threshold distances (h1, h2, h3) determine the level of approximation used by HCP. This top-down algorithm results in ~N log N scaling, compared to ~N² scaling without HCP.
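One way to picture the hierarchy described above is as a set of flat arrays, one per level, where each component stores an approximate point charge and the index range of its children. The layout below is an assumption for illustration only (the paper does not show its data structures); it is the kind of representation the sketches later in this article build on.

```cuda
// Sketch of a flattened HCP hierarchy (illustrative assumption, not the
// authors' actual data layout). A thread walking the hierarchy stops
// descending at any level whose distance threshold is satisfied and uses
// that level's approximate charge instead of the atoms beneath it.
struct Component {
    float x, y, z, q;   // position and approximate point charge of the component
    int   first, last;  // [first, last) range of its children at the next level
};

struct HcpStructure {
    Component *complexes;  int n_complexes;  // level 1: molecular complexes
    Component *strands;                      // level 2: polymer chains/subunits
    Component *residues;                     // level 3: amino acid/nucleotide groups
    float4    *atoms;      int n_atoms;      // level 4: xyz + atomic charge
};
```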
GPU architecture and programming interface

For this study, we used state-of-the-art NVIDIA GPUs based on the Compute Unified Device Architecture (CUDA) framework. CUDA is a framework developed by NVIDIA that facilitates the implementation of general-purpose applications on GPUs. Below is a brief description of the NVIDIA GPU hardware architecture and the CUDA programming interface.

NVIDIA GPUs consist of 240-512 execution units, which are grouped into 30 and 16 streaming multiprocessors (SMs) on the GT200 and Fermi architectures, respectively. An overview of these architectures is shown in Figure 3. Multiple threads on a GPU execute the same instruction, resulting in a single-instruction, multiple-thread (SIMT) architecture. This makes the GPU very suitable for applications that exhibit data parallelism, i.e., where the operation on one data element is independent of the operations on other data elements. It is therefore well suited for molecular modeling, where the potential at one vertex can be computed independently of all others.

Figure 3. Overview of NVIDIA GPU architectures. The Fermi architecture consists of 16 streaming multiprocessors (SMs); each SM has 64 KB of on-chip memory, configurable as 16 KB of shared memory and 48 KB of L1 data cache or vice versa, along with 128 KB of L2 data cache. GDDR5 memory controllers facilitate data accesses to and from global memory. The GT200 architecture consists of 30 SMs; each SM has 16 KB of shared memory but no data caches, containing instead L1/L2 texture memory space. GDDR3 memory controllers facilitate global memory accesses.

On NVIDIA GPUs, threads are organized into groups of 32, referred to as a warp. When threads within a warp follow different execution paths, such as when encountering a conditional, a divergent branch takes place. Execution of these groups of threads is serialized, thereby affecting performance. On a GPU, computations are much faster than on a typical CPU, but memory accesses and divergent-branching instructions are slower. The effect of slower memory accesses and divergent branching can be mitigated by launching thousands of threads on the GPU, so that while one thread is waiting on a memory access, other threads can perform meaningful computations.

Every GPU operates on a memory space known as global memory. Data that needs to be operated on by the GPU must first be transferred to this memory. The transfer takes place over the PCI-e bus, making it an extremely slow process; therefore, memory transfers should be kept to a minimum to obtain optimum performance. Also, accessing data from GPU global memory costs 400-600 cycles, so on-chip memory should be used to reduce global memory traffic. On the GT200 architecture, each SM contains a high-speed, 16 KB scratch-pad memory known as shared memory. Shared memory enables extensive reuse of data, thereby reducing off-chip traffic. On the latest Fermi architecture, each SM contains 64 KB of on-chip memory, which can be configured either as 16 KB of shared memory and 48 KB of L1 cache, or vice versa. Each SM also contains an L2 cache of size 128 KB. The hierarchy of caches on the Fermi architecture allows for more efficient global memory access patterns.

CUDA provides a C/C++ language extension with application programming interfaces (APIs). A CUDA program is executed by a kernel, which is effectively a function call to the GPU, launched from the CPU. CUDA logically arranges the threads into blocks, which are in turn grouped into a grid. Each thread has its own ID, which provides a one-to-one mapping between threads and data elements. Each block of threads is executed on an SM, and its threads share data using the SM's shared memory.

Mapping HCP onto GPU

The problem of computing the molecular surface potential is inherently data parallel, i.e., the potential at one point on the surface can be computed independently from the potential at any other point. This works to our advantage, as such applications map very well onto the GPU. We begin by offloading all the necessary data (coordinates of vertices and atoms and the approximated point charges) to GPU global memory. To ensure efficient global memory access patterns, we flattened the data structures; by flattening we mean that all arrays of structures were transformed into arrays of primitives, so that the threads in a half-warp (16 threads) access data from contiguous memory locations [20,21].
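A minimal sketch of this flattening step is shown below, assuming a hypothetical Vertex structure on the host; the original arrays of structures are repacked into arrays of primitive values (here float4, though separate per-coordinate arrays would serve the same purpose) so that consecutive threads read consecutive words.

```cuda
// Sketch of the "flattening" described above (names and layout are
// assumptions). An array of structures becomes an array of float4 so that
// the 16 threads of a half-warp load from contiguous memory locations.
#include <cuda_runtime.h>

struct Vertex { float x, y, z, r; };   // hypothetical host-side structure

void flatten_vertices(const Vertex *in, float4 *out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = make_float4(in[i].x, in[i].y, in[i].z, in[i].r);
}

// The flattened array is then copied once to GPU global memory:
//   float4 *d_vertices;
//   cudaMalloc(&d_vertices, n * sizeof(float4));
//   cudaMemcpy(d_vertices, h_flat_vertices, n * sizeof(float4),
//              cudaMemcpyHostToDevice);
```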
The GPU kernel is then executed, wherein each thread is assigned the task of computing the electrostatic potential at one vertex. At this point, the required amount of shared memory, i.e., the number of threads in a block times the size of the coordinates of each vertex, is allocated on each streaming multiprocessor (SM) of the GPU. The kernel is launched as many times as required, until all the vertices are exhausted, with implicit GPU synchronization between successive kernel launches. On the GPU side, each kernel thread copies the coordinates of its assigned vertex into shared memory. This reduces the number of global memory loads, as explained in the Results section.
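The sketch below illustrates this per-thread staging of vertex coordinates in dynamically allocated shared memory; the kernel body is abbreviated and the names are assumptions, not the authors' code.

```cuda
// Sketch of shared-memory staging: each thread loads its vertex once from
// global memory into shared memory and reuses it at every HCP level.
__global__ void esp_hcp(const float4 *vertices, float *phi, int n_vertices
                        /* , flattened HCP hierarchy arrays ... */)
{
    extern __shared__ float4 s_vert[];            // blockDim.x * sizeof(float4) bytes

    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n_vertices) return;

    s_vert[threadIdx.x] = vertices[v];            // one global load per vertex
    float4 vert = s_vert[threadIdx.x];            // later reuse comes from shared memory

    // ... walk the component hierarchy using vert, as in the earlier sketches ...
    phi[v] = 0.0f;                                // placeholder for the HCP result
}

// Host-side launch, requesting the per-block shared-memory size explicitly:
//   int threads = 128;
//   int blocks  = (n_vertices + threads - 1) / threads;
//   esp_hcp<<<blocks, threads, threads * sizeof(float4)>>>(d_vertices,
//                                                          d_phi, n_vertices);
```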
The limited amount of per-SM shared memory does not allow us to offload the coordinates of the constituent components of the biomolecule, and hence the coordinates of complexes, strands, residues, and atoms have to remain in global memory. The HCP algorithm is then applied to compute the electrostatic potential, and the result is stored in global memory. All threads perform this computation in parallel, and after they finish, the computed potential at each vertex is transferred back to CPU memory, where a reduction (sum) is performed to calculate the total molecular surface potential.

According to the algorithm, evaluating the distance between the vertex and the molecular components requires each thread to access coordinates from global memory. This implies that the potential calculation at each vertex necessitates multiple global memory accesses, which makes HCP memory-bound on the GPU.

HCP also introduces a significant number of divergent branches on the GPU. This occurs because for some threads in a warp it may be possible to apply the HCP approximation, while for others it may not. These two groups of threads therefore diverge and follow their respective paths, resulting in a divergent branch. In the Results section, we show how the associated cost of divergent branching in HCP on the GPU can be amortized to deliver a performance boost.

Test setup

To illustrate the scalability of our application, we used four structures of varied sizes. The characteristics of these structures are presented in Table 1. The GPU implementation was tested on the present generation of NVIDIA GPUs.

Table 1. Characteristics of input structures
Structure                          #Vertices   #Complexes   #Strands   #Residues   #Atoms
H helix myoglobin, 1MBO            5,884       1            1          24          382
nucleosome core particle, 1KX5     258,797     1            10         1,268       25,086
chaperonin GroEL, 2EU1             898,584     1            14         7,336       109,802
virus capsid, 1A6C                 593,615     1            60         30,780      476,040

The host machine consists of an Intel E8200 quad-core CPU running at 2.33 GHz with 4 GB of DDR2 SDRAM. The operating system on the host is a 64-bit version of the Ubuntu 9.04 distribution running the 2.6.28-16 generic Linux kernel. Programming and access to the GPU were provided by the CUDA 3.1 toolkit and SDK with NVIDIA driver version 256.40. For the sake of accuracy of the results, all processes requiring a graphical user interface were disabled to limit resource sharing on the GPU.

We ran our tests on an NVIDIA Tesla C1060 graphics card (GT200 GPU) and an NVIDIA Fermi Tesla C2050 graphics card. An overview of both GPUs is presented in Table 2.

Table 2. Overview of GPUs used
GPU                                        Tesla C1060            Fermi Tesla C2050
Streaming processor cores                  240                    448
Streaming multiprocessors (SMs)            30                     16
Memory bus type                            GDDR3                  GDDR5
Device memory size                         4096 MB                3072 MB
Shared memory (per SM)                     16 KB                  Configurable 48 KB or 16 KB
L1 cache (per SM)                          None                   Configurable 16 KB or 48 KB
L2 cache                                   None                   768 KB
Double precision floating point capability 30 FMA ops/clock       256 FMA ops/clock
Single precision floating point capability 240 FMA ops/clock      512 FMA ops/clock
Special function units (per SM)            2                      4
Compute capability                         1.3                    2.0

Results and discussion

In this section, we present an analysis of (i) the impact of using shared memory, (ii) the impact of divergent branching, (iii) the speedups realized by our implementation, and (iv) the accuracy of our results. On the CPU, timing information was gathered by placing time-checking calls around the computational kernel, excluding the I/O required for writing the results. On the GPU, execution time was measured using the cudaEventRecord function call. For a fair comparison, the time for offloading the data onto GPU global memory and storing the results back on the CPU was taken into account along with the kernel execution time. Single precision was used on both platforms. All numbers presented are an average of 10 runs performed on each platform. For HCP, the 1st-level threshold was set to 10 Å and the 2nd-level threshold was fixed at 70 Å.
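A sketch of this timing methodology is shown below (our assumption of a plausible harness, not the authors' code), using the direct kernel from the earlier sketch; the same event bracketing applies to the HCP kernel. The host-to-device copies, the kernel, and the device-to-host copy of the per-vertex potentials all fall inside the timed region, so data movement is charged to the GPU time.

```cuda
#include <cuda_runtime.h>

// Returns the elapsed time in milliseconds for one timed run; the paper
// reports the average of 10 such runs.
float time_esp_run(const float4 *h_vertices, const float4 *h_atoms, float *h_phi,
                   float4 *d_vertices, float4 *d_atoms, float *d_phi,
                   int n_vertices, int n_atoms,
                   float eps_in, float eps_out, float alpha)
{
    int threads = 128;                                  // illustrative block size
    int blocks  = (n_vertices + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d_vertices, h_vertices, n_vertices * sizeof(float4),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_atoms, h_atoms, n_atoms * sizeof(float4),
               cudaMemcpyHostToDevice);
    esp_direct<<<blocks, threads>>>(d_vertices, d_atoms, d_phi,
                                    n_vertices, n_atoms, eps_in, eps_out, alpha);
    cudaMemcpy(h_phi, d_phi, n_vertices * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);             // milliseconds
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```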
Impact of using shared memory

At every approximation level, HCP reuses the vertex coordinates to compute the distance between the vertex and the molecular components. Therefore, in the worst case, when no approximation can be applied, the same data is accessed four times from global memory (due to the four levels in the molecular hierarchy). We used shared memory to reduce these global memory accesses. The percentage reduction in the number of global memory loads due to the use of shared memory on the GT200 architecture, with and without the HCP approximation, is shown in Table 3. The baseline for each column is the respective implementation (without HCP and with HCP) without the use of shared memory. These numbers were taken from the CUDA Visual Profiler provided by NVIDIA [22].

Table 3. Percentage reduction in the number of global memory loads
Structure                     Without HCP    With HCP
H helix myoglobin             50%            32%
nucleosome core particle      50%            62%
chaperonin GroEL              50%            84%
virus capsid                  50%            96%

From the table, we note that global memory loads are reduced by 50% for all structures when the HCP approximation is not used, whereas with HCP the reduction varies from structure to structure. This can be reasoned as follows. When no approximation is applied, the coordinates of the vertices and of all atoms are accessed from global memory, which requires cycling through the residue groups. Without shared memory, each vertex coordinate is therefore loaded twice, once for the residue and once for the atom, whereas with shared memory it is loaded only once, i.e., when copying it into shared memory, resulting in a 50% reduction in global memory loads.

With HCP, however, the number of times a vertex coordinate is loaded from global memory depends upon the structure, because the effective number of computations to be performed differs for each structure. For example, for a structure handled at the 1st level of approximation and without shared memory, the vertex coordinates would be loaded three times from global memory: (i) to compute the distance to the complex, (ii) to compute the distance to the strand, and (iii) to compute the distance to the residue; with shared memory they would be accessed just once. Similarly, for a structure with no approximation, the vertex would be accessed four times without shared memory. The table therefore suggests that the fewest components could be approximated for the virus capsid, and hence it shows the maximum percentage reduction.

Use of shared memory resulted in a drastic reduction in the number of global loads and hence provided about a 2.7-fold speedup for our application.

Impact of divergent branching

Divergent branching on a GPU occurs when the threads of a warp follow different execution paths. In our GPU implementation, each thread takes charge of one vertex and, as shown in Figure 4, it is possible for threads within a warp to follow different execution paths, thereby introducing a divergent branch. Here we quantify the cost of the divergent branches that are introduced. For ease of exposition, we limit our analysis to one level of HCP, but it can be extended.

Figure 4. Divergent branching due to the HCP approximation. Our GPU implementation causes divergent branches because, for every thread, there are two probable execution paths; a divergent branch occurs when threads within a warp take different paths.
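To make Figure 4 concrete, the sketch below spells out the two execution paths for one residue-level component, reusing the alpb() helper and the hypothetical names from the earlier sketches (h1, the float4 layout, and the index range are assumptions). Path #1 makes a single call, whereas Path #2 loops over the residue's atoms, which is why its time dominates in the analysis that follows.

```cuda
// Illustrative device function for the divergent branch of Figure 4.
__device__ float residue_contribution(float4 vert, float4 res_charge,
                                       const float4 *atoms,
                                       int first_atom, int last_atom, float h1,
                                       float eps_in, float eps_out, float alpha)
{
    float dx = vert.x - res_charge.x, dy = vert.y - res_charge.y,
          dz = vert.z - res_charge.z;
    float dist = sqrtf(dx * dx + dy * dy + dz * dz);
    float sum  = 0.0f;

    if (dist > h1) {
        // Path #1: approximation applies -- one call, time t1
        sum += alpb(res_charge.w, dist, vert.w, eps_in, eps_out, alpha);
    } else {
        // Path #2: no approximation -- one call per atom, time t2
        for (int a = first_atom; a < last_atom; ++a) {
            float ax = vert.x - atoms[a].x, ay = vert.y - atoms[a].y,
                  az = vert.z - atoms[a].z;
            sum += alpb(atoms[a].w, sqrtf(ax * ax + ay * ay + az * az),
                        vert.w, eps_in, eps_out, alpha);
        }
    }
    return sum;
}
```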
As the execution paths are serialized, the time taken for the execution of a divergent branch, denoted t_{divBranch}, can be characterized as follows:

    t_{divBranch} = t_1 + t_2    (2)

where t_i denotes the time taken by execution path i, as shown in Figure 4. From the figure, both execution paths perform the similar task of calculating the potential. Path #1 calls the function calcPotential() just once, while in Path #2, calcPotential() is called from within a loop that iterates over all atoms of the residue. Hence, the time for a divergent branch is directly proportional to the number of atoms in the molecule:

    t_2 \gg t_1 \;\Rightarrow\; t_{divBranch} \approx t_2 \approx \#Atoms \times T_{calcPotential} \;\Rightarrow\; t_{divBranch} \propto \#Atoms    (3)

Thus, the total time for all divergent branches in the system is

    T_{divBranch} = \#DivBranches \times t_{divBranch} \;\Rightarrow\; T_{divBranch} \propto \#Atoms    (4)

From (4), the cost of divergent branches should be greatest for the molecule with the most atoms, but Figure 5 shows that this is not the case. The figure presents the speedups achieved due to HCP on the CPU as well as on the GPUs, with the number of atoms increasing from left to right. As all the speedups are positive, we can safely infer that the GPU is indeed beneficial for HCP despite the introduction of divergent branches. We also note that the speedup achieved due to HCP on GPUs increases with the number of atoms in the structure, contradicting (4). Hence, there must be some aspect that compensates for the cost of the introduced divergent branches. As HCP is a memory-bound application, the number of memory transactions dominates the execution time. Looking back at the algorithm (in the Methods section), we observe that HCP reduces the number of memory transactions: by applying the approximation, only the coordinates of the higher-level component need to be fetched from global memory, which compensates for the cost of the divergent branches. The execution time of the entire application with HCP can be expressed as

    T_{WithHCP} = T_{WithoutHCP} - T_{Mem} + T_{divBranch}    (5)

where T_{Mem} denotes the global memory access time saved by the approximation.

Figure 5. Speedup due to the HCP approximation. Even with the occurrence of divergent branches, the speedup due to HCP is positive on the GPU, which suggests that some aspect amortizes the cost of the introduced divergent branches. The speedup is maximum for the largest structure, i.e., the virus capsid. Baseline: the corresponding implementation on each platform without the HCP approximation.

HCP improves performance on the GPU if the gain in time due to reduced memory transactions is greater than the cost of the divergent branches. However, computing memory access times on the GPU is an extremely difficult task, as one has no knowledge of how warps are scheduled, which is essential because the warps send access requests to the memory controllers. Hence, no direct method to measure global memory access times exists. We therefore used an indirect approach and measured the reduction in memory transactions as well as the increase in divergent branches for our application. These numbers were obtained from the CUDA Visual Profiler provided by NVIDIA and are presented in Table 4 [22]. The memory transactions in the table are the sum of 32-, 64- and 128-byte load and store transactions per SM, and the number of divergent branches is the number introduced on one SM.

Table 4. Impact of the HCP approximation on memory transactions and divergent branches
Structure                     Decrease in # of mem. transactions    Increase in # of divergent branches
H helix myoglobin             95,800                                34
nucleosome core particle      119,507,436                           4,635
chaperonin GroEL              1,831,793,578                         25,730
virus capsid                  5,321,785,506                         22,651

From the table, we see that the reduction in memory transactions is orders of magnitude greater than the increase in divergent branches. We also note that the number of memory transactions eliminated per divergent branch introduced is maximum for the capsid, which is why HCP+GPU is most effective for the capsid. Figures 6 and 7 corroborate this, and hence we can attest that it is the reduction in memory transactions that makes the GPU favorable for HCP. This proves that even an algorithm with divergent branching can benefit from the GPU, provided some aspect amortizes the cost of the introduced divergent branches.
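To make the trade-off in Equation (5) concrete, the savings per introduced divergent branch implied by Table 4 can be computed directly (ratios rounded):

```latex
\frac{\Delta\,\text{mem. transactions}}{\Delta\,\text{divergent branches}} \approx
\begin{cases}
95{,}800 / 34 \approx 2.8\times 10^{3} & \text{H helix myoglobin}\\
119{,}507{,}436 / 4{,}635 \approx 2.6\times 10^{4} & \text{nucleosome core particle}\\
1{,}831{,}793{,}578 / 25{,}730 \approx 7.1\times 10^{4} & \text{chaperonin GroEL}\\
5{,}321{,}785{,}506 / 22{,}651 \approx 2.3\times 10^{5} & \text{virus capsid}
\end{cases}
```

The ordering of these ratios matches the ordering of the HCP+GPU speedups reported below, with the virus capsid benefiting the most.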
Speedup

Figures 6 and 7 present the speedups achieved by our implementation on the NVIDIA Tesla C1060 and NVIDIA Fermi Tesla C2050 GPUs, respectively. Both figures present speedup over the CPU implementation optimized with hand-tuned SSE intrinsics and parallelized across 16 cores, without the use of any approximation algorithm. Speedups achieved by the GPU alone, as well as by the combination of GPU and HCP, are presented for all four structures.

From both figures, we note that the speedup due to the GPU alone is almost constant for all structures barring Mb.HHelix. This is because Mb.HHelix is an extremely small structure and does not require enough GPU threads for the computation of its molecular surface potential, thereby leaving the GPU under-utilized. This phenomenon is prominent in the case of the Fermi Tesla C2050, where it actually results in a slowdown due to under-utilization of the GPU. For the other structures, the threshold on the number of threads is met and similar speedup is achieved in both figures. The observed speedup is around 11-fold on the Tesla C1060, whereas on the Tesla C2050 it is around 25-fold. The increased speedup on the C2050 may be attributed to several architectural differences between the Fermi and GT200 GPUs, such as the ability for concurrent kernel execution, ECC support and fewer SMs. However, the architectural feature that we feel has the most impact for this algorithm is the presence of a hierarchy of caches on Fermi, as they allow for greater exploitation of data locality. With no approximation, all atoms need to be accessed sequentially, making the caches play an important role; hence the Fermi Tesla C2050 is more effective.

As explained in a previous section, the application speedup due to the combination of GPU and HCP increases with the number of memory transactions eliminated per divergent branch introduced. From Table 4, the reduction in memory transactions is maximum for the virus capsid, and hence it attains the maximum speedup; the next highest reduction is for 2EU1, and hence the next highest speedup, and so on. Our application achieves up to a 1,860-fold speedup with HCP on the Tesla C1060 for the capsid, while the corresponding speedup on the Fermi Tesla C2050 is approximately 1,600-fold. The actual execution time of our implementation on both GPUs is less than 1 second.

The speedup achieved with HCP on the Tesla C2050 is less than that achieved on the Tesla C1060 because, with HCP, the algorithm fails to take advantage of the caches present on Fermi as before: not all memory requests are sequential, as coordinates of both atoms and higher-level components are required, making the caches less potent. Speedups for the without-HCP version are almost consistent across both GPUs because it introduces no divergent branches.

Figure 6. Speedup on the NVIDIA Tesla C1060. The speedup due to the GPU alone is almost constant because, once the threshold for the number of threads that can be launched is met, there is no further increase in speedup. The speedup due to HCP+GPU increases with the size of the structure due to the O(N log N) scaling of the HCP approximation. Baseline: no-approximation CPU implementation optimized with hand-tuned SSE intrinsics and parallelized across 16 cores.

Figure 7. Speedup on the NVIDIA Tesla Fermi C2050. The speedup on the Tesla C2050 is greater than on the Tesla C1060 due to the presence of a hierarchy of caches on the C2050 GPU. Baseline: no-approximation CPU implementation optimized with hand-tuned SSE intrinsics and parallelized across 16 cores.
In contrast, the version with HCP results in divergent branches and varying amounts of speedup across structures, depending upon how much of the cost of the divergent branches can be amortized by the corresponding reduction in memory transactions.

Accuracy

To get the best performance on GPUs, we used single precision, as double precision on the GT200 architecture degrades performance by as much as 8-fold. Although double precision on Fermi is almost half as fast as single precision, we decided to stay with single precision, favoring performance over accuracy. To estimate the accuracy of our results, we computed the relative root mean squared error (RMSE) of the single-precision GPU implementation against the double-precision CPU implementation. The results are shown in Table 5. We also present the error due to HCP on both the CPU and the GPU. HCP, being an approximation algorithm, does introduce some error on the CPU. From the table, we note that the error introduced by the GPU itself is negligible compared to the error introduced by HCP alone on the CPU. Thus, the total error due to HCP and the GPU is almost equivalent to the error on the CPU alone. Therefore, we can safely conclude that single precision on the GPU does not jeopardize the accuracy of our computed results.

Table 5. Relative RMS (root-mean-squared) error
Structure                     Version           Relative RMSE
H helix myoglobin             CPU with HCP      0.215821
                              GPU               0.000030
                              GPU with HCP      0.236093
nucleosome core particle      CPU with HCP      0.022950
                              GPU               0.000062
                              GPU with HCP      0.022853
chaperonin GroEL, 2EU1        CPU with HCP      0.008799
                              GPU               0.000042
                              GPU with HCP      0.008816
virus capsid                  CPU with HCP      0.015376
                              GPU               0.000173
                              GPU with HCP      0.015273

Due to the small error introduced by single precision on the GPU, it may be deemed acceptable for the computation of molecular surface potential, but it may be unsatisfactory for molecular dynamics. In molecular dynamics simulations, even a minute error in one time step can have a substantial effect on the results, as the error accumulates during the course of the simulation. It is here that the superior double precision support of Fermi would come in handy.
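The paper does not spell out the normalization used for the relative RMSE; the sketch below assumes it is the RMS of the difference divided by the RMS of the double-precision CPU reference, which is one common convention.

```c
#include <math.h>

/* Relative RMSE of the single-precision GPU potentials against the
 * double-precision CPU reference (normalization is an assumption). */
double relative_rmse(const float *phi_gpu, const double *phi_cpu, int n)
{
    double num = 0.0, den = 0.0;
    for (int i = 0; i < n; ++i) {
        double diff = (double)phi_gpu[i] - phi_cpu[i];
        num += diff * diff;
        den += phi_cpu[i] * phi_cpu[i];
    }
    return sqrt(num / den);   /* == sqrt(num/n) / sqrt(den/n) */
}
```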
Conclusions

With the emergence of GPU computing, there have been many attempts at accelerating electrostatic surface potential (ESP) computations for biomolecules. In our work, we demonstrate the combined effect of using a multi-scale approximation algorithm called hierarchical charge partitioning (HCP) and mapping it onto a graphics processing unit (GPU). While mainstream molecular modeling algorithms impose an artificial partitioning of biomolecules into a grid/lattice to map them onto a GPU, HCP is significantly different in that it takes advantage of the natural partitioning in biomolecules, which facilitates a data-parallel mapping onto the GPU.

We then presented our methodology for mapping and optimizing the performance of HCP on the GPU when applied to the calculation of ESP. Despite it being a memory-bound application, we leveraged many known optimization techniques to accelerate performance. In addition, we demonstrated the effectiveness of introducing divergent branching on GPUs when it reduces the number of instruction and memory transactions.

For a fairer comparison between the CPU and GPU, we optimized the CPU implementation by using hand-tuned SSE intrinsics to handle the SIMD nature of the application on the CPU. We then demonstrated a 1,860-fold reduction in the execution time of the application compared to that of the hand-tuned SSE implementation on the 16 cores of the CPU. Furthermore, we ensured that the use of single-precision arithmetic on the GPU, combined with the HCP multi-scale approximation, did not significantly affect the accuracy of our results.

For future work, we will apply our HCP approximation algorithm to molecular dynamics (MD) simulations on the GPU, given how well it performs in the case of molecular modeling. For MD simulations, the use of double precision is mandatory, as the error incurred in each time step would accumulate over time, thereby immensely affecting the accuracy of the MD results. In addition, we plan to exploit the cache hierarchy on the NVIDIA Fermi to accelerate the memory-bound aspect of our application.

Acknowledgements

This work was supported in part by NSF grants CNS-0915861 and CNS-0916719 and an NVIDIA Professor Partnership Award. We thank Tom Scogland for helping us with the initial implementation of GPU-GEM and are also grateful to Alexey Onufriev and his group for making us familiar with the HCP approximation.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.

Author details

Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, USA. Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24061, USA. Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061, USA.

Authors' contributions

MD implemented and optimized the HCP approximation on the GPU. MD also studied the impact of divergence and memory transactions on the GPU, collected all the required results and drafted the manuscript. WF conceived the study, co-designed the GPU mapping of HCP and helped draft the manuscript. Both authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Published: 12 April 2012

References
1. Perutz M: Electrostatic effects in proteins. Science 1978, 201:1187-1191.
2. Baker NA, McCammon JA: Electrostatic Interactions. In Structural Bioinformatics. New York: John Wiley & Sons, Inc; 2002.
3. Honig B, Nicholls A: Classical electrostatics in biology and chemistry. Science 1995, 268:1144-1149.
4. Szabo G, Eisenman G, McLaughlin S, Krasne S: Ionic probes of membrane structures. In Membrane Structure and Its Biological Applications. Ann NY Acad Sci 1972, 195:273-290.
5. Sheinerman FB, Norel R, Honig B: Electrostatic aspects of protein-protein interactions. Curr Opin Struct Biol 2000, 10(2):153-159.
6. Onufriev A, Smondyrev A, Bashford D: Proton affinity changes during unidirectional proton transport in the bacteriorhodopsin photocycle. J Mol Biol 2003, 332:1183-1193.
7. Gordon JC, Fenley AT, Onufriev A: An analytical approach to computing biomolecular electrostatic potential. II. Validation and applications. The Journal of Chemical Physics 2008, 129(7):075102.
8. Ruvinsky AM, Vakser IA: Interaction cutoff effect on ruggedness of protein-protein energy landscape. Proteins: Structure, Function, and Bioinformatics 2008, 70(4):1498-1505.
9. Darden T, York D, Pedersen L: Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. The Journal of Chemical Physics 1993, 98(12):10089-10092.
10. Cai W, Deng S, Jacobs D: Extending the fast multipole method to charges inside or outside a dielectric sphere. J Comp Phys 2006, 223:846-864.
11. Anandakrishnan R, Onufriev A: An N log N approximation based on the natural organization of biomolecules for speeding up the computation of long range interactions. Journal of Computational Chemistry 2010, 31(4):691-706.
12. Buck I, Foley T, Horn D, Sugerman J, Fatahalian K, Houston M, Hanrahan P: Brook for GPUs: stream computing on graphics hardware. International Conference on Computer Graphics and Interactive Techniques. ACM, New York, NY, USA; 2004, 777-786.
13. Huang JH: Opening Keynote, NVIDIA GTC 2010. 2010 [http://livesmooth.istreamplanet.com/nvidia100921/].
14. The Top500 Supercomputer Sites. [http://www.top500.org].
15. Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA: The Landscape of Parallel Computing Research: A View from Berkeley. Tech Rep UCB/EECS-2006-183, EECS Department, University of California, Berkeley; 2006 [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html].
16. Rodrigues CI, Hardy DJ, Stone JE, Schulten K, Hwu WMW: GPU acceleration of cutoff pair potentials for molecular modeling applications. Proceedings of the 5th Conference on Computing Frontiers, CF '08. New York, NY, USA: ACM; 2008, 273-282 [http://doi.acm.org/10.1145/1366230.1366277].
17. Stone JE, Phillips JC, Freddolino PL, Hardy DJ, Trabuco LG, Schulten K: Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry 2007, 28(16):2618-2640.
18. Hardy DJ, Stone JE, Schulten K: Multilevel summation of electrostatic potentials using graphics processing units. Parallel Computing 2009, 35(3):164-177.
19. Fenley AT, Gordon JC, Onufriev A: An analytical approach to computing biomolecular electrostatic potential. I. Derivation and analysis. The Journal of Chemical Physics 2008, 129(7):075101 [http://scitation.aip.org/getabs/servlet/GetabsServlet?prog=normal&id=JCPSA6000129000007075101000001&idtype=cvips&gifs=yes].
20. NVIDIA: NVIDIA CUDA Programming Guide 3.2. 2010 [http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf].
21. Anandakrishnan R, Fenley A, Gordon J, Feng W, Onufriev A: Accelerating electrostatic surface potential calculation with multiscale approximation on graphics processing units. Journal of Molecular Graphics and Modelling 2009, submitted.
22. NVIDIA: CUDA Visual Profiler. 2009 [http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/cudaprof_1.2_readme.html].

doi:10.1186/1471-2105-13-S5-S4
Cite this article as: Daga and Feng: Multi-dimensional characterization of electrostatic surface potential computation on graphics processors. BMC Bioinformatics 2012, 13(Suppl 5):S4.

Abstract

Background: Calculating the electrostatic surface potential (ESP) of a biomolecule is critical towards understanding biomolecular function. Because of its quadratic computational complexity (as a function of the number of atoms in a molecule), there have been continual efforts to reduce its complexity either by improving the algorithm or the underlying hardware on which the calculations are performed. Results: We present the combined effect of (i) a multi-scale approximation algorithm, known as hierarchical charge partitioning (HCP), when applied to the calculation of ESP and (ii) its mapping onto a graphics processing unit (GPU). To date, most molecular modeling algorithms perform an artificial partitioning of biomolecules into a grid/ lattice on the GPU. In contrast, HCP takes advantage of the natural partitioning in biomolecules, which in turn, better facilitates its mapping onto the GPU. Specifically, we characterize the effect of known GPU optimization techniques like use of shared memory. In addition, we demonstrate how the cost of divergent branching on a GPU can be amortized across algorithms like HCP in order to deliver a massive performance boon. Conclusions: We accelerated the calculation of ESP by 25-fold solely by parallelization on the GPU. Combining GPU and HCP, resulted in a speedup of at most 1,860-fold for our largest molecular structure. The baseline for these speedups is an implementation that has been hand-tuned SSE-optimized and parallelized across 16 cores on the CPU. The use of GPU does not deteriorate the accuracy of our results. Background hierarchical charge partitioning (HCP) [11]). The Electrostatic interactions in a molecule are of utmost approximation algorithms can be parallelized on increas- importance for analyzing its structure [1-3] as well as ingly ubiquitous multi- and many-core architectures to functional activities like ligand binding [4], complex for- deliver even greater performance benefits. mation [5] and proton transport [6]. The calculation of Widespread adoption of general-purpose graphics pro- electrostatic interactions continues to be a computa- cessing units (GPUs) has made them popular as accel- tional bottleneck primarily because they are long-range erators for parallel programs [12]. The increased by nature of the potential [7]. As a consequence, effi- popularity has been assisted by (i) phenomenal comput- cient approximation algorithms have been developed to ing power, (ii) superior performance/dollar ratio, and reduce this computational complexity (e.g., the spherical (iii) compelling performance/watt ratio. For example, an cut-off method [8], the particle mesh Ewald (PME) 8-GPU cluster, costing a few thousand dollars, can method [9], the fast multipole method [10] and the simulate 52 ns/day of the JAC Benchmark as compared to 46 ns/day on the Kraken supercomputer, housed at * Correspondence: [email protected]; [email protected] Oak Ridge National Lab and which costs millions of Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, USA dollars [13]. The emergence of GPUs as an attractive Full list of author information is available at the end of the article © 2012 Daga and Feng; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. 
Daga and Feng BMC Bioinformatics 2012, 13(Suppl 5):S4 Page 2 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S4 high-performance computing platform is also evident modern 16-core CPU, without any loss in the accuracy of from the fact that three out of the top five fastest super- the results. computers on the Top500 list employ GPUs [14]. Although the use of approximation algorithms can Methods improve performance, they often lead to an increase in Electrostatics and the hierarchical charge partitioning the memory boundedness of the application. Achieving approximation optimum performance with a memory-bound applica- We use the Analytic Linearized Poisson-Boltzmann tion is challenging due to the ‘memory wall’ [15]. The (ALPB) model to perform electrostatic computations effect of the memory wall is more severe on GPUs [19]. Equation (1) computes the electrostatic potential at because of the extremely high latency for global memory a surface-point (vertex) of the molecule due to a single accesses (on the order of 600 - 800 cycles). Furthermore, point charge, q. The potential at each vertex can be for maximum performance on the GPU, execution paths computed as the summation of potentials due to all on each GPU computational unit need to be synchro- charges in the system. If there are P vertices, the total nized. However, an important class of approximation surface potential can then be found as the summation algorithms, i.e., multi-scale approximations result in of potential at each vertex. highly asynchronous execution paths due to the intro- ⎡ ⎤ in duction of a large number divergent branches, which α 1 − q 1 1+ α ε i out outside ⎣ ⎦ φ = − (1) depend upon the relative distances between interacting ε ε d r in in i 1+ α atoms. ε out To test these expectations, we present a hybrid Computing the potential at P vertices results in a time approach wherein we implement the robust multi-scale complexity of O(NP)where N is the number of atoms HCP approximation algorithm in a molecular modeling in the molecule. To reduce the time complexity, we application called GEM [7] and map it onto a GPU. We apply an approximation algorithm called hierarchical counteract the high memory boundedness of HCP by charge partitioning (HCP), which reduces the upper explicitly managing the data movement, in a way that bound of computation to O(P log N). helps us achieve significantly improved performance. In HCP [11] exploits the natural partitioning of biomole- addition, we employ the standard GPU optimization cules into constituent structural components in order to techniques, such as coalesced memory accesses and the speed-up the computation of electrostatic interactions use of shared memory, quantifying the effectiveness of with limited and controllable impact on accuracy. Biomo- each optimization in our application. HCP results in supreme performance on the GPU despite the introduc- lecules can be systematically partitioned into multiple tion of divergent branches. This is attributed to the molecular complexes, which consist of multiple polymer reduction in memory transactions that compensates for chains or subunits and which in turn are made up of divergent branching. multiple amino acid or nucleotide groups, as illustrated in Recently, several molecular modeling applications have Figure 1. Atoms represent the lowest level in the hierar- chy while the highest level depends on the problem. used the GPU to speed-up electrostatic computations. Briefly, HCP works as follows. 
The charge distribution of Rodrigues et al. [16] and Stone et al. [17] demonstrate components, other than at the atomic level, is approxi- that the estimation of electrostatic interactions can be mated by a small set of point charges. The electrostatic accelerated by the use of spherical cut-off method and effect of distant components is calculated using the smal- the GPU. In [18], Hardy et al. used a multi-scale summa- ler set of point charges, while the full set of atomic tion method on the GPU. Each of the aforementioned charges is used for computing electrostatic interactions implementations artificially maps the n atoms of a mole- within nearby components. The distribution of charges cule onto a m-point lattice grid and then applies their for each component, used in the computation, varies respective approximation algorithm. By doing so, they depending on distance from the point in question: the reduce the time complexity of the computation from O farther away the component, the fewer charges are used (nn)to O(nm). In contrast, we use HCP, which performs to represent the component. The actual speedup from approximations based on the natural partitioning of bio- using HCP depends on the specific hierarchical organiza- molecules. The advantage of using the natural partition- tion of the biomolecular structure as that would govern ing is that even with the movement of atoms during the number of memory accesses, computations and molecular dynamics simulations, the hierarchical nature divergent branches on the GPU. Under conditions con- is preserved, whereas with the lattice, atoms may move in sistent with the hierarchical organization of realistic bio- and out of the lattice during the simulation. Our imple- mentation realizes a maximum of 1,860-fold speedup molecular structures, the top-down HCP algorithm over a hand-tuned SSE optimized implementation on a (Figure 2) scales as O(N log N), where N is the number of Daga and Feng BMC Bioinformatics 2012, 13(Suppl 5):S4 Page 3 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S4 Figure 1 Illustration of the hierarchical charge partitioning (HCP) of biomolecular structures. In this illustration a biomolecular structure is partitioned into multiple hierarchical levels components based on the natural organization of biomolecules. The charge distribution of distant components are approximated by a small number of charges, while nearby atoms are treated exactly. atoms in the structure. For large structures, the HCP can GPU very suitable for applications that exhibit data par- be several orders of magnitude faster than the exact O allelism, i.e., the operation on one data element is inde- (N ) all-atom computation. A detailed description of the pendent of the operations on other data elements. HCP algorithm can be found in Anandakrishnan et. al. Therefore, it is well suited for molecular modeling [11]. where the potential at one vertex can be computed independently of all others. GPU architecture and programming interface On NVIDIA GPUs, threads are organized into groups For this study, we have used state-of-art NVIDIA GPUs of 32,referredtoasa warp.Whenthreads within a based on the Compute Unified Device Architecture or warp follow different execution paths, such as when CUDA framework. CUDA is a framework developed by encountering a conditional, a divergent branch takes NVIDIA, which facilitates the implementation of gen- place. Execution of these group of threads is serialized, eral-purpose applications on GPUs. 
Below is a brief thereby, affecting performance. On a GPU, computa- tions are much faster compared to a typical CPU, but description of the NVIDIA GPU hardware architecture and the CUDA programming interface. memory accesses and divergent branching instructions NVIDIA GPUs consist of 240-512 execution units, are slower. The effect of slower memory access and which are grouped into 16 and 30 streaming multipro- divergent branching can be mitigated by initiating thou- cessors (SMs) on Fermi and GT200 architectures, sands of threads on a GPU, such that when one of the respectively. An overview of these architectures is threads is waiting on a memory access, other threads shown in Figure 3. Multiple threads on a GPU execute can perform meaningful computations. the same instruction, resulting in asingleinstruction, Every GPU operates in a memory space known as glo- multiple thread (SIMT) architecture. This is what makes bal memory. Data which needs to be operated on by the Daga and Feng BMC Bioinformatics 2012, 13(Suppl 5):S4 Page 4 of 12 http://www.biomedcentral.com/1471-2105/13/S5/S4 Figure 2 Illustration of the HCP multi-scale algorithm. Predefined threshold distances (h , h , h ) are used to determine the level of 1 2 3 approximation used in the HCP approximation. This top-down algorithm results in ~ NLogN scaling compared to a ~ N scaling without HCP. GPU, needs to be first transferred to the GPU. This pro- threads is executed on a SM and share data using the cess of transferring data to GPU memory is performed shared memory present. over the PCI-e bus, making it an extremely slow pro- cess. Therefore, memory transfers should be kept to a Mapping HCP onto GPU minimum to obtain optimum performance. Also, acces- The problem of computing molecular surface potential sing data from the GPU global memory entails the cost is inherently data parallel in nature, i.e., the potential at of 400-600 cycles and hence, on-chip memory should be one point on the surface can be computed indepen- used to reduce global memory traffic. On the GT200 dently from the computation of potential at some other architecture, each SM contains a high-speed, 16 KB, point. This works to our advantage as such applications scratch-pad memory, known as the shared memory. map very well onto the GPU. We begin with offloading Shared memory enables extensive re-use of data, all the necessary data (coordinates of vertices and atoms thereby, reducing off-chip traffic. Whereas on the latest and the approximated point charges) to the GPU global Fermi architecture, each SM contains 64 KB of on-chip memory. To ensure efficient global memory accesses memory, which can be either be configured as 16 KB of patterns, we flattened the data structures. By flattening shared memoryand48KBofL1cacheor viceversa. of data structures we mean that all the arrays of struc- Each SM also consists of a L2 cache of size 128 KB. The tures were transformed into arrays of primitives so that hierarchy of caches on the Fermi architecture allows for the threads in a half warp (16 threads) access data from more efficient global memory access patterns. contiguous memory locations [20,21]. The GPU kernel CUDA provides a C/C++ language extension with is then executed, wherein each thread is assigned the application programming interfaces (APIs). A CUDA task of computing the electrostatic potential at one ver- program is executed by a kernel, which is effectively a tex. At this point the required amount of shared mem- function call to the GPU, launched from the CPU. 
The GPU kernel is then executed, with each thread assigned the task of computing the electrostatic potential at one vertex. At kernel launch, the required amount of shared memory, i.e., the number of threads in a block times the size of one vertex's coordinates, is allocated on each streaming multiprocessor (SM) of the GPU. The kernel is launched as many times as required, until all vertices are exhausted, with implicit GPU synchronization between successive kernel launches. On the GPU side, each kernel thread copies the coordinates of its assigned vertex into shared memory, which reduces the number of global memory loads, as explained in the Results section. The limited amount of per-SM shared memory does not allow us to also stage the coordinates of the constituent components of the biomolecule, so the coordinates of complexes, strands, residues, and atoms remain in global memory. The HCP algorithm is then applied to compute the electrostatic potential, and the result is stored in global memory. All threads perform this computation in parallel; once they finish, the computed potential at each vertex is transferred back to CPU memory, where a reduce (sum) operation yields the total molecular surface potential.
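A minimal sketch of this kernel organization, under the assumption of flattened coordinate arrays as above, is given below. Here calc_potential() is only a placeholder for the ALPB evaluation of Equation (1), and every identifier is illustrative rather than taken from the actual GEM code.

    #include <cuda_runtime.h>
    #include <math.h>

    __device__ float calc_potential(float dx, float dy, float dz, float q)
    {
        /* Placeholder: a bare Coulomb-like term stands in for the ALPB form. */
        float r = sqrtf(dx * dx + dy * dy + dz * dz) + 1e-6f;
        return q / r;
    }

    __global__ void esp_kernel(const float *vx, const float *vy, const float *vz,
                               int nVertices,
                               const float *cx, const float *cy, const float *cz,
                               const float *cq, int nCharges,
                               float *phi)
    {
        /* Dynamic shared memory: one (x, y, z) triple per thread in the block,
         * matching the allocation described in the text. */
        extern __shared__ float s[];
        float *sx = s, *sy = s + blockDim.x, *sz = s + 2 * blockDim.x;

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= nVertices) return;

        /* Each thread copies its own vertex coordinates into shared memory once,
         * instead of re-reading them from global memory for every distance test. */
        sx[threadIdx.x] = vx[tid];
        sy[threadIdx.x] = vy[tid];
        sz[threadIdx.x] = vz[tid];

        float sum = 0.0f;
        for (int j = 0; j < nCharges; ++j)   /* charges: atoms or approximated components */
            sum += calc_potential(sx[threadIdx.x] - cx[j],
                                  sy[threadIdx.x] - cy[j],
                                  sz[threadIdx.x] - cz[j], cq[j]);

        phi[tid] = sum;   /* per-vertex potential; the sum over vertices is done on the CPU */
    }

Such a kernel would be launched with a dynamic shared-memory size of 3 * threadsPerBlock * sizeof(float), i.e., the number of threads in a block times the size of one vertex's coordinates, and relaunched until all vertices have been processed.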
According to the algorithm, evaluating the distance between a vertex and the molecular components requires each thread to access coordinates from global memory. The potential calculation at each vertex therefore necessitates multiple global memory accesses, which makes HCP memory-bound on the GPU. HCP also introduces a significant number of divergent branches on the GPU: for some threads in a warp it may be possible to apply the HCP approximation, while for others it may not, so the two groups of threads diverge and follow their respective paths, resulting in a divergent branch. In the Results section, we show how the associated cost of divergent branching in HCP on the GPU can be amortized to deliver a performance boost.

Test setup
To illustrate the scalability of our application, we used four structures of widely varying sizes; their characteristics are presented in Table 1. The GPU implementation was tested on the present generation of NVIDIA GPUs.

Table 1 Characteristics of input structures
Structure                          #Vertices   #Complexes   #Strands   #Residues   #Atoms
H helix myoglobin, 1MBO                5,884            1          1          24       382
nucleosome core particle, 1KX5       258,797            1         10       1,268    25,086
chaperonin GroEL, 2EU1               898,584            1         14       7,336   109,802
virus capsid, 1A6C                   593,615            1         60      30,780   476,040

The host machine consists of an E8200 Intel quad-core running at 2.33 GHz with 4 GB of DDR2 SDRAM. The operating system on the host is a 64-bit Ubuntu 9.04 distribution running the 2.6.28-16 generic Linux kernel. Programming and access to the GPU were provided by the CUDA 3.1 toolkit and SDK with NVIDIA driver version 256.40. To keep the measurements accurate, all processes requiring a graphical user interface were disabled to limit resource sharing of the GPU. We ran our tests on an NVIDIA Tesla C1060 graphics card, based on the GT200 GPU, and on an NVIDIA Fermi Tesla C2050 graphics card; an overview of both GPUs is presented in Table 2.

Table 2 Overview of GPUs used
                                            Tesla C1060         Fermi Tesla C2050
Streaming processor cores                   240                 448
Streaming multiprocessors (SMs)             30                  16
Memory bus type                             GDDR3               GDDR5
Device memory size                          4096 MB             3072 MB
Shared memory (per SM)                      16 KB               Configurable 48 KB or 16 KB
L1 cache (per SM)                           None                Configurable 16 KB or 48 KB
L2 cache                                    None                768 KB
Double-precision floating-point capability  30 FMA ops/clock    256 FMA ops/clock
Single-precision floating-point capability  240 FMA ops/clock   512 FMA ops/clock
Special function units (per SM)             2                   4
Compute capability                          1.3                 2.0

Results and discussion
In this section, we present an analysis of (i) the impact of using shared memory, (ii) the impact of divergent branching, (iii) the speedups realized by our implementation, and (iv) the accuracy of our results. On the CPU, timing information was gathered by placing time-checking calls around the computational kernel, excluding the I/O required for writing the results. On the GPU, the execution time was measured using the cudaEventRecord function. For a fair comparison, the time for offloading the data onto GPU global memory and for storing the results back on the CPU was taken into account along with the kernel execution time. Single precision was used on both platforms. All numbers presented are averages of 10 runs on each platform. For HCP, the first-level threshold was set to 10 Å and the second-level threshold was fixed at 70 Å.
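The GPU timing harness looked roughly like the sketch below. The stand-in kernel and the function name timed_run_ms are illustrative; only the event-based measurement pattern, which deliberately includes both transfers, mirrors our setup.

    #include <cuda_runtime.h>

    __global__ void dummy_kernel(const float *in, float *out, int n)
    {
        /* Trivial stand-in for the ESP kernel; only the timing pattern matters here. */
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    float timed_run_ms(const float *h_in, float *h_out, int n)
    {
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in,  n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   /* offload   */
        dummy_kernel<<<(n + 127) / 128, 128>>>(d_in, d_out, n);              /* compute   */
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost); /* read back */
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);   /* elapsed time in milliseconds */

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_in);
        cudaFree(d_out);
        return ms;
    }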
Impact of using shared memory
At every approximation level, HCP reuses the vertex coordinates to compute the distance between the vertex and the molecular components. In the worst case, when no approximation can be applied, the same data is therefore accessed four times from global memory, owing to the four levels in the molecular hierarchy. We used shared memory to reduce these global memory accesses. The percentage reduction in the number of global memory loads due to the use of shared memory on the GT200 architecture, with and without the HCP approximation, is shown in Table 3. The baseline for each column is the respective implementation (without HCP, and with HCP) without the use of shared memory. These numbers were obtained from the CUDA Visual Profiler provided by NVIDIA [22].

Table 3 Percentage reduction in the number of global memory loads
Structure                   Without HCP   With HCP
H helix myoglobin           50%           32%
nucleosome core particle    50%           62%
chaperonin GroEL            50%           84%
virus capsid                50%           96%

From the table, we note that global memory loads are reduced by 50% for all structures when the HCP approximation is not used, whereas with HCP the size of the reduction varies from structure to structure. This can be reasoned as follows. When no approximation is applied, the coordinates of the vertices and of all atoms are accessed from global memory, which requires cycling through the residue groups. Without shared memory, the vertex coordinates are therefore loaded twice, once for the residue and once for the atom; with shared memory, they are loaded only once, namely when they are copied into shared memory, resulting in a 50% reduction in global memory loads.

With HCP, in contrast, the number of times the vertex coordinates are loaded from global memory depends on the structure, because the effective number of computations to be performed differs for each structure. For example, for a structure at the first level of approximation and without shared memory, the vertex coordinates would be loaded three times from global memory: (i) to compute the distance to the complex, (ii) to compute the distance to the strand, and (iii) to compute the distance to the residue. With shared memory, they would be accessed just once. Similarly, for a structure where no approximation can be applied, the vertex would be accessed four times without shared memory. The table therefore suggests that the fewest components could be approximated for the virus capsid, and hence it shows the maximum percentage reduction. Overall, the use of shared memory resulted in a drastic reduction in the number of global loads and provided about a 2.7-fold speedup to our application.

Impact of divergent branching
Divergent branching on a GPU occurs when the threads of a warp follow different execution paths. In our GPU implementation, each thread takes charge of one vertex, and, as shown in Figure 4, it is possible for threads within a warp to follow different execution paths, thereby introducing a divergent branch. Here we quantify the cost of the divergent branches that are introduced. For ease of exposition, we limit our analysis to one level of HCP, but it can be extended to all levels.

Figure 4 Divergent branching due to the HCP approximation. This illustration shows that our GPU implementation causes divergent branches to occur, because for every thread there are two probable execution paths. A divergent branch occurs if threads within a warp take different paths.

As the execution paths are serialized, the time taken for the execution of a divergent branch, denoted by t_{divBranch}, can be characterized as

    t_{divBranch} = t_1 + t_2,                                        (2)

where t_i denotes the time taken for execution path i, as shown in Figure 4. From the figure, it can be noted that both execution paths perform the similar task of calculating the potential. Path 1 calls the function calcPotential() just once, whereas in Path 2, calcPotential() is called from within a loop that iterates over all atoms of that residue. Hence, the time for a divergent branch is directly proportional to the number of atoms in the molecule. Since t_2 >> t_1,

    t_{divBranch} ≈ t_2 ≈ #Atoms × T_{calcPotential},  and therefore  t_{divBranch} ∝ #Atoms.        (3)

The total time for all the divergent branches in the system is then

    T_{divBranch} = #DivBranches × t_{divBranch},  and therefore  T_{divBranch} ∝ #Atoms.            (4)
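The two execution paths behind equations (2)-(4) can be sketched as follows for a single (residue) level of HCP, reusing the calc_potential() stand-in from the earlier kernel sketch and compiled alongside it; the flat residue and atom arrays and all identifiers are illustrative assumptions, not the actual HCP code.

    /* Stand-in from the earlier sketch; shown here as a prototype only. */
    __device__ float calc_potential(float dx, float dy, float dz, float q);

    __device__ float vertex_potential_one_level(
        float vx, float vy, float vz,
        const float *rx, const float *ry, const float *rz, const float *rq,
        const int *atomStart, const int *atomCount, int nResidues,
        const float *ax, const float *ay, const float *az, const float *aq,
        float threshold)
    {
        float sum = 0.0f;
        for (int r = 0; r < nResidues; ++r) {
            float dx = vx - rx[r], dy = vy - ry[r], dz = vz - rz[r];
            float dist = sqrtf(dx * dx + dy * dy + dz * dz);

            if (dist > threshold) {
                /* Path 1: far enough away, so a single call with the residue's
                 * approximated point charge suffices. */
                sum += calc_potential(dx, dy, dz, rq[r]);
            } else {
                /* Path 2: too close to approximate, so iterate over every atom of
                 * the residue; this path costs roughly #Atoms x T_calcPotential. */
                for (int a = atomStart[r]; a < atomStart[r] + atomCount[r]; ++a)
                    sum += calc_potential(vx - ax[a], vy - ay[a], vz - az[a], aq[a]);
            }
        }
        return sum;
    }

Threads of the same warp whose vertices fall on opposite sides of the threshold take different branches of the if statement and are serialized, which is exactly the divergence quantified above.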
From (4), the cost of the divergent branches should be greatest for the molecule with the most atoms, which would suggest that large structures benefit least from HCP on the GPU. Figure 5 shows that this is not what happens. In the figure, we present the speedups achieved due to HCP on the CPU as well as on both GPUs, with the number of atoms increasing from left to right. As all the speedups are positive, we can safely infer that the GPU is indeed beneficial for HCP despite the introduction of divergent branches. We also note that the speedup achieved due to HCP on the GPUs increases with the number of atoms in the structure, contrary to what (4) alone would predict. Hence, there must be some aspect that compensates for the cost of the introduced divergent branches.

As HCP is a memory-bound application, the number of memory transactions dominates the execution time. Looking back at the algorithm (in the Methods section), we observe that HCP does reduce the number of memory transactions. It does so by applying the approximation, which reduces the fetching of coordinates from global memory: only the coordinates of the higher-level component are then required, and this compensates for the cost of the divergent branches. The execution time of the entire application with HCP can be modeled as

    T_{WithHCP} = T_{WithoutHCP} - T_{Mem} + T_{divBranch},           (5)

where T_{Mem} denotes the global memory access time saved by the approximation. HCP is thus guaranteed to improve performance on the GPU if the time gained from the reduced memory transactions exceeds the cost of the divergent branches. However, computing memory access times on the GPU is extremely difficult, because one has no knowledge of how warps are scheduled, which is essential since it is the warps that send access requests to the memory controllers. Hence, no direct method of measuring global memory access times exists. We therefore used an indirect approach and measured the reduction in memory transactions as well as the increase in divergent branches for our application.

Figure 5 Speedup due to the HCP approximation. Even with the occurrence of divergent branches, the speedup due to the HCP approximation is positive on the GPU, which points to some aspect that amortizes the cost of introducing these divergent branches. The speedup is maximum for the largest structure, i.e., the virus capsid. Baseline: the corresponding implementation on each platform without the HCP approximation.
These numbers were taken from the CUDA Visual Profiler provided by NVIDIA [22] and are presented in Table 4. The memory transactions in the table are the sum of the 32-, 64-, and 128-byte load and store transactions per SM; likewise, the number of divergent branches is the number introduced on one SM. From the table, it is clear that the reduction in memory transactions is orders of magnitude greater than the increase in divergent branches. We also note that the number of memory transactions eliminated per divergent branch introduced is greatest for the capsid, which is why the combination of HCP and the GPU is most effective for the capsid. Figures 6 and 7 corroborate this, and hence we can attest that it is the reduction in memory transactions that makes the GPU favorable for HCP. This shows that even an algorithm with divergent branching can benefit from the GPU, provided some aspect amortizes the cost of the divergent branches introduced.

Table 4 Impact of the HCP approximation on memory transactions and divergent branches
Structure                   Decrease in # of mem. transactions   Increase in # of divergent branches
H helix myoglobin           95,800                               34
nucleosome core particle    119,507,436                          4,635
chaperonin GroEL            1,831,793,578                        25,730
virus capsid                5,321,785,506                        22,651

Speedup
Figures 6 and 7 present the speedups achieved by our implementation on the NVIDIA Tesla C1060 and the NVIDIA Fermi Tesla C2050 GPUs, respectively. Both figures present speedup over the CPU implementation optimized with hand-tuned SSE intrinsics and parallelized across 16 cores, without the use of any approximation algorithm. Speedups achieved by the GPU alone as well as by the combination of GPU and HCP are presented for all four structures.

From both figures, we note that the speedup due to the GPU alone is almost constant for the three structures other than Mb.Hhelix. Mb.Hhelix is an extremely small structure that does not generate enough GPU threads for the computation of its molecular surface potential, leaving the GPU under-utilized. This effect is most prominent on the Fermi Tesla C2050, where it actually results in a slowdown. For the other structures, enough threads are available, and similar speedups are achieved in both figures: around 11-fold on the Tesla C1060 and around 25-fold on the Tesla C2050. The increased speedup on the C2050 can be attributed to several architectural differences between the Fermi and GT200 GPUs, such as support for concurrent kernel execution, ECC, and fewer but wider SMs. However, the architectural feature that we feel has the most impact for this algorithm is Fermi's cache hierarchy, which allows greater exploitation of data locality. With no approximation, all atoms need to be accessed sequentially, so the caches play an important role, and the Fermi Tesla C2050 proves more effective.

Figure 6 Speedup on NVIDIA Tesla C1060. The speedup due to the GPU alone is almost constant because, once the threshold for the number of threads that can be launched is met, there is no further increase in speedup. The speedup due to HCP+GPU increases with the size of the structure owing to the O(N log N) scaling of the HCP approximation. Baseline: no-approximation CPU implementation optimized with hand-tuned SSE intrinsics and parallelized across 16 cores.

Figure 7 Speedup on NVIDIA Tesla Fermi C2050. The speedup on the Tesla Fermi C2050 is greater than on the Tesla C1060 due to the presence of a hierarchy of caches on the C2050 GPU. Baseline: no-approximation CPU implementation optimized with hand-tuned SSE intrinsics and parallelized across 16 cores.
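The amortization argument can be checked directly from Table 4. The snippet below, which only reuses the numbers reported in that table, computes the memory transactions saved per divergent branch introduced; the resulting ratios grow in the same order as the observed HCP+GPU speedups, from the myoglobin H helix up to the virus capsid.

    #include <stdio.h>

    int main(void)
    {
        /* Values taken from Table 4 (per-SM counts reported by the profiler). */
        const char   *name[]     = { "H helix myoglobin", "nucleosome core particle",
                                     "chaperonin GroEL", "virus capsid" };
        const double  savedMem[] = { 95800.0, 119507436.0, 1831793578.0, 5321785506.0 };
        const double  divBr[]    = { 34.0, 4635.0, 25730.0, 22651.0 };

        for (int i = 0; i < 4; ++i)
            printf("%-26s %12.0f transactions saved per divergent branch\n",
                   name[i], savedMem[i] / divBr[i]);
        return 0;
    }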
As explained in the previous section, the application speedup due to the combination of GPU and HCP increases with the number of memory transactions eliminated per divergent branch introduced. From Table 4, the reduction in memory transactions is greatest for the virus capsid, and hence it attains the maximum speedup; the next-highest reduction is for 2EU1, which accordingly achieves the next-highest speedup, and so on. Our application achieves up to a 1,860-fold speedup with HCP on the Tesla C1060 for the capsid, while the corresponding speedup on the Fermi Tesla C2050 is approximately 1,600-fold. The actual execution time of our implementation on both GPUs is under one second.

The speedup achieved with HCP on the Tesla C2050 is lower than that on the Tesla C1060 because, with HCP, the algorithm can no longer take full advantage of the caches present on Fermi. With HCP, not all memory requests are sequential, since the coordinates of both atoms and higher-level components are required, making the caches less potent than before. The speedups of the version without HCP are almost identical on the two GPUs, because that version introduces no divergent branches, whereas the version with HCP introduces divergent branches and yields varying speedups across structures, depending on how much of the cost of those branches can be amortized by the corresponding reduction in memory transactions.

Accuracy
To get the best performance on the GPUs, we used single precision, since double precision on the GT200 architecture degrades performance by as much as 8-fold. Although double precision on Fermi runs at only about half the speed of single precision, we still chose single precision, favoring performance over accuracy. To estimate the accuracy of our results, we computed the relative root-mean-squared error (RMSE) of the single-precision GPU implementation against the double-precision CPU implementation; the results are shown in Table 5. We also present the error due to HCP on both the CPU and the GPU. HCP, being an approximation algorithm, introduces some error even on the CPU. From the table, we note that the error introduced by the GPU itself is negligible compared to the error introduced by HCP alone on the CPU; the total error due to HCP and the GPU together is almost equivalent to the error on the CPU. Therefore, we can safely conclude that single precision on the GPU does not jeopardize the accuracy of our computed results.
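The error metric was computed along the lines of the sketch below, which compares the single-precision GPU potentials vertex by vertex against the double-precision CPU reference. The paper does not spell out the exact normalization, so the division by the reference norm here is an assumption, as are the function and variable names.

    #include <math.h>
    #include <stddef.h>

    /* Relative RMSE of single-precision GPU potentials vs. a double-precision
     * CPU reference, one value per surface vertex. */
    double relative_rmse(const float *phi_gpu, const double *phi_cpu_ref, size_t n)
    {
        double num = 0.0, den = 0.0;
        for (size_t i = 0; i < n; ++i) {
            double diff = (double)phi_gpu[i] - phi_cpu_ref[i];
            num += diff * diff;                       /* accumulated squared error */
            den += phi_cpu_ref[i] * phi_cpu_ref[i];   /* reference magnitude       */
        }
        return sqrt(num / den);
    }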
Table 5 Relative RMS (root-mean-squared) error
Structure                   Version         Relative RMSE
H helix myoglobin           CPU with HCP    0.215821
                            GPU             0.000030
                            GPU with HCP    0.236093
nucleosome core particle    CPU with HCP    0.022950
                            GPU             0.000062
                            GPU with HCP    0.022853
chaperonin GroEL, 2EU1      CPU with HCP    0.008799
                            GPU             0.000042
                            GPU with HCP    0.008816
virus capsid                CPU with HCP    0.015376
                            GPU             0.000173
                            GPU with HCP    0.015273

Given the paltry error introduced by single precision on the GPU, it may be deemed acceptable for the computation of the molecular surface potential, but it may be unsatisfactory for molecular dynamics. In molecular dynamics simulations, even a minute error in one time step can have a substantial effect on the results, because the error accumulates during the course of the simulation. It is here that the superior double-precision support of Fermi would come in handy.

Conclusions
With the emergence of GPU computing, there have been many attempts at accelerating electrostatic surface potential (ESP) computations for biomolecules. In our work, we demonstrate the combined effect of using a multi-scale approximation algorithm, called hierarchical charge partitioning (HCP), and mapping it onto a graphics processing unit (GPU). While mainstream molecular modeling algorithms impose an artificial partitioning of biomolecules into a grid or lattice to map them onto a GPU, HCP is significantly different in that it takes advantage of the natural partitioning in biomolecules, which facilitates a data-parallel mapping onto the GPU.

We then presented our methodology for mapping HCP onto the GPU and optimizing its performance when applied to the calculation of ESP. Despite the application being memory-bound, we leveraged many known optimization techniques to accelerate performance. In addition, we demonstrated that introducing divergent branching on a GPU can be effective when it is accompanied by a reduction in the number of instruction and memory transactions.

For a fairer comparison between the CPU and the GPU, we optimized the CPU implementation with hand-tuned SSE intrinsics to exploit the SIMD nature of the application on the CPU. We then demonstrated a 1,860-fold reduction in the execution time of the application compared to that of the hand-tuned SSE implementation on 16 CPU cores. Furthermore, we ensured that the use of single-precision arithmetic on the GPU, combined with the HCP multi-scale approximation, did not significantly affect the accuracy of our results.

For future work, we will apply our HCP approximation algorithm to molecular dynamics (MD) simulations on the GPU, given how well it performs for molecular modeling. For MD simulations, the use of double precision is mandatory, as the error incurred in each time step accumulates over time and would otherwise immensely affect the accuracy of the MD results. In addition, we plan to exploit the cache hierarchy of the NVIDIA Fermi to accelerate the memory-bound aspects of our application.

Acknowledgements
This work was supported in part by NSF grants CNS-0915861 and CNS-0916719 and an NVIDIA Professor Partnership Award. We thank Tom Scogland for helping us with the initial implementation of GPU-GEM and are also grateful to Alexey Onufriev and his group for making us familiar with the HCP approximation.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.

Author details
Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, USA. Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA 24061, USA. Virginia Bioinformatics Institute, Virginia Tech, Blacksburg, VA 24061, USA.

Authors' contributions
MD implemented and optimized the HCP approximation on the GPU. MD also studied the impact of divergence and memory transactions on the GPU, collected all the required results, and drafted the manuscript. WF conceived the study, co-designed the GPU mapping of HCP, and helped draft the manuscript. Both authors read and approved the final manuscript.

Competing interests
The authors declare that they have no competing interests.

Published: 12 April 2012
References
1. Perutz M: Electrostatic effects in proteins. Science 1978, 201:1187-1191.
2. Baker NA, McCammon JA: Electrostatic Interactions. In Structural Bioinformatics. New York: John Wiley & Sons, Inc; 2002.
3. Honig B, Nicholls A: Classical Electrostatics in Biology and Chemistry. Science 1995, 268:1144-1149.
4. Szabo G, Eisenman G, McLaughlin S, Krasne S: Ionic probes of membrane structures. In Membrane Structure and Its Biological Applications. Ann NY Acad Sci 1972, 195:273-290.
5. Sheinerman FB, Norel R, Honig B: Electrostatic aspects of protein-protein interactions. Curr Opin Struct Biol 2000, 10(2):153-159.
6. Onufriev A, Smondyrev A, Bashford D: Proton affinity changes during unidirectional proton transport in the bacteriorhodopsin photocycle. J Mol Biol 2003, 332:1183-1193.
7. Gordon JC, Fenley AT, Onufriev A: An analytical approach to computing biomolecular electrostatic potential. II. Validation and applications. The Journal of Chemical Physics 2008, 129(7):075102.
8. Ruvinsky AM, Vakser IA: Interaction cutoff effect on ruggedness of protein-protein energy landscape. Proteins: Structure, Function, and Bioinformatics 2008, 70(4):1498-1505.
9. Darden T, York D, Pedersen L: Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. The Journal of Chemical Physics 1993, 98(12):10089-10092.
10. Cai W, Deng S, Jacobs D: Extending the fast multipole method to charges inside or outside a dielectric sphere. J Comp Phys 2006, 223:846-864.
11. Anandakrishnan R, Onufriev A: An N log N approximation based on the natural organization of biomolecules for speeding up the computation of long range interactions. Journal of Computational Chemistry 2010, 31(4):691-706.
12. Buck I, Foley T, Horn D, Sugerman J, Fatahalian K, Houston M, Hanrahan P: Brook for GPUs: stream computing on graphics hardware. In International Conference on Computer Graphics and Interactive Techniques. New York, NY, USA: ACM; 2004, 777-786.
13. Huang JH: Opening Keynote, NVIDIA GTC 2010. 2010 [http://livesmooth.istreamplanet.com/nvidia100921/].
14. The Top500 Supercomputer Sites. [http://www.top500.org].
15. Asanovic K, Bodik R, Catanzaro BC, Gebis JJ, Husbands P, Keutzer K, Patterson DA, Plishker WL, Shalf J, Williams SW, Yelick KA: The Landscape of Parallel Computing Research: A View from Berkeley. Tech Rep UCB/EECS-2006-183, EECS Department, University of California, Berkeley; 2006 [http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html].
16. Rodrigues CI, Hardy DJ, Stone JE, Schulten K, Hwu WMW: GPU acceleration of cutoff pair potentials for molecular modeling applications. In Proceedings of the 5th Conference on Computing Frontiers (CF '08). New York, NY, USA: ACM; 2008, 273-282 [http://doi.acm.org/10.1145/1366230.1366277].
17. Stone JE, Phillips JC, Freddolino PL, Hardy DJ, Trabuco LG, Schulten K: Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry 2007, 28(16):2618-2640.
18. Hardy DJ, Stone JE, Schulten K: Multilevel summation of electrostatic potentials using graphics processing units. Parallel Computing 2009, 35(3):164-177.
19. Fenley AT, Gordon JC, Onufriev A: An analytical approach to computing biomolecular electrostatic potential. I. Derivation and analysis. The Journal of Chemical Physics 2008, 129(7):075101.
20. NVIDIA: NVIDIA CUDA Programming Guide 3.2. 2010 [http://developer.download.nvidia.com/compute/cuda/3_2/toolkit/docs/CUDA_C_Programming_Guide.pdf].
21. Anandakrishnan R, Fenley A, Gordon J, Feng W, Onufriev A: Accelerating electrostatic surface potential calculation with multiscale approximation on graphical processing units. Journal of Molecular Graphics and Modelling 2009, submitted.
22. NVIDIA: CUDA Visual Profiler. 2009 [http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/cudaprof_1.2_readme.html].

doi:10.1186/1471-2105-13-S5-S4
Cite this article as: Daga and Feng: Multi-dimensional characterization of electrostatic surface potential computation on graphics processors. BMC Bioinformatics 2012, 13(Suppl 5):S4.
