Creating protein models from electron-density maps using particle-filtering methods

Bioinformatics, Vol. 23 no. 21 2007, pages 2851–2858
doi:10.1093/bioinformatics/btm480
Original Paper: Structural bioinformatics

Frank DiMaio*, Dmitry A. Kondrashov, Eduard Bitto, Ameet Soni, Craig A. Bingman, George N. Phillips, Jr and Jude W. Shavlik
Department of Computer Sciences, Department of Biostatistics and Medical Informatics, Department of Biochemistry, and Center for Eukaryotic Structural Genomics, University of Wisconsin, Madison, WI 53706, USA
*To whom correspondence should be addressed.

Received and revised on August 31, 2007; accepted on September 20, 2007
Advance Access publication October 12, 2007
Associate Editor: Burkhard Rost

ABSTRACT

Motivation: One bottleneck in high-throughput protein crystallography is interpreting an electron-density map, that is, fitting a molecular model to the 3D picture crystallography produces. Previously, we developed ACMI (Automatic Crystallographic Map Interpreter), an algorithm that uses a probabilistic model to infer an accurate protein backbone layout. Here, we use a sampling method known as particle filtering to produce a set of all-atom protein models. We use the output of ACMI to guide the particle filter's sampling, producing an accurate, physically feasible set of structures.

Results: We test our algorithm on 10 poor-quality experimental density maps. We show that particle filtering produces accurate all-atom models, resulting in fewer chains, lower sidechain RMS error and reduced R factor, compared to simply placing the best-matching sidechains on ACMI's trace. We show that our approach produces a more accurate model than three leading methods (TEXTAL, RESOLVE and ARP/wARP) in terms of main-chain completeness, sidechain identification and crystallographic R factor.

Availability: Source code and experimental density maps available at http://ftp.cs.wisc.edu/machine-learning/shavlik-group/programs/acmi/

Contact: [email protected]

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
1 INTRODUCTION

Knowledge of the spatial arrangement of constituent atoms in complex biomolecules, such as proteins, is vital for understanding their function. X-ray crystallography is the primary technique for determination of atomic positions, or the structure, of biomolecules. A beam of X-rays is diffracted by a crystal, resulting in a set of reflections that contain information about the molecular structure. This information can be interpreted to produce a 3D image of the macromolecule, which is usually represented by an electron-density map. Interpretation of these maps requires locating the atoms in complex 3D images. This is a difficult, time-consuming process that may require weeks or months of an expert crystallographer's time.

Our previous work (DiMaio et al., 2006) developed the automatic interpretation tool ACMI (Automatic Crystallographic Map Interpreter). ACMI employs probabilistic inference to compute a probability distribution of the coordinates of each amino acid, given the electron-density map. However, ACMI makes several simplifications, such as reducing each amino acid to a single atom and confining the locations to a coarse grid. In this work we introduce the use of a statistical sampling method called particle filtering (PF) (Doucet et al., 2000) to construct all-atom protein models, by stepwise extension of a set of incomplete models drawn from a distribution computed by ACMI. This results in a set of probability-weighted all-atom protein models. The method interprets the density map by generating a number of distinct protein conformations consistent with the data. We compare the single model that best matches the density map (without knowing the true solution) with the output of existing automated methods, on multiple sets of crystallographic data that required considerable human effort to solve. We also show that modeling the data with a set of structures, obtained from several particle-filtering runs, results in a better fit than using one structure from a single particle-filtering run. Particle filtering enables the automated building of detailed atomic models for challenging protein crystal data with a more realistic representation of conformational variation in the crystal.

2 PROBLEM OVERVIEW AND RELATED WORK

In recent years, considerable investment into structural genomics (i.e. high-throughput determination of protein structures) has yielded a wealth of new data (Berman and Westbrook, 2004; Chandonia and Brenner, 2006). The demand for rapid structure solution is growing, and automated methods are being deployed at all stages of the structure determination process. These new technologies include cell-free methods for protein production (Sawasaki et al., 2002), the use of robotics to screen thousands of crystallization conditions (Snell et al., 2004) and new software for automated building of macromolecular models based on the electron-density map (Cowtan, 2006; DiMaio et al., 2006; Ioerger and Sacchettini, 2003; Morris et al., 2003; Terwilliger, 2002). The last problem is addressed in this study.

2.1 Density-map interpretation

A beam of X-rays scattered by a crystalline lattice produces a pattern of reflections, which are measured by a detector. Given complete information, i.e. both the amplitudes and the phases of the reflected photons, one can reconstruct the electron-density map as the Fourier transform of these complex-valued reflections. However, the detector can only measure the intensities of the reflections and not the phases. Thus, a fundamental problem of crystallography lies in approximating the unknown phases. Our aim is the construction of an all-atom protein model that best fits a given electron-density map based on approximate phasing.

The electron-density map is defined on a 3D grid of points covering the unit cell, which is the basic repeating unit in the protein crystal. A crystallographer, given the amino acid sequence of the protein, attempts to place the amino acids in the unit cell, based on the shape of the electron-density contours. Figure 1 shows the electron-density map as an isocontoured surface. This figure also shows two models of atomic positions consistent with the electron density, where sticks indicate bonds between atoms.

[Fig. 1. An overview of density-map interpretation. The density map is illustrated with contours enclosing regions of higher density; the protein model uses sticks to indicate bonds between atoms. The figure shows two protein models fit to the density map, one darker and one lighter.]

The quality of an electron-density map is limited by its resolution, which, at its high limit, corresponds to the smallest interplanar distance between diffracting planes. The highest resolution for a dataset depends on the order in the crystalline packing, the detector sensitivity and the brightness of the X-ray source. Figure 2 illustrates the electron density around a tryptophan sidechain at varying resolution with 'ideal' phases computed from a complete all-atom model. Note that at 1 Å resolution the spheres of individual atoms are clearly visible, while at 4 Å even the overall shape of the tryptophan sidechain is distorted. Typical resolution for protein structures lies in the 1.5–2.5 Å range.

[Fig. 2. The effect of varying resolution on the electron density of a tryptophan sidechain, with phases computed from a final atomic model. The effects of phase error are similar to worsening the resolution.]

Another factor that affects the quality of an electron-density map is the accuracy of the computed phases. To obtain an initial approximation of the phases, crystallographers use techniques based on the special features in X-ray scattering produced by heavy atoms, such as multiple-wavelength or single-wavelength anomalous diffraction (MAD or SAD) and multiple isomorphous replacement (MIR). This allows the computation of an initial electron-density map, the quality of which greatly depends on the fidelity of the initial phasing. Artifacts produced by phase error are similar to those of worsening resolution; additionally, high-spatial-frequency noise is also present. The interpretation of a poorly phased map can be very difficult even for a trained expert.
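To make the relationship between reflections and the map concrete, the following Python sketch (not part of the original paper) builds a density grid from lists of Miller indices, amplitudes and phases by placing complex structure factors on a reciprocal-space grid and applying an inverse FFT. The function name and arguments are illustrative, and unit-cell symmetry and scaling are ignored.

    import numpy as np

    def density_from_structure_factors(hkl, amplitudes, phases_deg, grid_shape):
        """Assemble complex reflections F = |F| * exp(i * phi) on a reciprocal-space
        grid and inverse-Fourier-transform them into a real-space density grid."""
        F = np.zeros(grid_shape, dtype=complex)
        for (h, k, l), amp, phi in zip(hkl, amplitudes, np.deg2rad(phases_deg)):
            F[h % grid_shape[0], k % grid_shape[1], l % grid_shape[2]] = amp * np.exp(1j * phi)
            # Friedel mate F(-h,-k,-l) = conj(F(h,k,l)) keeps the map real-valued
            F[-h % grid_shape[0], -k % grid_shape[1], -l % grid_shape[2]] = amp * np.exp(-1j * phi)
        return np.fft.ifftn(F).real

With only measured amplitudes, the phases_deg argument must come from experimental phasing (MAD, SAD or MIR) or from a partial model, which is exactly why the quality of the initial phases dominates the quality of the map.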
2.2 ACMI's probabilistic protein backbone tracing

We previously developed a method, ACMI, which produces high-confidence backbone traces from poor-quality maps. Given a density map and the protein's amino acid sequence, ACMI constructs a probabilistic model of the location of each Cα. Statistical inference on this model gives the most probable backbone trace of the given sequence in the density map.

ACMI models a protein using a pairwise Markov field. As illustrated in Figure 3, this approach defines the probability distribution of a set of variables on an undirected graph. Each vertex in the graph is associated with one or more variables, and the probability of some setting of these variables is the product of potential functions associated with vertices and edges in the graph.

[Fig. 3. A sample undirected graphical model corresponding to some protein (...GLU-SER-ALA-THR-ALA...). The probability of some backbone model is proportional to the product of potential functions: one associated with each vertex and one with each edge in the fully connected graph.]

In ACMI's protein model, vertices correspond to individual amino acid residues, and the variables associated with each vertex correspond to an amino acid's Cα location and orientation. The vertex potential at each node i can be thought of as a 'prior probability' on each alpha carbon's location given the density map and ignoring the locations of other amino acids. In this model, the probability of some backbone conformation B = {b_1, ..., b_N}, given density map M, is given as

    P(B | M) ∝ ∏_{amino acid i} ψ_i(b_i | M) × ∏_{amino acids i,j} ψ_ij(b_i, b_j)    (1)

To compute the vertex potentials, ACMI considers a 5mer (a 5-amino-acid sequence) centered at each position in the protein sequence and searches a non-redundant subset of the Protein Data Bank (PDB) (Wang and Dunbrack, 2003) for observed conformations of that 5mer, using the computed density (conditioned on the map resolution) of each conformation as a search target. An improvement to our original approach (DiMaio et al., 2007) uses spherical harmonic decomposition to rapidly search over all rotations of each search target.

The edge potentials ψ_ij associated with each edge model global spatial constraints on the protein. ACMI defines two types of edge potentials: adjacency constraints ψ_adj model interactions between adjacent residues (in the primary sequence), while occupancy constraints ψ_occ model interactions between residues distant on the protein chain (though not necessarily spatially distant in the folded structure). Adjacency constraints make sure that Cα's of adjacent amino acids are about 3.8 Å apart; occupancy constraints make sure no two amino acids occupy the same 3D space. Multiple subunits in the asymmetric unit are handled by fully connecting each subunit with occupancy-constraining edges.

A fast approximate-inference algorithm finds likely locations of each Cα, given the vertex and edge potentials. For each amino acid in the provided protein sequence, ACMI's inference algorithm returns the marginal distribution of that amino acid's Cα location: that is, the probability distribution taking into account the positions of all other amino acids. Our previous work shows that ACMI produces more accurate backbone traces than alternative approaches (DiMaio et al., 2006). Also, ACMI is less prone to missing pieces in the model, because locations of amino acids not visible in the density map are inferred from the locations of neighboring residues.
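As a concrete reading of Equation (1), the sketch below (not from the original paper) scores a candidate backbone trace by multiplying vertex beliefs with simple stand-in edge potentials: a Gaussian adjacency term centered at 3.8 Å and a hard-sphere occupancy term. The functional forms and the vertex_belief callback are assumptions for illustration; ACMI's actual potentials are more involved.

    import numpy as np

    def log_backbone_score(b, vertex_belief, adj_sigma=0.5, clash_radius=3.0):
        """Unnormalized log-probability of a backbone trace b (an N x 3 array of
        C-alpha coordinates) under a pairwise Markov field in the spirit of Eq. (1).
        vertex_belief(i, x) returns residue i's belief at position x."""
        n = len(b)
        score = 0.0
        for i in range(n):
            score += np.log(vertex_belief(i, b[i]) + 1e-12)        # vertex potentials
        for i in range(n):
            for j in range(i + 1, n):
                d = np.linalg.norm(b[i] - b[j])
                if j == i + 1:
                    score += -0.5 * ((d - 3.8) / adj_sigma) ** 2   # adjacency potential
                elif d < clash_radius:
                    return -np.inf                                 # occupancy violation
        return score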
2.3 Other approaches

Several methods have been developed to automatically interpret electron-density maps. Given high-quality data (up to 2.3 Å resolution), one widely used algorithm is ARP/wARP (Morris et al., 2003). This atom-based method heuristically places atoms in the map, connects them and refines their positions. To handle poor phasing, ARP/wARP alternates steps in which (a) a model is built based on a density map and (b) the map is improved using phases from the iteratively refined model.

Other methods have been developed to handle low-resolution density maps, where atom-based approaches like ARP/wARP fail to produce a reasonable model. Ioerger's TEXTAL (Ioerger and Sacchettini, 2003) and CAPRA (Ioerger and Sacchettini, 2002) interpret poor-resolution density maps using ideas from pattern recognition. Given a sphere of density from an uninterpreted density map, both employ a set of rotation-invariant statistical features to aid in interpretation. CAPRA uses a trained neural network to identify Cα locations. TEXTAL performs a rotational search to place sidechains, using the rotation-invariant features to identify sidechain types. RESOLVE's automated model-building routine (Terwilliger, 2002) uses a hierarchical procedure in which helices and strands are located by an exhaustive search. High-scoring matches are extended iteratively using a library of tripeptides; these growing chains are merged using a heuristic. BUCCANEER (Cowtan, 2006) is a newer probabilistic approach to interpreting poor-quality maps; currently, the algorithm only constructs a main-chain trace.

At lower resolution and with greater phase error, these methods have difficulty in chain tracing and especially in correctly identifying amino acids. Unlike ACMI's model-based approach, they first build a backbone model, then align the protein sequence to it. At low resolutions, this alignment often fails, resulting in the inability to correctly identify sidechain types. These approaches also have a tendency to produce disjointed chains in poor-resolution maps, which requires significant human labor to repair.

3 METHODS

For each amino acid i, ACMI's probabilistic inference returns the marginal probability distribution p̂_i(b_i) of that amino acid's Cα position. Previously, we computed the backbone trace B = {b_1, ..., b_N} (where b_i describes the position and rotation of amino acid i) as the position of each Cα that maximized ACMI's belief,

    b_i = argmax_{b_i} p̂_i(b_i)    (2)

One obvious shortcoming in this previous approach is that biologists are interested not just in the position b_i of each Cα, but in the location of every (non-hydrogen) atom in the protein. Naively, we could take ACMI's most-probable backbone model and simply attach the best-matching sidechain from a library of conformations to each of the model's Cα positions. In Section 4, we show that such a method works reasonably well. Another issue is that the marginal distributions are computed on a grid, which may lead to non-physical distances between residues when Cα's are placed on the nearest grid points. Additionally, ACMI's inference is approximate, and errors due to these approximations may produce an incorrect backbone trace, with two adjacent residues located on opposite sides of the map.

Another problem deals with using a 'maximum-marginal backbone trace', i.e. independently selecting the position of each residue to maximize the marginal. A density map that contains a mixture of (physically feasible) protein conformations may have a maximum-marginal conformation that is physically unrealistic. Representing each amino acid's position as a distribution over the map is very expressive; simply returning the Cα position that maximizes the marginal ignores a lot of information. This section details the application of particle filtering to 'explain' the density map using multiple, physically feasible models.
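A minimal sketch of the maximum-marginal trace of Equation (2), assuming each residue's marginal is stored as a 3D array over the map grid (the grid_origin and grid_spacing arguments are illustrative):

    import numpy as np

    def maximum_marginal_trace(marginals, grid_origin, grid_spacing):
        """Independently pick, for each residue, the grid point that maximizes its
        ACMI marginal, and convert the grid index to Cartesian coordinates."""
        trace = []
        for p in marginals:
            idx = np.unravel_index(np.argmax(p), p.shape)
            trace.append(np.asarray(grid_origin) + grid_spacing * np.asarray(idx, dtype=float))
        return np.array(trace)   # N x 3 C-alpha positions, one per residue

Because each residue is chosen independently, nothing prevents adjacent Cα's from landing far apart or on top of one another; this is exactly the shortcoming the particle filter addresses.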
3.1 Particle-filtering overview

We will use a particle-filtering method called statistical importance resampling (SIR) (Arulampalam et al., 2001; Doucet et al., 2000), which approximates the posterior probability distribution of a state sequence x_{1:K} = {x_1, ..., x_K} given observations y_{1:K} as the weighted sum of a finite number of point estimates x^(i)_{1:K}:

    p(x_{1:K} | y_{1:K}) ≈ Σ_i w_i δ(x_{1:K} − x^(i)_{1:K})    (3)

Here, i is the particle index, w_i is particle i's weight, K is the number of states (here the number of amino acids) and δ is the Dirac delta function. In our application, x_k describes the position of every non-hydrogen atom in amino acid k; y_k is a 3D region of density in the map.

In our work, the technical term 'particle' refers to one specific 3D layout of all the non-hydrogen atoms in a contiguous subsequence of the protein (e.g. from amino acid 21 to 25). PF represents the distribution of some subsequence's layout using a set of distinct layouts for that subsequence (in other words, what we are doing is illustrated in Fig. 1, where each protein model is a single particle).

At each iteration of particle filtering, we advance the extent of each particle by one amino acid. For example, given x_{21:25} = {x_21, ..., x_25}, the position of all atoms in amino acids 21 through 25 (we will use this shorthand notation for a particle throughout the article), PF samples the position of the next amino acid, in this case x_26. Ideally, particle filtering would sample these positions from the posterior distribution: the probability of x_26's layout given the current particle and the map. SIR is based on the assumption that this posterior is difficult to sample directly, but easy to evaluate (up to proportionality). Given some other function (called the importance function) that approximates the posterior, particle filtering samples from this function instead, then uses the ratio of the posterior to the importance function to reweight the particles.

To give an example of an importance function, particle-filtering applications often use the prior conditional distribution p(x_k | x_{k−1}) as the importance function. After sampling, the data y_k are used to weight each particle. In our application, this is analogous to placing an amino acid's atoms using only the layout of the previous amino acid, then reweighting by how well it fits the density map.

We use a particle resampling step to address the problem of degeneracy in the particle ensemble (Kong et al., 1994). As particles are extended, the variance of particle weights tends to increase, until there are few particles with non-negligible weights and much effort is spent updating particles with little or no weight. To ameliorate this problem, an optional resampling step samples (with replacement) a new set of N particles at each iteration, with the probability of selecting a particle proportional to its weight. This ensures most particles remain on high-likelihood trajectories in state space.

What makes SIR (and particle-filtering methods in general) different from Markov chain Monte Carlo (MCMC) is that MCMC is concerned with the stationary distribution of the Markov chain. In particle filtering, one is not concerned with convergence of the point estimates; rather, the distribution is simply modeled by the ensemble of particles, whether or not they converge.
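A minimal sketch of the weight-proportional resampling step described above (multinomial resampling with replacement; particle objects are treated as opaque):

    import numpy as np

    def resample_particles(particles, weights, rng=None):
        """Draw N particles with replacement, each chosen with probability
        proportional to its weight, and reset all weights to 1/N."""
        rng = rng or np.random.default_rng()
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        n = len(particles)
        idx = rng.choice(n, size=n, replace=True, p=w)
        return [particles[i] for i in idx], np.full(n, 1.0 / n)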
3.2 Protein particle model

An overview of our entire algorithm appears in Algorithm 1, below. For density-map interpretation, we use the variable x_k to denote the position of every atom in amino acid k. We want to find the complete (all-atom) protein model x_{1:K} that best explains the observed electron-density map y. To simplify, we parameterize x_k as a Cα location b_k [the same as b_i in Equation (2)] and a sidechain placement s_k. The sidechain placement identifies the 3D location of every non-hydrogen sidechain atom in amino acid k, as well as the positions of the backbone atoms C, N and O.

Algorithm 1: ACMI-PF's algorithm for growing a protein model

    input:  density map y, amino acid marginals p̂_k(b_k)
    output: set of protein models x^(i)_{1:K} and weights w^(i)_K

    // start at some amino acid with high certainty about its location
    choose k such that p̂_k(b_k) has minimum entropy
    foreach particle i = 1 ... N do
        choose b_k^(i) at random from p̂_k(b_k)
        w_k^(i) ← 1/N
    end
    foreach residue k do
        foreach particle i = 1 ... N do
            // choose b_{k+1} (or b_{k-1}) given b_k^(i)
            {b*_{k+1}^m} ← choose M samples from ψ_adj(b_{k+1}; b_k^(i))
            w*_{k+1}^m ← belief p̂_{k+1}(b*_{k+1}^m)
            b_{k+1}^(i) ← choose b*_{k+1}^m with probability ∝ w*_{k+1}^m
            w_{k+1}^(i) ← w_k^(i) × Σ_{m=1..M} w*_{k+1}^m
            // choose s_k given b^(i)_{k-1:k+1}
            {s*_k^l} ← sidechain conformations for amino acid k
            p_null^l ← prob. that cc(s*_k^l, EDM[b_k]) occurred by chance
            s_k ← choose s*_k^l with probability ∝ 1/p_null^l − 1
            w_{k+1}^(i) ← w_{k+1}^(i) × Σ_{l=1..L} (1/p_null^l − 1)
        end
    end
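A minimal sketch of the state carried by one particle under this parameterization; the class and field names are hypothetical, not ACMI-PF's actual data structures:

    from dataclasses import dataclass, field
    from typing import List, Optional
    import numpy as np

    @dataclass
    class ResiduePlacement:
        """x_k for one residue: a C-alpha location b_k plus a sidechain placement s_k,
        here an index into a library of observed conformations and a rotation."""
        b: np.ndarray                            # C-alpha coordinates, shape (3,)
        sidechain_index: Optional[int] = None    # entry in the sidechain database
        rotation: Optional[np.ndarray] = None    # 3x3 rotation aligning the template

    @dataclass
    class Particle:
        """One particle: an all-atom layout for a contiguous subsequence of the
        protein (e.g. residues 21-25) together with its importance weight."""
        first_residue: int
        placements: List[ResiduePlacement] = field(default_factory=list)
        weight: float = 1.0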
Given this parameterization, the Markov process alternates between placing (a) Cα positions and (b) sidechain atoms. That is, an iteration of particle filtering first samples b_{k+1} (the Cα of amino acid k+1) given b_k, or alternatively, growing our particle toward the N-terminus, samples b_{k−1} given b_k. Then, given the triple b_{k−1:k+1}, we sample sidechain conformation s_k.

3.2.1 Using ACMI-computed marginals to place Cα's

In our algorithm's backbone step we want to sample the Cα position b_{k+1} (or b_{j−1}), given our growing trace b_{j:k}, for each particle i. That is, we want to define our sampling function q(b_{k+1} | b_j^(i), ..., b_k^(i), y). Doucet et al. (2000) define the optimal sampling function as the conditional marginal distribution

    q(b_{k+1} | b_j^(i), ..., b_k^(i), y) = p(b_{k+1} | b_k^(i), y)    (4)

While it is intractable to compute Equation (4) exactly, it is straightforward to estimate using ACMI's Markov field model

    p(b_{k+1} | b_k^(i), y) ∝ p(b_{k+1}, b_k^(i) | y) / p(b_k^(i) | y) = p̂_{k+1}(b_{k+1}) ψ_adj(b_{k+1}, b_k^(i))    (5)

Here, p̂_{k+1}(b_{k+1}) is the ACMI-computed marginal distribution for amino acid k+1 (p̂_{k+1}'s dependence on y is dropped for clarity). We sample Cα k+1's location from the product of (a) k+1's marginal distribution and (b) the adjacency potential between Cα k and Cα k+1.

The optimal sampling function has a corresponding weight update

    w_{k+1}^(i) ∝ w_k^(i) ∫ p(y_{k+1} | b_{k+1}, b_k^(i)) db_{k+1}    (6)

This integral, too, is intractable to compute exactly, but can be approximated using ACMI's marginals

    w_{k+1}^(i) ∝ w_k^(i) ∫ p̂_{k+1}(b_{k+1}) ψ_adj(b_{k+1}, b_k^(i)) db_{k+1}    (7)

Equations (5) and (7) suggest a sampling approach to the problem of choosing the location of Cα k+1 and reweighting each particle. This sampling approach, shown in Algorithm 1, is illustrated pictorially in Figure 4.

[Fig. 4. An overview of the backbone forward-sampling step. Given positions b_{k−1} and b_k, we sample M positions for b_{k+1} using the empirically derived distribution of Cα-Cα-Cα pseudoangles. Each potential b*_{k+1}^m is weighted by the belief p̂(b_{k+1} | y). We choose a single location from this distribution; the particle weight is multiplied by the sum of these weights in order to approximate Equation (6).]

We sample M potential Cα locations from ψ_adj(b_{k+1}, b_k^(i)), the adjacency potential between k and k+1, which models the allowable conformations between two adjacent Cα's. We assign each sample a weight: the approximate marginal probability p̂_{k+1} at each of these sampled locations. We select a sample from this weighted distribution, approximating Equation (5). Finally, we reweight our particle by the sum of weights of all the samples we considered. This sum approximates the integral in Equation (7).

This process, in which we consider M potential Cα locations, is repeated for every particle in the particle filter for each Cα in the protein. For every particle, we begin by sampling locations for the amino acid k whose marginal distribution has the lowest entropy (we use a soft-minimum to introduce randomness in the order in which amino acids are placed). This corresponds to the amino acid whose location ACMI is most sure of. The direction we sample at each iteration (i.e. toward the N- or C-terminus) is also decided by the entropy of the marginal distributions.
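The sketch below (illustrative, not the authors' code) implements one backbone extension step in the sense of Equations (5) and (7). The callbacks sample_adjacency (draws a candidate Cα position for residue k+1 given b_k) and marginal_next (evaluates p̂_{k+1}) are assumptions standing in for ACMI's potentials.

    import numpy as np

    def extend_backbone(b_k, weight, sample_adjacency, marginal_next, M=100, rng=None):
        """Draw M candidate C-alpha positions from the adjacency potential, weight each
        by the ACMI marginal, pick one in proportion to its weight (Eq. 5), and multiply
        the particle weight by the sum of candidate weights, a Monte Carlo estimate of
        the integral in Eq. (7) up to a constant that cancels when weights are normalized."""
        rng = rng or np.random.default_rng()
        candidates = np.array([sample_adjacency(b_k) for _ in range(M)])   # M x 3
        cand_w = np.array([marginal_next(c) for c in candidates])
        total = cand_w.sum()
        if total <= 0.0:
            return None, 0.0            # no supported candidate: the particle dies
        pick = rng.choice(M, p=cand_w / total)
        return candidates[pick], weight * total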
3.2.2 Using sidechain templates to sample sidechains

Once our particle filter has placed Cα's k−1, k and k+1 at 3D locations b^(i)_{k−1:k+1}, it is ready to place all the sidechain atoms in amino acid k. We denote the position of these sidechain atoms s_k. Given the primary amino acid sequence around k, we consider all previously observed conformations (i.e. those in the PDB) of sidechain k. Thus, s_k consists of (a) an index into a database of known sidechain 3D structures and (b) a rotation.

To further simplify, each sidechain template models the position of every atom from Cα_{k−1} to Cα_{k+1}. Then, given three consecutive backbone positions b^(i)_{k−1:k+1}, the orientation of sidechain s_k is determined by aligning the three Cα's in the sidechain template to b^(i)_{k−1:k+1}.

As Algorithm 1 shows, sidechain placement is quite similar to the Cα placement in the previous section. One key difference is that sidechain placement cannot take advantage of ACMI's marginal distribution, as ACMI's probability distributions have marginalized away sidechain conformations. Instead, the probability of a sidechain is calculated on-the-fly using the correlation coefficient between a potential conformation's calculated density and a region around b_k in the density map.

Figure 5 illustrates the process of choosing a sidechain conformation for a single particle i. We consider each of L different sidechain conformations for amino acid k. For each sidechain conformation s*_k^l, l ∈ {1, ..., L}, we compute the correlation coefficient between the conformation and the map

    CC^l = cross_correlation(s*_k^l, EDM[b_k^(i)])

where EDM[b_k] denotes a region of density in the neighborhood of b_k. To assign a probability p(EDM[b_k^(i)] | s*_k^l) to each sidechain conformation, we compute the probability that a cross-correlation value was not generated by chance. That is, we assume that the distribution of the cross correlation of two random functions is normally distributed with mean μ and variance σ². We learn these parameters by computing correlation coefficients between randomly sampled locations in the map. Given some cross correlation x_c, we compute the expected probability that we would see score x_c or higher by random chance,

    p_null(x_c) = P(X ≥ x_c; μ, σ) = 1 − Φ((x_c − μ)/σ)    (8)

Here, Φ(x) is the normal cumulative distribution function. The probability of a particular sidechain conformation is then

    p(EDM[b_k^(i)] | s*_k^l) ∝ 1/p_null − 1    (9)

[Fig. 5. An overview of the sidechain sampling step. Given positions b_{k−1:k+1}, we consider L sidechain conformations s_k. Each potential conformation is weighted by the probability of the map given the sidechain conformation, as given in Equation (9). We choose a sidechain from this distribution; the particle weight is multiplied by the sum of these weights.]

Since we are drawing sidechain conformations from the distribution of all solved structures, we assume a uniform prior distribution on sidechain conformations, so p(s*_k^l | EDM[b_k^(i)]) ∝ p(EDM[b_k^(i)] | s*_k^l). As illustrated in Figure 5, sidechain sampling uses a similar method to the previous section's backbone sampling. We consider extending our particle by each of the L sidechain conformations {s*_k^1, ..., s*_k^L} sampled from our sidechain database. After computing the correlation between each sidechain conformation's density and the density map around b_k, each conformation is weighted using Equation (9). We choose a single conformation at random from this weighted distribution, updating each particle's weight by the sum of weights of all considered sidechain conformations.

Finally, our model takes into account the partial model x_{j:k−1} when placing sidechain s_k. If any atom in sidechain s_k overlaps a previously placed atom or any symmetric copy, the particle weight is set to zero.

3.3 Crystallographic data

Ten experimentally phased electron-density maps, provided by the Center for Eukaryotic Structural Genomics (CESG) at UW-Madison, have been used to test ACMI-PF. The maps were initially phased using AUTOSHARP (Terwilliger, 2002), with non-crystallographic symmetry averaging and solvent flattening (in RESOLVE) used to improve the map quality where possible. The 10 maps were selected as the 'most difficult' from a larger dataset of 20 maps provided by CESG. These structures have been previously solved and deposited to the PDB, enabling a direct comparison with the final refined model. All 10 required a great deal of human effort to build and refine the final atomic model.

The data are summarized in Table 1, with quality described by the resolution and phase error. The resolution from the initial phasing may not have reached the resolution limit of the dataset. Initial low-resolution phasing was computationally extended in three structures (using an algorithm in RESOLVE). Mean phase error was computed using CCP4 (Collaborative Computational Project, 1994) by comparing calculated phases from the deposited model with those in the initially phased dataset.

Table 1. Summary of crystallographic data

    PDB ID | AAs in ASU | Molecules in ASU | Resolution (Å) | Phase error (°)
    2NXF   |  322       | 1                | 1.9            | 58
    2Q7A   |  316       | 2                | 2.6            | 49
    XXXX   |  566       | 2                | 2.65           | 54
    1XRI   |  430       | 2                | 3.3            | 39
    1ZTP   |  753       | 3                | 2.5            | 42
    1Y0Z   |  660       | 2                | 2.4 (3.7)      | 58
    2A3Q   |  340       | 2                | 2.3 (3.5)      | 66
    2IFU   | 1220       | 4                | 3.5            | 50
    2BDU   |  594       | 2                | 2.35           | 55
    2AB1   |  244       | 2                | 2.6 (4.0)      | 66

Notes: phase error is averaged over all resolution shells; parenthesized resolutions indicate phasing extended from lower resolution; XXXX denotes a PDB file not yet released; for one entry, a different dataset was used to solve the PDB structure.
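A minimal sketch of the sidechain weighting of Equations (8) and (9), assuming the null-model mean and standard deviation have already been estimated from random map locations (cc_scores holds the L cross-correlation values):

    import math
    import numpy as np

    def sidechain_weights(cc_scores, mu_null, sigma_null):
        """For each candidate conformation, compute p_null, the chance of seeing a
        cross-correlation this high at a random map location under a normal null
        model (Eq. 8), and return the weight 1/p_null - 1 of Eq. (9)."""
        weights = []
        for cc in cc_scores:
            z = (cc - mu_null) / sigma_null
            phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF
            p_null = max(1.0 - phi, 1e-12)                     # floored to avoid division by zero
            weights.append(1.0 / p_null - 1.0)
        return np.array(weights)

As in Algorithm 1, one conformation is then chosen with probability proportional to these weights, and the particle weight is multiplied by their sum.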
3.4 Computational methodology

Models in ACMI-PF are built in three phases: (a) prior distributions are computed, (b) ACMI infers posterior distributions for each Cα location and (c) all-atom models are constructed using particle filtering. Where available, ACMI used the location of selenium atom peaks as a soft constraint on the positions of methionine residues. Particle filtering was run 10 times; in each run, the single highest-weight model was returned, producing a total of 10 ACMI-PF protein models. Predicted models were refined for 10 iterations using REFMAC5 (Murshudov et al., 1997) with no modification or added solvent. The first step is the most computationally expensive, but is efficiently divided across multiple processors. Computation time varied depending on protein size; the entire process took at most a week of CPU time on 10 processors.

We compare ACMI-PF to four different approaches using the same 10 density maps. To test the utility of the particle-filtering method for building all-atom models, we use the structure that results from independently placing the best-matching sidechain on each Cα predicted by ACMI, which we term ACMI-NAIVE. The other three approaches are the commonly used density-map interpretation algorithms ARP/wARP (version 7), TEXTAL (in PHENIX version 1.1a) and RESOLVE (version 2.10). Refinement for all algorithms uses the same protocol as ACMI-PF, refining the predicted models for 10 iterations in REFMAC5 (ARP/wARP, which integrates refinement and model building, was not further refined).

To assess the prediction quality of each algorithm, we consider three different performance metrics: (a) backbone completeness, (b) sidechain identification and (c) R factor. The first metric compares the predicted model to the deposited model, counting the fraction of Cα's placed within 2 Å of some Cα in the PDB-deposited model. The second measure counts the fraction of Cα's both correctly placed within 2 Å and whose sidechain type matches the PDB-deposited structure. Finally, the R factor is a measure of deviation between the reflection intensities predicted by the model and those experimentally measured; a lower R factor indicates a better model. The R factor is computed using only peptide atoms (i.e. no added water molecules). The comparison uses the so-called free R factor (Brunger, 1992), which is based on reflections that were not used in refinement.
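The following sketch shows one plausible implementation of the first two metrics; it is our reading of the definitions above, not the authors' evaluation script. pred_types and ref_types are residue-type labels aligned with the corresponding coordinate arrays.

    import numpy as np

    def completeness_metrics(pred_ca, pred_types, ref_ca, ref_types, cutoff=2.0):
        """Backbone completeness: fraction of deposited C-alphas with a predicted
        C-alpha within `cutoff` angstroms. Sidechain identification: fraction that
        are both covered and whose nearest predicted residue has the deposited type."""
        pred_ca = np.asarray(pred_ca)
        placed = identified = 0
        for ca, aa in zip(np.asarray(ref_ca), ref_types):
            d = np.linalg.norm(pred_ca - ca, axis=1)   # distances to all predicted C-alphas
            j = int(np.argmin(d))
            if d[j] <= cutoff:
                placed += 1
                if pred_types[j] == aa:
                    identified += 1
        n = len(ref_types)
        return placed / n, identified / n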
averaging and solvent flattening (in RESOLVE) used to improve the map There is evidence that a single conformation is insufficient to quality where possible. The 10 maps were selected as the ‘most difficult’ model protein electron density (Burling and Brunger, 1994; from a larger dataset of 20 maps provided by CESG. These structures DePristo et al., 2004; Furnham et al., 2006; Levin et al., 2007). have been previously solved and deposited to the PDB, enabling a direct As comparison, we take ACMI-NAIVE, which uses the maximum- comparison with the final refined model. All 10 required a great deal of marginal trace to produce a single model. human effort to build and refine the final atomic model. We use ACMI-PF to generate multiple physically feasible The data are summarized in Table 1 with quality described by the models, by performing 10 different ACMI-PF runs of 100 resolution and phase error. The resolution from the initial phasing may not have reached the resolution limit of the dataset. Initial low- particles each. Each run sampled amino acids in a different resolution phasing was computationally extended in three structures order; amino acids whose belief had lowest entropy (i.e. those (using an algorithm in RESOLVE). Mean phase error was computed using we are most confident we know) were stochastically preferred. CCP4 (Collaborative Computational Project, 1994) by comparing Figure 6 summarizes the results. The y axis shows the average calculated phases from the deposited model with those in the initially (over the 10 maps) R of the final refined model; the x axis free phased dataset. indicates the number of ACMI-PF runs. This plot shows that a single ACMI-PF model has an R approximately equal to free 3.4 Computational methodology the R of ACMI-NAIVE. Model completeness is also very close free between the two (data not shown). As additional structures Models in ACMI-PF are built in three phases: (a) prior distributions are are added ACMI-PF’s model, average R decreases. The plot computed, (b) ACMI infers posterior distributions for each C location free and (c) all-atom models are constructed using particle filtering. Where shows ACMI-NAIVE’s model as a straight line, since there is no 2856 Creating protein models 0.50 100% % backbone placed % sidechains identified 80% 0.40 60% 0.30 40% Acmi-PF Acmi-Naive 20% 0.20 12 34 56789 10 0% Number of structures in model ACMI-PF ARP/wARP Resolve Textal Fig. 6. A comparison of the R of ACMI-NAIVE and ACMI-PF, as the free Fig. 7. A comparison of ACMI-PF to other automatic interpretation number of protein models produced varies. Multiple models are methods in terms of average backbone completeness and sidechain produced by independent ACMI-PF runs (ACMI-NAIVE only produces a identification. single model). Since R in deposited structures is typically 0.20–0.25, free we use 0.20 as the lowest value on the y axis. Scatterplots in Figure 8 compare the R of ACMI-PF’s free mechanism to generate multiple conformations. We believe a complete (10-structure) model to each of the three alternative key reason for this result is that particle filtering occasionally approaches, for each density map. Any point below the makes mistakes when tracing the main chain, but it is unlikely diagonal corresponds to a map for which ACMI-PF’s solution for multiple PF runs to repeat the same mistake. The mistakes has a lower (i.e. better) R . These plots show that for all free average out in the ensemble, producing a lower R factor. 
4.2 Comparison to other algorithms

We further compare the models produced by particle filtering on the 10 maps to those produced by three other methods for automatic density-map interpretation, including two well-established lower-resolution algorithms, TEXTAL and RESOLVE, and the atom-based ARP/wARP (although most of our maps are outside of its recommended resolution).

Figure 7 compares all four methods in terms of backbone completeness and sidechain identification, averaged over all 10 structures. To provide a fair comparison, we compute completeness of a single ACMI-PF structure (of the 10 produced); the ACMI-PF model chosen was the one with the lowest refined R_work. Under both of these metrics, ACMI-PF locates a greater fraction of the protein than the other approaches. ACMI-PF performs particularly well at sidechain identification, correctly identifying close to 80% of sidechains over these 10 poor-quality maps. The least accurate model that ACMI-PF generated (for 2AB1) had 62% backbone completeness and 55% sidechain identification. In contrast, the three comparison methods all return at least five structures with below 40% backbone completeness and at least eight structures with below 20% sidechain identification.

[Fig. 7. A comparison of ACMI-PF to other automatic interpretation methods (ARP/wARP, RESOLVE, TEXTAL) in terms of average backbone completeness and sidechain identification, as a percent of the true model.]

Scatterplots in Figure 8 compare the R_free of ACMI-PF's complete (10-structure) model to each of the three alternative approaches, for each density map. Any point below the diagonal corresponds to a map for which ACMI-PF's solution has a lower (i.e. better) R_free. These plots show that for all but one map ACMI-PF's solution has the lowest R factor. The singular exception, for which ARP/wARP has a lower R factor, is 2NXF, a high-resolution (1.9 Å) but poorly phased density map in which ARP/wARP automatically traces 90% of the structure, while ACMI-PF's best model correctly predicts only 74%. Our results illustrate both the limitations and the advantages of ACMI-PF: it is consistently superior at interpretation of poorly phased, lower-resolution maps, while an iterative phase-improvement algorithm like ARP/wARP may be better suited for poorly phased but higher-resolution data.

[Fig. 8. A comparison of the free R factor of ACMI-PF's interpretation for each of the 10 maps versus (a) ARP/wARP, (b) TEXTAL and (c) RESOLVE. The scatterplots show each interpreted map as a point, where the x axis measures the R_free of ACMI-PF and the y axis that of the alternative approach (both axes ranging from 0.25 to 0.65).]
5 CONCLUSION

We develop ACMI-PF, an algorithm that uses particle filtering to produce a set of all-atom protein models for a given density map. Particle filtering grows an ensemble of all-atom protein models stepwise. The method builds on our previous work, in which we infer a probability distribution of each amino acid's Cα location. ACMI-PF addresses shortcomings of that work, producing a set of physically feasible protein structures that best explain the density map. Our results indicate that ACMI-PF generates more accurate and more complete models than other state-of-the-art automated interpretation methods for poor-resolution density-map data. ACMI-PF produces accurate interpretations, on average finding and identifying 80% of the protein structure in poorly phased 2.5–3.5 Å resolution maps.

Using ACMI-PF, an ensemble of conformations may be easily generated using multiple runs of particle filtering. We show that sets of multiple structures generated from multiple particle-filtering runs better fit the density map than a single structure. This is consistent with recent observations of the inadequacy of the single-model paradigm for modeling flexible protein molecules (Burling and Brunger, 1994; DePristo et al., 2004; Furnham et al., 2006) and with the encouraging results of the ensemble refinement approach (Levin et al., 2007). The ensemble description may also provide valuable information about protein conformational dynamics. As well, multiple conformations may be valuable for application of ACMI-PF in an iterative approach, where computed phases from an ACMI-PF model are used to build an updated density map, which is fed back into the ACMI pipeline.

ACMI-PF's model-based approach is very flexible and allows integration of multiple sources of 'fuzzy' information, such as locations of selenium peaks. In the future, it may be productive to integrate other sources of information into our model. A more complicated reweighting function based on physical or statistical energy could better overcome ambiguities of unclear regions in the density map. The inclusion of these and other sources of information is possible, so long as they can be expressed in the probabilistic framework proposed here. This could further extend the resolution at which automated interpretation of density maps is possible.

ACKNOWLEDGEMENTS

We acknowledge support from NLM T15-LM007359 (F.D., A.S., D.A.K.), NLM R01-LM008796 (F.D., J.W.S., G.N.P., D.A.K.) and NIH Protein Structure Initiative Grant GM074901 (E.B., C.A.B., G.N.P.).

Conflict of Interest: none declared.
REFERENCES

Arulampalam, M.S. et al. (2001) A tutorial on particle filters. IEEE Trans. Signal Process., 50, 174–188.
Berman, H.M. and Westbrook, J.D. (2004) The impact of structural genomics on the protein data bank. Am. J. Pharmacogenomics, 4, 247–252.
Brunger, A.T. (1992) Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature, 355, 472–475.
Burling, F.T. and Brunger, A.T. (1994) Thermal motion and conformational disorder in protein crystal structures: comparison of multi-conformer and time-averaging models. Isr. J. Chem., 34, 165–175.
Chandonia, J.M. and Brenner, S.E. (2006) The impact of structural genomics: expectations and outcomes. Science, 311, 347–351.
Collaborative Computational Project, Number 4 (1994) The CCP4 suite: programs for protein crystallography. Acta Crystallogr., D50, 760–763.
Cowtan, K. (2006) The Buccaneer software for automated model building. 1. Tracing protein chains. Acta Crystallogr., D62, 1002–1011.
DePristo, M.A. et al. (2004) Heterogeneity and inaccuracy in protein structures solved by X-ray crystallography. Structure, 12, 911–917.
DiMaio, F. et al. (2006) A probabilistic approach to protein backbone tracing in electron-density maps. Bioinformatics, 22, e81–e89.
DiMaio, F. et al. (2007) Improved methods for template-matching in electron-density maps using spherical harmonics. In Proceedings of the IEEE Conference on Bioinformatics and Biomedicine. IEEE Press, Fremont, CA.
Doucet, A. et al. (2000) On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput., 10, 197–208.
Furnham, N. et al. (2006) Is one solution good enough? Nat. Struct. Mol. Biol., 13, 184–185.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. PAMI, 6, 721–741.
Ioerger, T.R. and Sacchettini, J.C. (2002) Automatic modeling of protein backbones in electron density maps. Acta Crystallogr., D58, 2043–2054.
Ioerger, T.R. and Sacchettini, J.C. (2003) The TEXTAL system: artificial intelligence techniques for automated protein model building. Methods Enzymol., 374, 244–270.
Kong, A. et al. (1994) Sequential imputations and Bayesian missing data problems. J. Am. Stat. Assoc., 89, 278–288.
Levin, E.J. et al. (2007) Ensemble refinement of protein crystal structures. Structure, 15, 1040–1052.
Morris, R. et al. (2003) ARP/wARP and automatic interpretation of protein electron density maps. Methods Enzymol., 374, 229–244.
Murshudov, G.N. et al. (1997) Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr., D53, 240–255.
Sawasaki, T. et al. (2002) A cell-free protein synthesis system for high-throughput proteomics. Proc. Natl Acad. Sci. USA, 99, 14652–14657.
Snell, G. et al. (2004) Automated sample mounting and alignment system for biological crystallography at a synchrotron source. Structure, 12, 537–545.
Terwilliger, T.C. (2002) Automated main-chain model building by template-matching and iterative fragment extension. Acta Crystallogr., D59, 38–44.
Wang, G. and Dunbrack, R.L. (2003) PISCES: a protein sequence culling server. Bioinformatics, 19, 1589–1591.

Creating protein models from electron-density maps using particle-filtering methods

Loading next page...
 
/lp/oxford-university-press/creating-protein-models-from-electron-density-maps-using-particle-kI208UTDdM

References (26)

Publisher
Oxford University Press
Copyright
© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
eISSN
1367-4811
DOI
10.1093/bioinformatics/btm480
pmid
17933855
Publisher site
See Article on Publisher Site

Abstract

Vol. 23 no. 21 2007, pages 2851–2858 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/btm480 Structural bioinformatics Creating protein models from electron-density maps using particle-filtering methods 1,2, 3 4 1,2 Frank DiMaio , Dmitry A. Kondrashov , Eduard Bitto , Ameet Soni , 4 1,3,4 1,2 Craig A. Bingman , George N. Phillips, Jr and Jude W. Shavlik 1 2 3 Department of Computer Sciences, Department of Biostatistics and Medical Informatics, Department of Biochemistry and Center for Eukaryotic Structural Genomics, University of Wisconsin, Madison, WI 53706, USA Received and revised on August 31, 2007; accepted on September 20, 2007 Advance Access publication October 12, 2007 Associate Editor: Burkhard Rost ABSTRACT process, that may require weeks or months of an expert crystallographer’s time. Motivation: One bottleneck in high-throughput protein crystallo- Our previous work (DiMaio et al., 2006) developed the graphy is interpreting an electron-density map, that is, fitting a automatic interpretation tool ACMI (Automatic Crystallo- molecular model to the 3D picture crystallography produces. graphic Map Interpreter). ACMI employs probabilistic inference Previously, we developed ACMI (Automatic Crystallographic Map to compute a probability distribution of the coordinates of Interpreter), an algorithm that uses a probabilistic model to infer an each amino acid, given the electron-density map. However, accurate protein backbone layout. Here, we use a sampling method ACMI makes several simplifications, such as reducing each known as particle filtering to produce a set of all-atom protein amino acid to a single atom and confining the locations to a models. We use the output of ACMI to guide the particle filter’s coarse grid. In this work we introduce the use of a statistical sampling, producing an accurate, physically feasible set of sampling method called particle filtering (PF) (Doucet et al., structures. 2000) to construct all-atom protein models, by stepwise Results: We test our algorithm on 10 poor-quality experimental extension of a set of incomplete models drawn from a distri- density maps. We show that particle filtering produces accurate bution computed by ACMI. This results in a set of probability- all-atom models, resulting in fewer chains, lower sidechain RMS weighted all-atom protein models. The method interprets the error and reduced R factor, compared to simply placing the best- density map by generating a number of distinct protein confor- matching sidechains on ACMI’s trace. We show that our approach mations consistent with the data. We compare the single model produces a more accurate model than three leading methods— that best matches the density map (without knowing the true TEXTAL,RESOLVE and ARP/WARP—in terms of main chain complete- solution) with the output of existing automated methods, on ness, sidechain identification and crystallographic R factor. multiple sets of crystallographic data that required considerable Availability: Source code and experimental density maps available human effort to solve. We also show that modeling the data at http://ftp.cs.wisc.edu/machine-learning/shavlik-group/programs/ with a set of structures, obtained from several particle-filtering acmi/ runs, results in a better fit than using one structure from a single Contact: [email protected] particle-filtering run. 
Particle filtering enables the automated building of detailed atomic models for challenging protein crystal data with a more realistic representation of conforma- tional variation in the crystal. 1 INTRODUCTION Knowledge of the spatial arrangement of constituent atoms in a complex biomolecules, such as proteins, is vital for under- 2 PROBLEM OVERVIEW AND RELATED WORK standing their function. X-ray crystallography is the primary In recent years, considerable investment into structural geno- technique for determination of atomic positions, or the mics (i.e. high-throughput determination of protein structures) structure, of biomolecules. A beam of X-rays is diffracted by has yielded a wealth of new data (Berman and Westbrook, a crystal, resulting in a set of reflections that contain 2004; Chandonia and Brenner, 2006). The demand for rapid information about the molecular structure. This information structure solution is growing and automated methods are being can be interpreted to produce a 3D image of the macro- deployed at all stages of the structural determination process. molecule, which is usually represented by an electron-density These new technologies include cell-free methods for protein map. Interpretation of these maps requires locating the atoms production (Sawasaki et al., 2002), the use of robotics to in complex 3D images. This is a difficult, time-consuming screen thousands of crystallization conditions (Snell et al., 2004) and new software for automated building of macro- *To whom correspondence should be addressed. molecular models based on the electron-density map The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected] 2851 F.DiMaio et al.    GLU SER ALA THR ALA Fig. 3. A sample undirected graphical model corresponding to some protein. The probability of some backbone model is proportional to the product of potential functions: one associated with each vertex and one with each edge in the fully connected graph. Fig. 1. An overview of density-map interpretation. The density map is illustrated with contours enclosing regions of higher density; the protein a tryptophan sidechain at varying resolution with ‘ideal’ phases model uses sticks to indicate bonds between atoms. This figure shows computed from a complete all-atom model. Note that at 1 A two protein models fit to the density map, one darker and one lighter. resolution, the spheres of individual atoms are clearly visible, while at 4 A even the overall shape of the tryptophan sidechain is distorted. Typical resolution for protein structures lies in the 1.5–2.5 A range. Another factor that affects the quality of an electron-density map is the accuracy of the computed phases. To obtain an initial approximation of the phases, crystallographers use techniques based on the special features in X-ray scattering produced by heavy atoms, such as multiple-wavelength or single-wavelength anomalous diffraction (MAD or SAD) and Fig. 2. The effect of varying resolution on electron density of a trypto- multiple isomorphous replacement (MIR). This allows the phan sidechain, with phases computed from a final atomic model. computation of an initial electron-density map, the quality of The effects of phase error are similar to worsening the resolution. which greatly depends on the fidelity of the initial phasing. 
Artifacts produced by phase error are similar to those of (Cowtan, 2006; DiMaio et al., 2006; Ioerger and Sacchettini, worsening resolution; additionally, high spatial frequency noise 2003; Morris et al., 2003; Terwilliger, 2002). The last problem is also present. The interpretation of a poorly phased map can is addressed in this study. be very difficult even for a trained expert. 2.1 Density-map interpretation 2.2 ACMI’s probabilistic protein backbone tracing A beam of X-rays scattered by a crystalline lattice produces We previously developed a method, ACMI, which produces a pattern of reflections, which are measured by a detector. high-confidence backbone traces from poor-quality maps. Given complete information, i.e. both the amplitudes and Given a density map and the protein’s amino acid sequence, the phases of the reflected photons, one can reconstruct ACMI constructs a probabilistic model of the location of each C. Statistical inference on this model gives the most probable the electron-density map as the Fourier transform of these backbone trace of the given sequence in the density map. complex-valued reflections. However, the detector can only ACMI models a protein using a pairwise Markov field. measure the intensities of the reflections and not the phases. As illustrated in Figure 3, this approach defines the probability Thus, a fundamental problem of crystallography lies in approx- distribution of a set of variables on an undirected graph. imating the unknown phases. Our aim is the construction of Each vertex in the graph is associated with one or more an all-atom protein model that best fits a given electron-density variables and the probability of some setting of these variables map based on approximate phasing. is the product of potential functions associated with vertices The electron-density map is defined on a 3D grid of points and edges in the graph. covering the unit cell, which is the basic repeating unit in In ACMI’s protein model, vertices correspond to individual the protein crystal. A crystallographer, given the amino acid amino acid residues and the variables associated with each sequence of the protein, attempts to place the amino acids vertex correspond to an amino acid’s C location and in the unit cell, based on the shape of the electron-density orientation. The vertex potential at each node i can be contours. Figure 1 shows the electron-density map as an thought of as a ‘prior probability’ on each alpha carbon’s isocontoured surface. This figure also shows two models of location given the density map and ignoring the locations of atomic positions consistent with the electron density, where other amino acids. In this model, the probability of some sticks indicate bonds between atoms. backbone conformation B¼ {b , ... , b }, given density map M The quality of an electron-density map is limited by its 1 N is given as resolution, which, at its high limit, corresponds to the smallest Y Y interplanar distance between diffracting planes. The highest PðBjMÞ¼ ðbjMÞ ðb ; bÞ i i ij i j resolution for a dataset depends on the order in the crystal- amino acid i amino acids i;j line packing, the detector sensitivity and the brightness of the ð1Þ X-ray source. Figure 2 illustrates the electron density around 2852 Creating protein models ACMI’s considers a 5mer (5 amino acid sequences) are merged using a heuristic. 
BUCCANEER (Cowtan, 2006) is a centered at each position in the protein sequence and searches newer probabilistic approach to interpreting poor quality maps; a non-redundant subset of the Protein Data Bank (PDB) currently, the algorithm only constructs a main chain trace. (Wang and Dunbrack, 2003) for observed conformations At lower resolution and with greater phase error, these of that 5mer, using the computed density (conditioned on the methods have difficulty in chain tracing and especially in map resolution) of each conformation as a search target. correctly identifying amino acids. Unlike ACMI’s model-based An improvement to our original approach (DiMaio et al., 2007) approach, they first build a backbone model, then align the protein sequence to it. At low resolutions, this alignment often uses spherical harmonic decomposition to rapidly search over fails, resulting in the inability to correctly identify sidechain all rotations of each search target. The edge potentials associated with each edge model types. These approaches have a tendency to produce disjointed ij chains in poor-resolution maps, which requires significant global spatial constraints on the protein. ACMI defines two human labor to repair. types of edge potentials: adjacency constraints model adj interactions between adjacent residues (in the primary sequence), while occupancy constraints model interactions occ 3 METHODS between residues distant on the protein chain (though not necessarily spatially distant in the folded structure). Adjacency For each amino acid i,ACMI’s probabilistic inference returns the marginal probability distribution p^ ðbÞ of that amino acid’s C constraints make sure that C’s of adjacent amino acids i i position. Previously, we computed the backbone trace B¼ {b , .. . , b } are 3.8 A apart; occupancy constraints make sure no two 1 N (where b describes the position and rotation of amino acid i) as the amino acids occupy the same 3D space. Multiple subunits position of each C that maximized ACMI’s belief, in the asymmetric unit are handled by fully connected each subunit with occupancy-constraining edges. b ¼ arg max p^ ðbÞð2Þ i i A fast approximate-inference algorithm finds likely locations of each C, given the vertex and edge potentials. For each One obvious shortcoming in this previous approach is that biologists are interested in not just the position b of each C, but in the location amino acid in the provided protein sequence, ACMI’s inference of every (non-hydrogen) atom in the protein. Naı¨vely, we could take algorithm returns the marginal distribution of that amino ACMI’s most-probable backbone model and simply attach the best- acid’s C location: that is, the probability distribution taking matching sidechain from a library of conformations to each of the into account the position of all other amino acids. Our previous model’s C positions. In Section 4, we show that such a method works work shows that ACMI produces more accurate backbone traces reasonably well. Another issue is that the marginal distributions are than alternative approaches (DiMaio et al., 2006). Also, ACMI computed on a grid, which may lead to non-physical distances between is less prone to missing pieces in the model, because locations residues when C’s are placed on the nearest grid points. Additionally, of amino acids not visible in the density map are inferred from ACMI’s inference is approximate and errors due to these approximations the locations of neighboring residues. 
2.3 Other approaches

Several methods have been developed to automatically interpret electron-density maps. Given high-quality data (up to about 2.3 Å resolution), one widely used algorithm is ARP/wARP (Morris et al., 2003). This atom-based method heuristically places atoms in the map, connects them and refines their positions. To handle poor phasing, ARP/wARP alternates steps in which (a) a model is built based on a density map and (b) the map is improved using phases from the iteratively refined model.

Other methods have been developed to handle low-resolution density maps, where atom-based approaches like ARP/wARP fail to produce a reasonable model. Ioerger's TEXTAL (Ioerger and Sacchettini, 2003) and CAPRA (Ioerger and Sacchettini, 2002) interpret poor-resolution density maps using ideas from pattern recognition. Given a sphere of density from an uninterpreted density map, both employ a set of rotation-invariant statistical features to aid in interpretation. CAPRA uses a trained neural network to identify Cα locations. TEXTAL performs a rotational search to place sidechains, using the rotation-invariant features to identify sidechain types. RESOLVE's automated model-building routine (Terwilliger, 2002) uses a hierarchical procedure in which helices and strands are located by an exhaustive search. High-scoring matches are extended iteratively using a library of tripeptides; these growing chains are merged using a heuristic. BUCCANEER (Cowtan, 2006) is a newer probabilistic approach to interpreting poor-quality maps; currently, the algorithm only constructs a main-chain trace.

At lower resolution and with greater phase error, these methods have difficulty in chain tracing and especially in correctly identifying amino acids. Unlike ACMI's model-based approach, they first build a backbone model and then align the protein sequence to it. At low resolutions, this alignment often fails, resulting in the inability to correctly identify sidechain types. These approaches also tend to produce disjointed chains in poor-resolution maps, which require significant human labor to repair.

3 METHODS

For each amino acid i, ACMI's probabilistic inference returns the marginal probability distribution p̂_i(b_i) of that amino acid's Cα position. Previously, we computed the backbone trace B = {b_1, ..., b_N} (where b_i describes the position and rotation of amino acid i) as the position of each Cα that maximized ACMI's belief,

    b_i = \arg\max_{b_i} \hat{p}_i(b_i)    (2)

One obvious shortcoming of this previous approach is that biologists are interested in not just the position b_i of each Cα, but in the location of every (non-hydrogen) atom in the protein. Naïvely, we could take ACMI's most-probable backbone model and simply attach the best-matching sidechain from a library of conformations to each of the model's Cα positions; in Section 4, we show that such a method works reasonably well. Another issue is that the marginal distributions are computed on a grid, which may lead to non-physical distances between residues when Cα's are placed on the nearest grid points. Additionally, ACMI's inference is approximate, and errors due to these approximations may produce an incorrect backbone trace, with two adjacent residues located on opposite sides of the map.

Another problem arises from using a 'maximum-marginal backbone trace', i.e. independently selecting the position of each residue to maximize its marginal. A density map that contains a mixture of (physically feasible) protein conformations may have a maximum-marginal conformation that is physically unrealistic. Representing each amino acid's position as a distribution over the map is very expressive; simply returning the Cα position that maximizes the marginal ignores a lot of information. This section details the application of particle filtering to 'explain' the density map using multiple, physically feasible models.

3.1 Particle-filtering overview

We will use a particle-filtering method called sampling importance resampling (SIR) (Arulampalam et al., 2001; Doucet et al., 2000), which approximates the posterior probability distribution of a state sequence x_{1:K} = {x_1, ..., x_K} given observations y_{1:K} as the weighted sum of a finite number of point estimates x_{1:K}^(i):

    p(x_{1:K} \mid y_{1:K}) \approx \sum_i w_i \, \delta\!\left(x_{1:K} - x^{(i)}_{1:K}\right)    (3)

Here, i is the particle index, w_i is particle i's weight, K is the number of states (here the number of amino acids) and δ is the Dirac delta function. In our application, x_k describes the position of every non-hydrogen atom in amino acid k; y_k is a 3D region of density in the map.

In our work, the technical term 'particle' refers to one specific 3D layout of all the non-hydrogen atoms in a contiguous subsequence of the protein (e.g. from amino acid 21 to 25). PF represents the distribution of some subsequence's layout using a set of distinct layouts for that subsequence (in other words, what we are doing is illustrated in Fig. 1, where each protein model is a single particle). At each iteration of particle filtering, we advance the extent of each particle by one amino acid. For example, given x_{21:25} = {x_21, ..., x_25}, the positions of all atoms in amino acids 21 through 25 (we will use this shorthand notation for a particle throughout the article), PF samples the position of the next amino acid, in this case x_26. Ideally, particle filtering would sample these positions from the posterior distribution: the probability of x_26's layout given the current particle and the map. SIR is based on the assumption that this posterior is difficult to sample directly, but easy to evaluate (up to proportionality). Given some other function (called the importance function) that approximates the posterior, particle filtering samples from this function instead, then uses the ratio of the posterior to the importance function to reweight the particles.

To give an example of an importance function, particle-filtering applications often use the prior conditional distribution p(x_k | x_{k−1}) as the importance function. After sampling, the data y_k is used to weight each particle. In our application, this is analogous to placing an amino acid's atoms using only the layout of the previous amino acid, then reweighting by how well it fits the density map.

We use a particle resampling step to address the problem of degeneracy in the particle ensemble (Kong et al., 1994). As particles are extended, the variance of the particle weights tends to increase, until there are few particles with non-negligible weights and much effort is spent updating particles with little or no weight. To ameliorate this problem, an optional resampling step samples (with replacement) a new set of N particles at each iteration, with the probability of selecting a particle proportional to its weight. This ensures most particles remain on high-likelihood trajectories in state space.

What makes SIR (and particle-filtering methods in general) different from Markov chain Monte Carlo (MCMC) is that MCMC is concerned with the stationary distribution of the Markov chain. In particle filtering, one is not concerned with convergence of the point estimates; rather, the distribution is simply modeled by the ensemble of particles, whether or not they converge.
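To make the SIR recipe of Equation (3) concrete, the following generic skeleton (a sketch, not ACMI-PF itself) extends N particles one state at a time: each particle samples its next state from an importance function, is reweighted toward the posterior, and the ensemble is resampled when the weights degenerate (resampling at every iteration, as described above, is an equally valid choice). The propose and weight callables are placeholders for a problem-specific importance function and weight update.

    import numpy as np

    def sir_filter(n_particles, n_steps, init_state, propose, weight, rng=None):
        """Generic sampling importance resampling (SIR) skeleton.

        init_state()                  -> initial state for one particle
        propose(state, k, rng)        -> next state drawn from the importance function
        weight(state, next_state, k)  -> incremental weight (target / importance ratio)
        """
        rng = rng or np.random.default_rng()
        particles = [[init_state()] for _ in range(n_particles)]
        weights = np.full(n_particles, 1.0 / n_particles)

        for k in range(n_steps):
            for i in range(n_particles):
                nxt = propose(particles[i][-1], k, rng)          # sample from the importance fn
                weights[i] *= weight(particles[i][-1], nxt, k)   # reweight toward the posterior
                particles[i].append(nxt)
            total = weights.sum()
            weights = weights / total if total > 0 else np.full(n_particles, 1.0 / n_particles)

            # resample (with replacement) when the effective sample size collapses
            if 1.0 / np.sum(weights ** 2) < 0.5 * n_particles:
                idx = rng.choice(n_particles, size=n_particles, p=weights)
                particles = [list(particles[j]) for j in idx]
                weights = np.full(n_particles, 1.0 / n_particles)

        return particles, weights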
3.2 Protein particle model

An overview of our entire algorithm appears in Algorithm 1. For density-map interpretation, we use the variable x_k to denote the position of every atom in amino acid k. We want to find the complete (all-atom) protein model x_{1:K} that best explains the observed electron-density map y. To simplify, we parameterize x_k as a Cα location b_k [the same as b_i in Equation (2)] and a sidechain placement s_k. The sidechain placement identifies the 3D location of every non-hydrogen sidechain atom in amino acid k, as well as the positions of the backbone atoms C, N and O.

Given this parameterization, the Markov process alternates between placing (a) Cα positions and (b) sidechain atoms. That is, an iteration of particle filtering first samples b_{k+1} (the Cα of amino acid k+1) given b_k; alternatively, growing our particle toward the N-terminus would sample b_{k−1} given b_k. Then, given the triple b_{k−1:k+1}, we sample the sidechain conformation s_k.

Algorithm 1: ACMI-PF's algorithm for growing a protein model

    input:  density map y, amino-acid marginals p̂_k(b_k)
    output: set of protein models x_{1:K}^(i) and weights w_K^(i)

    // start at some amino acid with high certainty about its location
    choose k such that p̂_k(b_k) has minimum entropy
    foreach particle i = 1 ... N do
        choose b_k^(i) at random from p̂_k(b_k)
        w_k^(i) ← 1/N
    end
    foreach residue k do
        foreach particle i = 1 ... N do
            // choose b_{k+1} (or b_{k-1}) given b_k^(i)
            {b*_{k+1}^m} ← choose M samples from ψ_adj(b_{k+1}, b_k^(i))
            w*_{k+1}^m ← belief p̂_{k+1}(b*_{k+1}^m)
            b_{k+1}^(i) ← choose b*_{k+1}^m with probability ∝ w*_{k+1}^m
            w_{k+1}^(i) ← w_k^(i) · Σ_{m=1..M} w*_{k+1}^m
            // choose s_k given b_{k-1:k+1}^(i)
            {s*_k^l} ← sidechain conformations for amino acid k
            p_null^l ← probability that cc(s*_k^l, EDM[b_k]) occurred by chance
            s_k ← choose s*_k^l with probability ∝ 1/p_null^l − 1
            w_{k+1}^(i) ← w_{k+1}^(i) · Σ_{l=1..L} (1/p_null^l − 1)
        end
    end

3.2.1 Using ACMI-computed marginals to place Cα's

In our algorithm's backbone step we want to sample the Cα position b_{k+1} (or b_{j−1}), given our growing trace b_{j:k}^(i), for each particle i. That is, we want to define a sampling function q(b_{k+1} | b_j^(i), ..., b_k^(i), y). Doucet et al. (2000) define the optimal sampling function as the conditional marginal distribution

    q(b_{k+1} \mid b^{(i)}_j, \ldots, b^{(i)}_k, y) = p(b_{k+1} \mid b^{(i)}_k, y_{k+1})    (4)

While it is intractable to compute Equation (4) exactly, it is straightforward to estimate using ACMI's Markov field model:

    p(b_{k+1} \mid b^{(i)}_k, y_{k+1}) \propto p(b_{k+1}, b^{(i)}_k \mid y) / p(b^{(i)}_k \mid y)
                                       = \hat{p}_{k+1}(b_{k+1}) \, \psi_{\mathrm{adj}}(b_{k+1}, b^{(i)}_k)    (5)

Here, p̂_{k+1}(b_{k+1}) is the ACMI-computed marginal distribution for amino acid k+1 (its dependence on y is dropped for clarity). We sample Cα k+1's location from the product of (a) residue k+1's marginal distribution and (b) the adjacency potential between Cα k and Cα k+1.
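The backbone-extension step can be sketched compactly for a single particle: draw M candidate Cα positions from the adjacency potential, weight each by ACMI's marginal, pick one candidate in proportion to those weights [approximating Equation (5)], and multiply the particle weight by the sum of the candidate weights, as Algorithm 1 does. The sketch below assumes hypothetical helpers sample_adjacent (draws from ψ_adj given b_k) and marginal (evaluates p̂_{k+1}); it is an illustration, not ACMI-PF's code.

    import numpy as np

    def extend_backbone(b_k, particle_weight, sample_adjacent, marginal, M, rng=None):
        """Sample b_{k+1} for one particle; return (b_{k+1}, updated particle weight).

        sample_adjacent(b_k, rng) -> candidate Calpha position drawn from psi_adj(., b_k)
        marginal(pos)             -> ACMI's approximate marginal p-hat_{k+1}(pos)
        """
        rng = rng or np.random.default_rng()
        candidates = [sample_adjacent(b_k, rng) for _ in range(M)]   # M proposals from psi_adj
        weights = np.array([marginal(pos) for pos in candidates])    # scored by the marginal
        total = float(weights.sum())
        if total <= 0.0:                                             # no map support at all
            return candidates[rng.integers(M)], 0.0
        chosen = candidates[rng.choice(M, p=weights / total)]        # approximates Equation (5)
        return chosen, particle_weight * total                       # weight times sum of weights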
The optimal sampling function has a corresponding weight update

    w^{(i)}_{k+1} \propto w^{(i)}_k \int p(y_{k+1} \mid b_{k+1}, b^{(i)}_k) \, db_{k+1}    (6)

This integral, too, is intractable to compute exactly, but can be approximated using ACMI's marginals

    w^{(i)}_{k+1} \propto w^{(i)}_k \int \hat{p}_{k+1}(b_{k+1}) \, \psi_{\mathrm{adj}}(b_{k+1}, b^{(i)}_k) \, db_{k+1}    (7)

Equations (5) and (7) suggest a sampling approach to the problem of choosing the location of Cα k+1 and reweighting each particle. This sampling approach, shown in Algorithm 1, is illustrated pictorially in Figure 4. We sample M potential Cα locations from ψ_adj(b_{k+1}, b_k^(i)), the adjacency potential between residues k and k+1, which models the allowable conformations of two adjacent Cα's. We assign each sample a weight: the approximate marginal probability p̂_{k+1} at each of the sampled locations. We select a sample from this weighted distribution, approximating Equation (5). Finally, we reweight our particle by the sum of the weights of all the samples we considered; this sum approximates the integral in Equation (7).

Fig. 4. An overview of the backbone forward-sampling step. Given positions b_{k−1} and b_k, we sample M positions for b_{k+1} using the empirically derived distribution of Cα–Cα–Cα pseudoangles. Each potential b*_{k+1}^m is weighted by the belief p̂(b_{k+1}|y). We choose a single location from this distribution; the particle weight is multiplied by the sum of these weights in order to approximate Equation (6).

This process, in which we consider M potential Cα locations, is repeated for every particle in the particle filter, for each Cα in the protein. For every particle, we begin by sampling locations for the amino acid k whose marginal distribution has the lowest entropy (we use a soft minimum to introduce randomness into the order in which amino acids are placed). This corresponds to the amino acid whose location ACMI is most sure of. The direction we sample at each iteration (i.e. toward the N- or C-terminus) is also decided by the entropy of the marginal distributions.
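A small sketch of one way to realize this entropy-guided, soft-minimum choice of starting residue, assuming each ACMI marginal is available as a discretized, normalized probability array; the temperature parameter is an assumption introduced here, not a value from the paper.

    import numpy as np

    def pick_start_residue(marginals, temperature=0.1, rng=None):
        """Stochastically prefer the residue whose marginal has the lowest entropy.

        marginals : list of 1D (flattened) probability arrays, one per residue
        """
        rng = rng or np.random.default_rng()
        entropies = np.array([-(p * np.log(p + 1e-12)).sum() for p in marginals])
        # soft minimum: lower entropy -> exponentially higher chance of being chosen first
        scores = np.exp(-(entropies - entropies.min()) / temperature)
        return rng.choice(len(marginals), p=scores / scores.sum())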
3.2.2 Using sidechain templates to sample sidechains

Once our particle filter has placed Cα's k−1, k and k+1 at 3D locations b_{k−1:k+1}^(i), it is ready to place all the sidechain atoms in amino acid k. We denote the position of these sidechain atoms s_k. Given the primary amino acid sequence around k, we consider all previously observed conformations (i.e. those in the PDB) of sidechain k. Thus, s_k consists of (a) an index into a database of known sidechain 3D structures and (b) a rotation. To further simplify, each sidechain template models the position of every atom from Cα_{k−1} to Cα_{k+1}. Then, given three consecutive backbone positions b_{k−1:k+1}, the orientation of sidechain s_k is determined by aligning the three Cα's in the sidechain template to b_{k−1:k+1}^(i).

As Algorithm 1 shows, sidechain placement is quite similar to the Cα placement of the previous section. One key difference is that sidechain placement cannot take advantage of ACMI's marginal distributions, as ACMI's probability distributions have marginalized away sidechain conformations. Instead, the probability of a sidechain is calculated on-the-fly using the correlation coefficient between a potential conformation's calculated density and a region around b_k in the density map.

Figure 5 illustrates the process of choosing a sidechain conformation for a single particle i. We consider each of L different sidechain conformations for amino acid k. For each sidechain conformation s*_k^l, l ∈ {1, ..., L}, we compute the correlation coefficient between the conformation and the map

    CC^l_k = \mathrm{cross\_correlation}\!\left(s^{*l}_k, \mathrm{EDM}[b^{(i)}_k]\right)

where EDM[b_k] denotes a region of density in the neighborhood of b_k. To assign a probability p(EDM[b_k^(i)] | s_k) to each sidechain conformation, we compute the probability that a cross-correlation value was not generated by chance. That is, we assume that the cross-correlation of two random functions is normally distributed with mean μ and variance σ². We learn these parameters by computing correlation coefficients between randomly sampled locations in the map. Given some cross-correlation x_c, we compute the expected probability that we would see a score of x_c or higher by random chance,

    p_{\mathrm{null}}(x_c) = P(X \ge x_c; \mu, \sigma) = 1 - \Phi\!\left((x_c - \mu)/\sigma\right)    (8)

Here, Φ(x) is the normal cumulative distribution function. The probability of a particular sidechain conformation is then

    p(\mathrm{EDM}[b^{(i)}_k] \mid s^{*l}_k) \propto (1/p_{\mathrm{null}}) - 1    (9)

Since we are drawing sidechain conformations from the distribution of all solved structures, we assume a uniform prior distribution on sidechain conformations, so p(s*_k^l | EDM[b_k^(i)]) ∝ p(EDM[b_k^(i)] | s*_k^l).

Fig. 5. An overview of the sidechain sampling step. Given positions b_{k−1:k+1}, we consider L sidechain conformations s*_k^l. Each potential conformation is weighted by the probability of the map given the sidechain conformation, as given in Equation (9). We choose a sidechain from this distribution; the particle weight is multiplied by the sum of these weights.

As illustrated in Figure 5, sidechain sampling uses a similar method to the previous section's backbone sampling. We consider extending our particle by each of the L sidechain conformations {s*_k^1, ..., s*_k^L} sampled from our sidechain database. After computing the correlation between each sidechain conformation's density and the density map around b_k, each conformation is weighted using Equation (9). We choose a single conformation at random from this weighted distribution, updating each particle's weight by the sum of the weights of all considered sidechain conformations.

Finally, our model takes into account the partial model x_{j:k−1} when placing sidechain s_k. If any atom in sidechain s_k overlaps a previously placed atom or any symmetric copy, the particle weight is set to zero.
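A sketch of the sidechain-weighting step of Equations (8) and (9): the cross-correlation between a conformation's calculated density and the map region around b_k is converted to the probability of seeing that score by chance under a normal null model, and each conformation is weighted by 1/p_null − 1. The null parameters μ and σ are assumed to have been estimated beforehand from correlations at random map locations; the helper is illustrative rather than ACMI-PF's implementation.

    import numpy as np
    from scipy.stats import norm

    def sidechain_weights(conf_densities, map_region, mu_null, sigma_null):
        """Weight candidate sidechain conformations against a density-map region.

        conf_densities : list of 1D arrays, calculated density for each conformation,
                         sampled on the same grid points as map_region
        map_region     : 1D array, observed density EDM[b_k] around the Calpha
        mu_null, sigma_null : normal null-model parameters learned from random locations
        """
        weights = []
        for calc in conf_densities:
            cc = np.corrcoef(calc, map_region)[0, 1]                 # cross-correlation CC_k^l
            p_null = 1.0 - norm.cdf((cc - mu_null) / sigma_null)     # Equation (8)
            p_null = max(p_null, 1e-12)                              # guard against division by zero
            weights.append(1.0 / p_null - 1.0)                       # Equation (9), up to a constant
        return np.array(weights)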
3.3 Crystallographic data

Ten experimentally phased electron-density maps, provided by the Center for Eukaryotic Structural Genomics (CESG) at UW-Madison, have been used to test ACMI-PF. The maps were initially phased using autoSHARP (Terwilliger, 2002), with non-crystallographic symmetry averaging and solvent flattening (in RESOLVE) used to improve the map quality where possible. The 10 maps were selected as the 'most difficult' from a larger dataset of 20 maps provided by CESG. These structures have been previously solved and deposited to the PDB, enabling a direct comparison with the final refined model. All 10 required a great deal of human effort to build and refine the final atomic model.

The data are summarized in Table 1, with quality described by the resolution and phase error. The resolution from the initial phasing may not have reached the resolution limit of the dataset. Initial low-resolution phasing was computationally extended in three structures (using an algorithm in RESOLVE). Mean phase error was computed using CCP4 (Collaborative Computational Project, 1994) by comparing calculated phases from the deposited model with those in the initially phased dataset.

Table 1. Summary of crystallographic data

PDB ID    AAs in ASU    Molecules in ASU    Resolution (Å)    Phase error (°)(a)
2NXF      322           1                   1.9               58
2Q7A      316           2                   2.6               49
XXXX(d)   566           2                   2.65              54
1XRI      430           2                   3.3               39
1ZTP      753           3                   2.5               42
1Y0Z      660           2                   2.4 (3.7)(c)      58
2A3Q      340           2                   2.3 (3.5)(c)      66
2IFU      1220          4                   3.5               50
2BDU      594           2                   2.35              55
2AB1      244           2                   2.6 (4.0)(c)      66

(a) Averaged over all resolution shells.
(b) Different dataset was used to solve the PDB structure.
(c) Phasing was extended from lower resolution.
(d) PDB file not yet released.

3.4 Computational methodology

Models in ACMI-PF are built in three phases: (a) prior distributions are computed, (b) ACMI infers posterior distributions for each Cα location and (c) all-atom models are constructed using particle filtering. Where available, ACMI used the location of selenium atom peaks as a soft constraint on the positions of methionine residues. Particle filtering was run 10 times; in each run, the single highest-weight model was returned, producing a total of 10 ACMI-PF protein models. Predicted models were refined for 10 iterations using REFMAC5 (Murshudov et al., 1997) with no modification or added solvent. The first step is the most computationally expensive, but is efficiently divided across multiple processors. Computation time varied depending on protein size; the entire process took at most a week of CPU time on 10 processors.

We compare ACMI-PF to four different approaches using the same 10 density maps. To test the utility of the particle-filtering method for building all-atom models, we use the structure that results from independently placing the best-matching sidechain on each Cα predicted by ACMI, which we term ACMI-NAIVE. The other three approaches are the commonly used density-map interpretation algorithms ARP/wARP (version 7), TEXTAL (in PHENIX version 1.1a) and RESOLVE (version 2.10). Refinement for all algorithms uses the same protocol as ACMI-PF, refining the predicted models for 10 iterations in REFMAC5 (ARP/wARP, which integrates refinement and model building, was not further refined).

To assess the prediction quality of each algorithm, we consider three performance metrics: (a) backbone completeness, (b) sidechain identification and (c) R factor. The first metric compares the predicted model to the deposited model, counting the fraction of Cα's placed within 2 Å of some Cα in the PDB-deposited model. The second measure counts the fraction of Cα's both correctly placed within 2 Å and whose sidechain type matches the PDB-deposited structure. Finally, the R factor measures the deviation between the reflection intensities predicted by the model and those experimentally measured; a lower R factor indicates a better model. The R factor is computed using only peptide atoms (i.e. no added water molecules). The comparison uses the so-called free R factor (Brunger, 1992), which is based on reflections that were not used in refinement.
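A sketch of one reading of the backbone-completeness metric, computed here as the fraction of deposited Cα positions that have a predicted Cα within 2 Å (sidechain identification additionally requires the residue types to match). Crystallographic symmetry, which a full evaluation would need to handle, is ignored; this is an illustration, not the evaluation script used in the paper.

    import numpy as np

    def backbone_completeness(predicted_ca, deposited_ca, cutoff=2.0):
        """Fraction of deposited Calpha atoms matched by a predicted Calpha within `cutoff` Angstroms.

        predicted_ca : (P, 3) array of predicted Calpha coordinates
        deposited_ca : (D, 3) array of Calpha coordinates from the deposited PDB model
        """
        pred = np.asarray(predicted_ca, dtype=float)
        dep = np.asarray(deposited_ca, dtype=float)
        # pairwise distances between every deposited and every predicted Calpha
        dists = np.linalg.norm(dep[:, None, :] - pred[None, :, :], axis=-1)
        return float(np.mean(dists.min(axis=1) <= cutoff))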
averaging and solvent flattening (in RESOLVE) used to improve the map There is evidence that a single conformation is insufficient to quality where possible. The 10 maps were selected as the ‘most difficult’ model protein electron density (Burling and Brunger, 1994; from a larger dataset of 20 maps provided by CESG. These structures DePristo et al., 2004; Furnham et al., 2006; Levin et al., 2007). have been previously solved and deposited to the PDB, enabling a direct As comparison, we take ACMI-NAIVE, which uses the maximum- comparison with the final refined model. All 10 required a great deal of marginal trace to produce a single model. human effort to build and refine the final atomic model. We use ACMI-PF to generate multiple physically feasible The data are summarized in Table 1 with quality described by the models, by performing 10 different ACMI-PF runs of 100 resolution and phase error. The resolution from the initial phasing may not have reached the resolution limit of the dataset. Initial low- particles each. Each run sampled amino acids in a different resolution phasing was computationally extended in three structures order; amino acids whose belief had lowest entropy (i.e. those (using an algorithm in RESOLVE). Mean phase error was computed using we are most confident we know) were stochastically preferred. CCP4 (Collaborative Computational Project, 1994) by comparing Figure 6 summarizes the results. The y axis shows the average calculated phases from the deposited model with those in the initially (over the 10 maps) R of the final refined model; the x axis free phased dataset. indicates the number of ACMI-PF runs. This plot shows that a single ACMI-PF model has an R approximately equal to free 3.4 Computational methodology the R of ACMI-NAIVE. Model completeness is also very close free between the two (data not shown). As additional structures Models in ACMI-PF are built in three phases: (a) prior distributions are are added ACMI-PF’s model, average R decreases. The plot computed, (b) ACMI infers posterior distributions for each C location free and (c) all-atom models are constructed using particle filtering. Where shows ACMI-NAIVE’s model as a straight line, since there is no 2856 Creating protein models 0.50 100% % backbone placed % sidechains identified 80% 0.40 60% 0.30 40% Acmi-PF Acmi-Naive 20% 0.20 12 34 56789 10 0% Number of structures in model ACMI-PF ARP/wARP Resolve Textal Fig. 6. A comparison of the R of ACMI-NAIVE and ACMI-PF, as the free Fig. 7. A comparison of ACMI-PF to other automatic interpretation number of protein models produced varies. Multiple models are methods in terms of average backbone completeness and sidechain produced by independent ACMI-PF runs (ACMI-NAIVE only produces a identification. single model). Since R in deposited structures is typically 0.20–0.25, free we use 0.20 as the lowest value on the y axis. Scatterplots in Figure 8 compare the R of ACMI-PF’s free mechanism to generate multiple conformations. We believe a complete (10-structure) model to each of the three alternative key reason for this result is that particle filtering occasionally approaches, for each density map. Any point below the makes mistakes when tracing the main chain, but it is unlikely diagonal corresponds to a map for which ACMI-PF’s solution for multiple PF runs to repeat the same mistake. The mistakes has a lower (i.e. better) R . These plots show that for all free average out in the ensemble, producing a lower R factor. 
4.2 Comparison to other algorithms

We further compare the models produced by particle filtering on the 10 maps to those produced by three other methods for automatic density-map interpretation, including two well-established lower-resolution algorithms, TEXTAL and RESOLVE, and the atom-based ARP/wARP (although most of our maps are outside of its recommended resolution).

Figure 7 compares all four methods in terms of backbone completeness and sidechain identification, averaged over all 10 structures. To provide a fair comparison, we compute the completeness of a single ACMI-PF structure (of the 10 produced); the ACMI-PF model chosen was the one with the lowest refined R_work. Under both of these metrics, ACMI-PF locates a greater fraction of the protein than the other approaches. ACMI-PF performs particularly well at sidechain identification, correctly identifying close to 80% of sidechains over these 10 poor-quality maps. The least accurate model that ACMI-PF generated (for 2AB1) had 62% backbone completeness and 55% sidechain identification. In contrast, the three comparison methods all return at least five structures with <40% backbone completeness and at least eight structures with <20% sidechain identification.

Fig. 7. A comparison of ACMI-PF to other automatic interpretation methods in terms of average backbone completeness and sidechain identification.

Scatterplots in Figure 8 compare the R_free of ACMI-PF's complete (10-structure) model to each of the three alternative approaches, for each density map. Any point below the diagonal corresponds to a map for which ACMI-PF's solution has a lower (i.e. better) R_free. These plots show that for all but one map ACMI-PF's solution has the lowest R factor. The single exception, for which ARP/wARP has a lower R factor, is 2NXF, a high-resolution (1.9 Å) but poorly phased density map in which ARP/wARP automatically traces 90%, while ACMI-PF's best model correctly predicts only 74%. Our results illustrate both the limitations and the advantages of ACMI-PF: it is consistently superior at interpreting poorly phased, lower-resolution maps, while an iterative phase-improvement algorithm like ARP/wARP may be better suited for poorly phased but higher-resolution data.

Fig. 8. A comparison of the free R factor of ACMI-PF's interpretation for each of the 10 maps versus (a) ARP/wARP, (b) TEXTAL and (c) RESOLVE. The scatterplots show each interpreted map as a point, where the x axis measures the R_free of ACMI-PF and the y axis that of the alternative approach.
5 CONCLUSION

We develop ACMI-PF, an algorithm that uses particle filtering to produce a set of all-atom protein models for a given density map. Particle filtering grows an ensemble of all-atom protein models stepwise. The method builds on our previous work, in which we infer a probability distribution over each amino acid's Cα location. ACMI-PF addresses shortcomings of that previous work, producing a set of physically feasible protein structures that best explain the density map. Our results indicate that ACMI-PF generates more accurate and more complete models than other state-of-the-art automated interpretation methods for poor-resolution density-map data. ACMI-PF produces accurate interpretations, on average finding and identifying 80% of the protein structure in poorly phased 2.5–3.5 Å resolution maps.

Using ACMI-PF, an ensemble of conformations may be easily generated using multiple runs of particle filtering. We show that sets of multiple structures generated from multiple particle-filtering runs fit the density map better than a single structure. This is consistent with recent observations of the inadequacy of the single-model paradigm for modeling flexible protein molecules (Burling and Brunger, 1994; DePristo et al., 2004; Furnham et al., 2006) and with the encouraging results of the ensemble-refinement approach (Levin et al., 2007). The ensemble description may also provide valuable information about protein conformational dynamics. As well, multiple conformations may be valuable for application of ACMI-PF in an iterative approach, where computed phases from an ACMI-PF model are used to build an updated density map, which is fed back into the ACMI pipeline.

ACMI-PF's model-based approach is very flexible and allows the integration of multiple sources of 'fuzzy' information, such as the locations of selenium peaks. In the future, it may be productive to integrate other sources of information into our model. A more complicated reweighting function based on a physical or statistical energy could better overcome the ambiguities of unclear regions in the density map. The inclusion of these and other sources of information is possible, so long as they can be expressed in the probabilistic framework proposed here. This could further extend the resolution at which automated interpretation of density maps is possible.
ACKNOWLEDGEMENTS

We acknowledge support from NLM T15-LM007359 (F.D., A.S., D.A.K.), NLM R01-LM008796 (F.D., J.W.S., G.N.P., D.A.K.) and NIH Protein Structure Initiative Grant GM074901 (E.B., C.A.B., G.N.P.).

Conflict of Interest: none declared.

REFERENCES

Arulampalam,M.S. et al. (2001) A tutorial on particle filters. IEEE Trans. Signal Process., 50, 174–188.
Berman,H.M. and Westbrook,J.D. (2004) The impact of structural genomics on the protein data bank. Am. J. Pharmacogenomics, 4, 247–252.
Brunger,A.T. (1992) Free R value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature, 355, 472–475.
Burling,F.T. and Brunger,A.T. (1994) Thermal motion and conformational disorder in protein crystal structures – comparison of multi-conformer and time-averaging models. Isr. J. Chem., 34, 165–175.
Chandonia,J.M. and Brenner,S.E. (2006) The impact of structural genomics: expectations and outcomes. Science, 311, 347–351.
Collaborative Computational Project, Number 4 (1994) The CCP4 suite: programs for protein crystallography. Acta Crystallogr., D50, 760–763.
Cowtan,K. (2006) The Buccaneer software for automated model building. 1. Tracing protein chains. Acta Crystallogr., D62, 1002–1011.
DePristo,M.A. et al. (2004) Heterogeneity and inaccuracy in protein structures solved by X-ray crystallography. Structure, 12, 911–917.
DiMaio,F. et al. (2006) A probabilistic approach to protein backbone tracing in electron-density maps. Bioinformatics, 22, e81–e89.
DiMaio,F. et al. (2007) Improved methods for template-matching in electron-density maps using spherical harmonics. In Proceedings of the IEEE Conference on Bioinformatics and Biomedicine. IEEE Press, Fremont, CA.
Doucet,A. et al. (2000) On sequential Monte Carlo sampling methods for Bayesian filtering. Stat. Comput., 10, 197–208.
Furnham,N. et al. (2006) Is one solution good enough? Nat. Struct. Mol. Biol., 13, 184–185.
Geman,S. and Geman,D. (1984) Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. PAMI, 6, 721–741.
Ioerger,T.R. and Sacchettini,J.C. (2002) Automatic modeling of protein backbones in electron density maps. Acta Crystallogr., D58, 2043–2054.
Ioerger,T.R. and Sacchettini,J.C. (2003) The TEXTAL system: artificial intelligence techniques for automated protein model building. Methods Enzymol., 374, 244–270.
Kong,A. et al. (1994) Sequential imputations and Bayesian missing data problems. J. Am. Stat. Assoc., 89, 278–288.
Levin,E.J. et al. (2007) Ensemble refinement of protein crystal structures. Structure, 15, 1040–1052.
Morris,R. et al. (2003) ARP/wARP and automatic interpretation of protein electron density maps. Methods Enzymol., 374, 229–244.
Murshudov,G.N. et al. (1997) Refinement of macromolecular structures by the maximum-likelihood method. Acta Crystallogr., D53, 240–255.
Sawasaki,T. et al. (2002) A cell-free protein synthesis system for high-throughput proteomics. Proc. Natl Acad. Sci. USA, 99, 14652–14657.
Snell,G. et al. (2004) Automated sample mounting and alignment system for biological crystallography at a synchrotron source. Structure, 12, 537–545.
Terwilliger,T.C. (2002) Automated main-chain model building by template-matching and iterative fragment extension. Acta Crystallogr., D59, 38–44.
Wang,G. and Dunbrack,R.L. (2003) PISCES: a protein sequence culling server. Bioinformatics, 19, 1589–1591.
