Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs)

Background: The prediction of biochemical function from the 3D structure of a protein has proved to be much more difficult than was originally foreseen. A reliable method to test the likelihood of putative annotations and to predict function from structure would add tremendous value to structural genomics data. We report on a new method, Structurally Aligned Local Sites of Activity (SALSA), for the prediction of biochemical function based on a local structural match at the predicted catalytic or binding site.

Results: Implementation of the SALSA method is described. For the structural genomics protein PY01515 (PDB ID 2aqw) from Plasmodium yoelii, it is shown that the putative annotation, Orotidine 5’-monophosphate decarboxylase (OMPDC), is most likely correct. SALSA analysis of YP_001304206.1 (PDB ID 3h3l), a putative sugar hydrolase from Parabacteroides distasonis, shows that its active site does not bear close resemblance to any previously characterized member of its superfamily, the Concanavalin A-like lectins/glucanases. It is noted that three residues in the active site of the thermophilic beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID 1m4w), Y78, E87, and E176, overlap with POOL-predicted residues of similar type, Y168, D153, and E232, in YP_001304206.1. The substrate recognition regions of the two proteins are rather different, suggesting that YP_001304206.1 is a new functional type within the superfamily. A structural genomics protein from Mycobacterium avium (PDB ID 3q1t) has been reported to be an enoyl-CoA hydratase (ECH), but SALSA analysis shows a poor match between the predicted residues for the SG protein and those of known ECHs. A better local structural match is obtained with Anabaena beta-diketone hydrolase (ABDH), a known β-diketone hydrolase from Cyanobacterium anabaena (PDB ID 2j5s).
This suggests that the reported ECH function of the SG protein is incorrect and that it is more likely a β-diketone hydrolase.

Conclusions: A local site match provides a more compelling function prediction than that obtainable from a simple 3D structure match. The present method can confirm putative annotations, identify misannotation, and in some cases suggest a more probable annotation.

* Correspondence: [email protected]
Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115 USA
Full list of author information is available at the end of the article

© 2013 Wang et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Wang et al. BMC Bioinformatics 2013, 14(Suppl 3):S13
http://www.biomedcentral.com/1471-2105/14/S3/S13

Background

There are currently over 11,000 structural genomics (SG) protein structures in the Protein Data Bank (PDB) [1] and most of them are of unknown or uncertain function, as the inference of function from structure has proved to be more difficult than anticipated. Furthermore, when new structures of unknown function are determined, it is common practice to make a tentative functional assignment from the closest sequence match or the best 3D structure match to an annotated protein. Such tentative functional assignments are often incorrect [2]. Moreover, one annotation error can propagate or “percolate” [2-4] in databases as additional proteins are annotated by automated or semi-automated means.

Overviews of current methods for the functional annotation of proteins from their sequence and/or structure have been given in recent reviews [5-8]. The simplest, and most commonly employed [6], methods seek the closest sequence matches using a search program such as BLAST [9], or alternatively the closest 3D structure match obtained from, e.g., Dali [10], Combinatorial Extension (CE) [11], or Topofit [12], and then simply transfer the function from the closest match to the query protein. However, even relatively high sequence similarity does not necessarily imply similar function [13]. Other types of sequence-based methods employ motif searching, phylogenetic profiling, or genome context. The Critical Assessment of Function Annotation (CAFA) experiment (http://biofunctionprediction.org/) seeks to assess the current state of the art of function prediction, chiefly from sequence. The aim of this work is to exploit structural information, together with computed chemical properties, to enhance function prediction capabilities.

It was hoped that SG would provide functional annotations for the protein products of newly-sequenced coding genes, as indeed the 3D structure can sometimes be indicative of function. Simple protein fold comparison does work in some cases, as domains having a common fold sometimes do have the same function. However, many folds have multiple functions. For instance, the Rossmann fold and the TIM barrel each represent more than 50 different functions. The use of local 3D structural motifs or templates, a feature of the present method, is now emerging as a more promising path for correct functional annotation from structure [14-19].

In spite of recent advances in protein function prediction, inference of biochemical function from the structure is difficult [20,21]. Hundreds of SG structures have no functional assignment at all and, for thousands of other SG proteins, the functional hypotheses are putative and uncertain. Not all such hypotheses will prove in time to be correct, as examples below will illustrate. The ability to determine function from the 3D structure would add great value to this growing volume of SG data.

A different approach to functional annotation from 3D structure is presented here and is based on the combination of functional site prediction with local 3D structural alignment. Functional site predictions are obtained from Partial Order Optimum Likelihood (POOL) [22,23], a monotonicity-constrained maximum likelihood method, using computed chemical, electrostatic, and geometric properties, as well as phylogenetic information (if available), as input features. POOL places all of the residues in the input protein structure into an ordered list, ranked according to probability of participation in the active site. The top-ranked residues constitute the active site prediction. Structural alignments are obtained for sets of these local sites. Characteristic spatial patterns of predicted residues at the structurally aligned local sites of activity (SALSAs) are then used to identify specific types of biochemical function. The quality of the match of the predicted functional site in the query protein to functional sites in proteins of known function is measured using a scoring function. The present method can determine whether a putative functional assignment is likely to be correct or incorrect. In some cases where a protein is shown to be misannotated, a probable functional assignment is made.

Methods

Functional residue predictions were made using POOL [22,23]. Input features for each residue in a given structure include: electrostatics information, as contained in THEMATICS metrics [24,25]; phylogenetic information from INTREPID [26,27]; and geometric information from ConCavity (structure only version) [28]. The top-ranked residues in the POOL output constitute the functional site prediction. Cut-off limits are specified for each case.

Multiple structure alignments are made for each set of proteins. The structural alignment of multiple structures of diverse function can be difficult and therefore multiple alignment methods [11,12,29] may be needed for some cases. In the examples shown here, T-Coffee [29] is used. For present purposes, a full alignment is not necessary. A quality alignment is only required in the local spatial region of the predicted active site.

SALSA tables are constructed for the locally aligned residues in the predicted active site. In a SALSA table, the rows represent individual protein structures and the columns represent spatially aligned positions.

Consensus signatures for a given functional subclass are established using POOL predictions on a set of previously characterized proteins with the same biochemical function, usually with common fold. To maximize sequence diversity in this reference set, sets of structures are sought with the lowest possible sequence identity among them. POOL-predicted residues of the same amino acid type in the same spatial position for the majority of the previously characterized proteins of common biochemical function then constitute the consensus signature for that functional group. The consensus signature for a given biochemical function thus consists of a series of amino acid types in specified spatial positions.

SG proteins of unknown or uncertain function are analyzed by POOL and the predictions are aligned with those of proteins of known function, or with the consensus signature.

Scoring the match between the predicted active site for the query protein and that of the consensus signature is performed using the BLOSUM62 matrix [30]. Scores are reported as a percentage of the maximum value (i.e. the score for the perfect match, the consensus signature with itself).
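The scoring step described in the Methods can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (signature_score, percent_of_max) are hypothetical, only the BLOSUM62 entries needed for the example are included, and residues are compared position by position without gaps. The example uses the seven residue types listed in Table 5 for the M. avium protein (3q1t) and for ABDH (2j5s). Taking the self-score of the 3q1t row as the maximum is an inference from the reported numbers (30, or 60% of the maximum), not something the text states explicitly.

```python
# Subset of the standard BLOSUM62 matrix: only the residue pairs used below.
# Keys are alphabetically sorted pairs of one-letter amino acid codes.
BLOSUM62_SUBSET = {
    ("D", "D"): 6, ("H", "H"): 8, ("E", "E"): 5, ("C", "C"): 9,
    ("C", "L"): -1, ("G", "H"): -2,
}


def pair_score(a, b):
    """Symmetric substitution score for two residue types.

    Raises KeyError for pairs outside the illustrative subset.
    """
    key = tuple(sorted((a.upper(), b.upper())))
    return BLOSUM62_SUBSET[key]


def signature_score(reference, query):
    """Sum of substitution scores over spatially aligned positions."""
    if len(reference) != len(query):
        raise ValueError("aligned signatures must have the same length")
    return sum(pair_score(r, q) for r, q in zip(reference, query))


def percent_of_max(reference, query):
    """Score as a percentage of the reference signature scored against itself."""
    max_score = signature_score(reference, reference)
    return 100.0 * signature_score(reference, query) / max_score


# Residue types from Table 5, in POOL rank order for 3q1t (assumed reference row).
sg_3q1t = ["D", "H", "E", "D", "H", "C", "H"]
abdh_2j5s = ["D", "H", "E", "D", "H", "L", "G"]

raw = signature_score(sg_3q1t, abdh_2j5s)
pct = percent_of_max(sg_3q1t, abdh_2j5s)
print(raw, round(pct))  # 30 60
```

A full implementation would load the complete matrix and would need an explicit rule for positions where the query has no aligned residue; neither detail is specified in the Methods.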
Results and discussion

Confirmation of annotation for PY01515, a putative Orotidine 5’-monophosphate decarboxylase (OMPDC)

Orotidine 5’-monophosphate decarboxylase (OMPDC) catalyzes one step in the pyrimidine biosynthesis pathway. It catalyzes the metal ion dependent decarboxylation of orotidine monophosphate (OMP) to uridine monophosphate (UMP) and CO2 [31,32]. OMPDC is a member of the ribulose phosphate binding barrel (RPBB) superfamily and has a TIM barrel [33] structure, with the active site located inside the beta barrel, spanning the eight beta strands. The structural genomics protein PY01515 (PDB ID 2aqw) is a putative OMPDC from Plasmodium yoelii [34].

The POOL-predicted functional site for PY01515 was aligned with eight different functional site types predicted by POOL for structures in the RPBB superfamily and a strong match was found with that of the OMPDCs and not with the other seven functional types. Five previously characterized OMPDC structures, those from Bacillus subtilis (PDB ID 1dbt), Methanothermobacter thermautotrophicus (PDB ID 1dvj), Saccharomyces cerevisiae (PDB ID 1dqw), Escherichia coli (PDB ID 1l2u), and Plasmodium falciparum (PDB ID 2za1), were used to establish the consensus signature of an OMPDC active site. These five previously characterized OMPDCs represent considerable sequence diversity, as shown in Table 1. With the exception of structures 1 and 4, which share sequence identity of 60%, all other pairs of structures have sequence identities in the 6% - 30% range.

Table 1 Sequence identity matrix for five previously characterized OMPDCs (structures 1-5) and the SG protein PY01515 (PDB ID 2aqw).

         PDB ID   1dbt    1dvj    1dqw    1l2u    2za1    2aqw
1        1dbt     -       0.240   0.260   0.600   0.060   0.240
2        1dvj     0.240   -       0.280   0.280   0.120   0.220
3        1dqw     0.260   0.280   -       0.300   0.080   0.280
4        1l2u     0.600   0.280   0.300   -       0.060   0.200
5        2za1     0.060   0.120   0.080   0.060   -       0.020
6 (SG)   2aqw     0.240   0.220   0.280   0.200   0.020   -

For the five previously characterized OMPDCs, the important residues are predicted using the top 9% of the residues, as ranked by POOL, for each protein structure. When these five predicted active sites are structurally aligned, eight spatial positions are found to have common predicted residues across the five diverse, previously characterized OMPDCs. Table 2 shows this local structural alignment. The rows in Table 2 represent individual protein structures, with the five previously characterized OMPDCs listed first; the last row is the query protein from SG. The columns represent spatially coincident positions in the local structural alignment. The residues predicted by POOL are shown in uppercase; residues in lowercase are not in the top 9% of the POOL rankings. The previously reported catalytic residues [35,36] are shown in boldface. Positions 1-8 are positions in the consensus prediction, i.e. similar residues are predicted by POOL for the majority of the previously characterized OMPDCs. The row above each position gives the beta strand on which that position is located. For positions 1-5, 7, and 8, an identical residue is predicted by POOL for all five previously characterized OMPDCs. At position 6, a histidine is predicted for four out of the five previously characterized OMPDCs. For the Plasmodium falciparum structure, there is an asparagine, not predicted by POOL, at position 6. The consensus signature may be abbreviated as (D, K, D, K, D, H, P, R). The combination of residue types at the eight positions shown in Table 2 is unique to OMPDC within the RPBB superfamily. For instance, the lysine in position 2 and the proline in position 7 are not observed in the equivalent positions for any of the seven other functional subclasses of the RPBB superfamily.

Table 2 Local structural alignment of the consensus signature residues for the OMPDCs.

Structurally aligned signature active site residues for OMPDC

Beta strand     β1    β2    β3    β3    β3    β4    β7    β8
Position        1     2     3     4     5     6     7     8
Protein  1dbt   D11   K33   D60   K62   D65   H88   P182  R215
         1dvj   D20   K42   D70   K72   D75   H98   P180  R203
         1dqw   D37   K59   D91   K93   D96   H122  P202  R235
         1l2u   D22   K44   D71   K73   D76   H99   P189  R222
         2za1   D23   K102  D136  K138  D141  n165  P264  R294
SG       2aqw   D23   K105  D139  K141  D144  n168  P267  R297

The first five rows represent previously-characterized OMPDCs. The sixth row is a putative OMPDC from Structural Genomics. The columns represent spatially coincident positions in the structural alignment; these positions are numbered 1-8. Known catalytic residues are shown in boldface. POOL-predicted residues are shown in uppercase; residues not predicted by POOL are shown in lowercase. The beta strand on which each position is located is given at the top of the column, above the position number. The good match between the SG protein and the known OMPDCs suggests common function.

The quality of a match with the consensus signature may be measured using a scoring matrix. Using the BLOSUM62 [30] matrix, the first four proteins listed in Table 2 have a score of 48 with the consensus signature; this score is 100% of the maximum value. The Plasmodium falciparum structure has a score of 39 (81% of the maximum value) against the consensus signature.

The structurally aligned residues for the SG protein PY01515 from Plasmodium yoelii are shown in the last row of Table 2. For seven out of the eight positions, POOL predicts residues that are identical to the consensus signature residues of the previously characterized OMPDCs. The only variation is in position 6, where there is an asparagine that is not predicted by POOL, just as in the Plasmodium falciparum OMPDC. PY01515 has a score of 39 (81% of the maximum value) against the consensus signature, using the BLOSUM62 scoring matrix.
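The derivation of the consensus signature from the Table 2 rows can be illustrated with a short majority-vote sketch. It rests on several assumptions that are not taken from the paper: "majority" is read as the same POOL-predicted residue type in more than half of the reference structures, the helper name consensus_signature is hypothetical, lowercase entries are treated as not predicted, and only the BLOSUM62 diagonal values needed for the self-score are included.

```python
from collections import Counter

# Rows of Table 2 for the five previously characterized OMPDCs.
# Uppercase = predicted by POOL; lowercase = not predicted.
REFERENCE_ROWS = {
    "1dbt": ["D11", "K33", "D60", "K62", "D65", "H88", "P182", "R215"],
    "1dvj": ["D20", "K42", "D70", "K72", "D75", "H98", "P180", "R203"],
    "1dqw": ["D37", "K59", "D91", "K93", "D96", "H122", "P202", "R235"],
    "1l2u": ["D22", "K44", "D71", "K73", "D76", "H99", "P189", "R222"],
    "2za1": ["D23", "K102", "D136", "K138", "D141", "n165", "P264", "R294"],
}

# BLOSUM62 self-substitution (diagonal) scores for the residue types involved.
BLOSUM62_DIAGONAL = {"D": 6, "K": 5, "H": 8, "P": 7, "R": 5}


def consensus_signature(rows):
    """Return, per aligned position, the residue type predicted in a majority
    of the rows, or None when no predicted type exceeds half of the rows."""
    n_rows = len(rows)
    signature = []
    for column in zip(*rows):
        # Keep only POOL-predicted residues (uppercase one-letter code).
        predicted = [res[0] for res in column if res[0].isupper()]
        residue, count = (Counter(predicted).most_common(1)[0]
                          if predicted else (None, 0))
        signature.append(residue if count > n_rows / 2 else None)
    return signature


signature = consensus_signature(list(REFERENCE_ROWS.values()))
max_score = sum(BLOSUM62_DIAGONAL[res] for res in signature)
print(signature)   # ['D', 'K', 'D', 'K', 'D', 'H', 'P', 'R']
print(max_score)   # 48
```

Under these assumptions the sketch recovers the signature (D, K, D, K, D, H, P, R) and the maximum score of 48 quoted in the text; position 6 is retained because histidine is predicted in four of the five reference structures.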
The strong match between the predicted active site for PY01515 and those of the previously characterized OMPDCs indicates that the putative OMPDC functional assignment is correct.

YP_001304206.1 - a probable new functional type in the Concanavalin A-like lectins/glucanases superfamily

YP_001304206.1 (PDB ID 3h3l) is a putative sugar hydrolase from Parabacteroides distasonis, a commensal bacterium of the human intestinal tract. YP_001304206.1 is a member of the Concanavalin A-like lectins/glucanases superfamily.

The POOL-predicted functionally important residues for YP_001304206.1 show poor spatial overlap with those of all of the enzymes of known function within the Concanavalin A-like lectins/glucanases superfamily. Figure 1 shows a structural alignment of the predicted residues for YP_001304206.1 with those of its closest Dali [10,37] structural match, endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase (PDB ID 2ayh), a representative member of the glycoside hydrolases family 16 (GH16). The residues for the query protein YP_001304206.1 are shown in gray and those for endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase are shown in pink. Table 3 shows an alignment at the 14 consensus signature positions of GH16 for the representative GH16, endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase, with the SG protein YP_001304206.1. Previously reported active site residues [38] are shown in boldface. POOL-predicted residues (top 8%) are shown in uppercase; residues not predicted are shown in lowercase. Note that the SG protein has a gap (no residue well aligned) at three of the consensus signature positions. For the alignment shown in Table 3, a negative BLOSUM62 score of -5 is obtained, corresponding to -5% of the maximum value of +97. The three catalytic residues for endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase, E105, D107, and E109 [38], form an EXDXE motif on a common beta sheet and are seen forming a vertical line through the center of Figure 1. Note that these three residues overlap spatially in the alignment with S140, E142, and Q144 in YP_001304206.1. The very poor match score (negative) suggests that the function of endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase cannot be transferred to YP_001304206.1.

Figure 1 Structural alignment of predicted residues for YP_001304206.1 (gray) with those of endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase (pink).

Table 3 Local structural alignment of the residues in the GH16 consensus signature positions for the known representative GH16, endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase, with the SG protein YP_001304206.1.

Spatial position →   1     2     3     4     5     6     7     8     9     10    11    12    13    14
2ayh (GH16)          F92   Y94   W103  d104  E105  D107  E109  L111  Q119  N121  Y123  Y147  W158  W184
3h3l (SG)            F125  -     Y138  -     s140  E142  q144  L146  A166  Y168  -     t187  H198  H256

Previously reported active site residues are shown in boldface. POOL-predicted residues (top 8%) are shown in uppercase; residues not predicted are shown in lowercase. The poor match suggests different functions.

While the predicted active residues for YP_001304206.1 have low scores with those of the previously characterized members of the superfamily, one interesting comparison does emerge. The superposition of the predicted residues for the query protein with those of thermophilic beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID 1m4w), a member of the xylanase/endoglucanase 11/12 family, shows some similarity in the catalytic residues. The reported active site residues [39] for thermophilic beta-1,4-xylanase from Nonomuraea flexuosa are Y78, E87, and E176. YP_001304206.1 possesses a spatially coincident triad in the local structural alignment consisting of the residues Y168, D153, and E232. This is illustrated in Figure 2, where the predicted residues for YP_001304206.1 (shown in gray) are structurally aligned with the predicted residues of the thermophilic beta-1,4-xylanase (shown in blue) from Nonomuraea flexuosa. The overlap of three of the predicted residues in the query protein, Y168, D153, and E232, with those of the catalytic residues of the xylanase, Y78, E87, and E176 is shown in the boxed region of Figure 2; a close-up of this region is shown in the large box on the right side of Figure 2. This suggests that the catalytic mechanism of the query protein may have similarities with that of the xylanase. However, as Figure 2 shows, the other residues, those involved in substrate recognition in the xylanase, are not very well conserved in YP_001304206.1. Furthermore, the predicted residues D98, D255, and H256 of YP_001304206.1, observed as a cluster in the center of Figure 2, appear to form a metal-binding motif that is not present in the xylanase. This suggests that YP_001304206.1 is a novel functional type in the Concanavalin A-like lectins/glucanases superfamily.

Figure 2 Structural alignment of the POOL-predicted residues for the structural genomics protein YP_001304206.1 (gray) with those of a beta-1,4-xylanase from Nonomuraea flexuosa (blue). The overlap of the three catalytic residues, E87, Y89, and E176 of the xylanase with the aligned, predicted residues from YP_001304206.1 is highlighted in the blue box and shown in close-up in the large box on the right.

An enoyl-CoA hydratase reported for Mycobacterium avium is incorrectly annotated

A structural genomics protein from Mycobacterium avium (PDB ID 3q1t), a potential target for the treatment of infectious disease, has been reported to be an enoyl-CoA hydratase (ECH). This SG protein and the ECHs are members of the ClpP/crotonase superfamily. The consensus signature residues for previously characterized ECHs were established using POOL predictions and SALSA. These residues, the spatial signature of an ECH catalytic site, are located in nine positions in the structural alignment. Then, the residues in the consensus signature were structurally aligned with residues in the SG M. avium structure. An alignment of the consensus signature residues, represented by enoyl-CoA hydratase from Rattus norvegicus (PDB ID 1ey3), with the corresponding spatially overlapping residues of the query protein, is shown in Table 4. Again, the rows represent individual protein structures and the columns represent spatial positions in the alignment. The known catalytic residues, A98, G141, E144, and E164 [40,41], are shown in boldface. Residues predicted by POOL are shown in uppercase and residues not predicted are shown in lowercase. The BLOSUM62 score between the SG protein and the known ECH is only 11, or 22% of the maximum value of 51, for these nine positions. Note further that the SG protein is missing the catalytic residues that correspond to E144 and E164 in the Rattus norvegicus ECH structure. These results strongly suggest that the reported enoyl-CoA hydratase annotation is incorrect.

Table 4 Local structural alignment of the predicted active site residues by SALSA for a known ECH from Rattus norvegicus (PDB ID 1ey3) with predicted residues for a Structural Genomics protein from Mycobacterium avium (PDB ID 3q1t), reported to be an ECH.

Spatial position →        1     2     3     4     5     6     7     8     9
Known ECH (1ey3)          A98   g141  E144  C149  D150  E164  R178  k241  N245
SG protein “ECH” (3q1t)   G76   A123  V126  a131  D132  H146  C160  k223  n227

Known catalytic residues are shown in boldface. Residues predicted by POOL are in uppercase; residues not predicted are in lowercase. Note the poor match between the residues of the SG protein with those of the representative ECH.

Comparison of the local site prediction for the SG protein with those of other members of the ClpP/crotonase superfamily reveals a much better match with ABDH (Anabaena beta diketone hydrolase), a known β-diketone hydrolase from Cyanobacterium anabaena (PDB ID 2j5s). The local alignment of the top POOL-predicted residues for the M. avium structure with residues from ABDH is shown in Table 5. The known catalytic residues for ABDH [42] are shown in boldface. Again, the columns represent overlapping spatial positions, but in Table 5 they are listed in order of the POOL rank for the M. avium structure (D155 is ranked first, H146 second, E244 third, ...). Thus all of the residues listed for the SG protein in Table 5 are predicted by POOL. Residues not predicted by POOL for ABDH are shown in lowercase. Notice that four of the top-ranked POOL residues for the SG protein are aligned with the known catalytic residues of ABDH: D153, H144, E243, and H43. The BLOSUM62 score between the SG protein and the known ABDH for these seven positions is 30, or 60% of the maximum value. These results suggest that the M. avium structure may be a β-diketone hydrolase, but perhaps with a native substrate different from that of the Cyanobacterium anabaena protein. Figure 3 illustrates the structural alignment of the top POOL-predicted residues for the SG M. avium structure (purple) with the corresponding residues from ABDH (green), showing that the known catalytic residues of ABDH have strong overlap with the top POOL-predicted residues for the SG protein.

Table 5 Local structural alignment of the predicted residues for the SG protein from Mycobacterium avium (PDB ID 3q1t) with the corresponding residues of ABDH from Cyanobacterium anabaena.

POOL ranking →            1     2     3     4     5     6     7
SG protein “ECH” (3q1t)   D155  H146  E244  D144  H44   C160  H156
Known ABDH (2j5s)         D153  H144  E243  D141  H43   l158  g154

The spatial positions 1 through 7 correspond to the ordinal values for the top seven residues in the POOL rank order for 3q1t. Known catalytic residues for ABDH are shown in boldface. Residues predicted by POOL are in uppercase; residues not predicted are in lowercase. Note that the match between the residues of the SG protein and of ABDH is better than that of Table 4.

Conclusions

Local structural matching, as implemented by the SALSA method, provides a more compelling prediction of biochemical function than a simple, global 3D structure match. SALSA can confirm putative annotations, identify misannotations, suggest correct annotations, and, in some cases of misannotation, predict a more probable functional annotation.

For any given protein structure of previously characterized function, the list of residues reported in the literature to be important for the biochemical function is a subset of the list of residues predicted by POOL. This longer list is a key advantage of the present method, as it enables better discrimination between the functional subclasses.

To date, one prediction made by local site matching using our electrostatics-based functional site prediction has been verified experimentally by direct biochemical assays [43]. Further experimental testing of SALSA function predictions is in progress.

The BLOSUM62 scoring matrix has been used to measure the quality of the match between two predicted active sites. Whether there exists a better scoring matrix for this purpose is currently under investigation. At the present time, there are too few SG proteins with experimentally
verified biochemical function to be able to translate the match score into a confidence metric, but as experimental testing progresses, this will become possible.

The SALSA method is amenable to automation and could be used to complement sequence-based function annotation methods, such as those evaluated in the CAFA experiments.

Figure 3 Structural alignment of the top POOL-predicted residues for the SG protein (purple; PDB 3q1t), reported to be an enoyl-CoA hydratase, with those of ABDH (green). H43, H144, D153, and E243 are known catalytic residues in ABDH.

Author information

ZW, PY, JSL, and RP are doctoral candidates in the Department of Chemistry and Chemical Biology at Northeastern University. SS earned the Ph.D. degree in Chemistry from Northeastern University in 2011 and is currently engaged in postdoctoral research at Yale University. MJO is Professor of Chemistry and Chemical Biology and is Principal Investigator of the Computational Biology Research Group at Northeastern University.

Abbreviations

ABDH: Anabaena beta-diketone hydrolase; BLOSUM: BLOcks of amino acid SUbstitution Matrix; CAFA: Critical Assessment of Function Annotation; CE: Combinatorial Extension; ECH: enoyl-CoA hydratase; GH16: glycoside hydrolase family 16; INTREPID: INformation-theoretic TREe traversal for Protein functional site Identification; OMP: orotidine monophosphate; OMPDC: orotidine 5’-monophosphate decarboxylase; PDB: Protein Data Bank; POOL: Partial Order Optimum Likelihood; RPBB: Ribulose Phosphate Binding Barrel; SALSA: Structurally Aligned Local Sites of Activity; SG: Structural Genomics; THEMATICS: THEoretical Microscopic Anomalous TItration Curve Shapes; UMP: uridine monophosphate.

Authors’ contributions

All six authors performed the calculations, participated in the development of the methodology, and contributed to the writing of the manuscript. ZW had primary responsibility for the analysis of the Concanavalin A-like lectins/glucanases, PY for the Mycobacterium avium SG protein, and JSL for the OMPDCs. ZW, PY, and JSL contributed equally to this work.

Competing interests

The authors declare that they have no competing interests.

Acknowledgements

The support of the NSF under grants number MCB-0843603 and MCB-1158176 is gratefully acknowledged. JSL is an NSF Graduate Research Fellow.

Declarations

This article has been published as part of BMC Bioinformatics Volume 14 Supplement 3, 2013: Proceedings of Automated Function Prediction SIG 2011 featuring the CAFA Challenge: Critical Assessment of Function Annotations. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S3.

Author details

Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115 USA. Current address: Department of Molecular, Cellular, and Developmental Biology, 219 Prospect Street, Kline Biology Tower Room 826, Yale University, New Haven, CT 06520-8103 USA.

Published: 28 February 2013

References

1. Westbrook J, Feng Z, Chen L, Yang H, Berman HM: The Protein Data Bank and structural genomics. Nucleic Acids Res 2003, 31:489-491.
2. Schnoes AM, Brown SD, Dodevski I, Babbitt PC: Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Comp Biol 2009, 5:e1000605.
3. Gilks WR, Audit B, de Angelis D, Tsoka S, Ouzounis CA: Percolation of annotation errors through hierarchically structured protein sequence databases. Math Biosci 2005, 193:223-234.
4. Llewellyn R, Eisenberg DS: Annotating proteins with generalized functional linkages. Proc Natl Acad Sci USA 2008, 105:17700-17705.
5. Lee D, Redfern O, Orengo C: Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007, 8:995-1005.
6. Loewenstein Y, Raimondo D, Redfern O, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A: Protein function annotation by homology-based inference. Genome Biology 2009, 207.
7. Sleator RD, Walsh P: An overview of in silico protein function prediction. Arch Microbiol 2010, 192:151-155.
8. Chi X, Hou J, Erdin S, Lisewski AM, Lichtarge O: An Iterative Approach of Protein Function Prediction: towards integration of similarity metrics. BMC Bioinformatics 2011, 12:437.
9. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
10. Holm L, Kaariainen S, Wilton C, Plewczynski D: Using Dali for structural comparison of proteins. Curr Protoc Bioinformatics 2006, Chapter 5: Unit 5.5.
11. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11:739-747.
12. Ilyin VA, Abyzov A, Leslin CM: Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci 2004, 13:1865-1874.
13. Rost B: Enzyme function less conserved than anticipated. J Mol Biol 2002, 318:595-608.
14. Ivanisenko VA, Pintus SS, Grigorovich DA, Kolchanov NA: PDBSiteScan: a program for searching for active, binding and posttranslational modification sites in the 3D structures of proteins. Nucleic Acids Res 2004, 32:W549-W554.
15. Meng EC, Polacco BJ, Babbitt PC: Superfamily active site templates. Proteins 2004, 55:962-976.
16. Binkowski T, Joachimiak A, Liang J: Protein surface analysis for function annotation in high-throughput structural genomics pipeline. Protein Science 2005, 14:2972-2981.
17. Shulman-Peleg A, Nussinov R, Wolfson H: SiteEngines: recognition and comparison of binding sites and protein-protein interfaces. Nucleic Acids Res 2005, 33:W337-W341.
18. Laskowski RA, Watson JD, Thornton JM: ProFunc: a server for predicting protein function from 3D structure. Nucl Acids Res 2005, 33:W89-W93.
19. Parasuram R, Lee JS, Yin P, Somarowthu S, Ondrechen MJ: Functional classification of protein 3D structures from predicted local interaction sites. J Bioinform Comput Biol 2010, 8(Suppl 1):1-15.
20. Goldsmith-Fischman S, Honig B: Structural genomics: computational methods for structure analysis. Protein Sci 2003, 12:1813-1821.
21. Laskowski RA, Watson JD, Thornton JM: From protein structure to biochemical function. J Struct Funct Genomics 2003, 4:167-177.
22. Tong W, Wei Y, Murga LF, Ondrechen MJ, Williams RJ: Partial Order Optimum Likelihood (POOL): Maximum Likelihood Prediction of Protein Active Site Residues Using 3D Structure and Sequence Properties. PLoS Comp Biol 2009, 5:e1000266.
23. Somarowthu S, Yang H, Hildebrand DGC, Ondrechen MJ: High-performance prediction of functional residues in proteins with machine learning and computed input features. Biopolymers 2011, 95:390-400.
24. Ko J, Murga LF, André P, Yang H, Ondrechen MJ, Williams RJ, Agunwamba A, Budil DE: Statistical criteria for the identification of protein active sites using theoretical microscopic titration curves. Proteins 2005, 59:183-195.
25. Wei Y, Ko J, Murga LF, Ondrechen MJ: Selective Prediction of Interaction Sites in Protein Structures with THEMATICS. BMC Bioinformatics 2007, 8:119.
26. Sankararaman S, Sjolander K: INTREPID: INformation-theoretic TREe traversal for Protein functional site IDentification. Bioinformatics 2008, 24:2445-2452.
27. Sankararaman S, Kolaczkowski B, Sjolander K: INTREPID: a web server for prediction of functionally important residues by evolutionary analysis. Nucleic Acids Res 2009, 37:W390-W395.
28. Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA: Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure. PLoS Comput Biol 2009, 5:e1000585.
29. Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for multiple sequence alignments. J Mol Biol 2000, 302:205-207.
30. Eddy SR: Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 2004, 22:1035-1036.
31. Harris P, Poulsen JN, Jensen K, Larsen S: Substrate binding induces domain movements in orotidine 5’-monophosphate decarboxylase. J Mol Biol 2002, 318:1019-1029.
32. Wu N, Mo Y, Gao J, Pai E: Structure and mechanism of the enzyme orotidine monophosphate decarboxylase. Proc Natl Acad Sci (USA) 2000, 97:2017-2022.
33. Wierenga RK: The TIM-barrel fold: A versatile framework for efficient enzymes. FEBS Lett 2001, 492:193-198.
34. Vedadi M, Lew J, Arz J, Amani M, Zhao Y, Dong A, Wasney G, Gao M, Hills T, Brokx S, et al: Genome-scale protein expression and structural biology of Plasmodium falciparum and related Apicomplexan organisms. Molecular and Biochemical Parasitology 2007, 151:100-110.
35. Appleby TC, Kinsland C, Begley TP, Ealick SE: The crystal structure and mechanism of orotidine 5’-monophosphate decarboxylase. Proc Natl Acad Sci USA 2000, 97:2005-2010.
36. Harris P, Navarro Poulsen JC, Jensen KF, Larsen S: Structural basis for the catalytic mechanism of a proficient enzyme: orotidine 5’-monophosphate decarboxylase. Biochemistry 2000, 39:4217-4224.
37. Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics 2000, 16:566-567.
38. Hahn M, Keitel T, Heinemann U: Crystal and molecular structure at 0.16-nm resolution of the hybrid Bacillus endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase H(A16-M). Eur J Biochem 1995, 232:849-858.
39.
Hakulinen N, Turunen O, Janis J, Leisola M, Rouvinen J: Three-dimensional structures of thermophilic beta-1,4-xylanases from Chaetomium thermophilum and Nonomuraea flexuosa. Eur J Biochem 2003, 270:1399-1412. 40. Muller-Newen G, Janssen U, Stoffel W: Enoyl-CoA hydratase and isomerase form a superfamily with a common active-site glutamate residue. Eur J Biochem 1995, 228:68-73. 41. Bell AF, Feng Y, Hofstein HA, Parikh S, Wu J, Rudolph MJ, Kisker C, Whitty A, Tonge PJ: Stereoselectivity of enoyl-CoA hydratase results from preferential activation of one of two bound substrate conformers. Chem Biol 2002, 9:1247-1255. 42. Bennett JP, Whittingham JL, Brzozowski AM, Leonard PM, Grogan G: Structural characterization of a beta-diketone hydrolase from the cyanobacterium Anabaena sp. PCC 7120 in native and product-bound forms, a coenzyme A-independent member of the crotonase suprafamily. Biochemistry 2007, 46:137-144. 43. Han GW, Ko J, Farr CL, Deller MC, Xu Q, Chiu H-J, Miller MD, Sefcikova J, Somarowthu S, Beuning PJ, et al: Crystal structure of a metal-dependent phosphoesterase (YP_910028.1) from Bifidobacterium adolescentis: Computational prediction and experimental validation of phosphoesterase activity. Proteins 2011, 79:2146-2160. doi:10.1186/1471-2105-14-S3-S13 Cite this article as: Wang et al.: Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs). BMC Bioinformatics 2013 14(Suppl 3):S13. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color figure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Bioinformatics Springer Journals

Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs)



Publisher
Springer Journals
Copyright
Copyright © 2013 by Wang et al.; licensee BioMed Central Ltd.
Subject
Life Sciences; Bioinformatics; Microarrays; Computational Biology/Bioinformatics; Computer Appl. in Life Sciences; Combinatorial Libraries; Algorithms
eISSN
1471-2105
DOI
10.1186/1471-2105-14-S3-S13
pmid
23514271

Abstract

Background: The prediction of biochemical function from the 3D structure of a protein has proved to be much more difficult than was originally foreseen. A reliable method to test the likelihood of putative annotations and to predict function from structure would add tremendous value to structural genomics data. We report on a new method, Structurally Aligned Local Sites of Activity (SALSA), for the prediction of biochemical function based on a local structural match at the predicted catalytic or binding site.

Results: Implementation of the SALSA method is described. For the structural genomics protein PY01515 (PDB ID 2aqw) from Plasmodium yoelii, it is shown that the putative annotation, Orotidine 5’-monophosphate decarboxylase (OMPDC), is most likely correct. SALSA analysis of YP_001304206.1 (PDB ID 3h3l), a putative sugar hydrolase from Parabacteroides distasonis, shows that its active site does not bear close resemblance to any previously characterized member of its superfamily, the Concanavalin A-like lectins/glucanases. It is noted that three residues in the active site of the thermophilic beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID 1m4w), Y78, E87, and E176, overlap with POOL-predicted residues of similar type, Y168, D153, and E232, in YP_001304206.1. The substrate recognition regions of the two proteins are rather different, suggesting that YP_001304206.1 is a new functional type within the superfamily. A structural genomics protein from Mycobacterium avium (PDB ID 3q1t) has been reported to be an enoyl-CoA hydratase (ECH), but SALSA analysis shows a poor match between the predicted residues for the SG protein and those of known ECHs. A better local structural match is obtained with Anabaena beta-diketone hydrolase (ABDH), a known β-diketone hydrolase from Cyanobacterium anabaena (PDB ID 2j5s). This suggests that the reported ECH function of the SG protein is incorrect and that it is more likely a β-diketone hydrolase.
Conclusions: A local site match provides a more compelling function prediction than that obtainable from a simple 3D structure match. The present method can confirm putative annotations, identify misannotation, and in some cases suggest a more probable annotation.

* Correspondence: [email protected]
Department of Chemistry and Chemical Biology, Northeastern University, Boston, MA 02115 USA
Full list of author information is available at the end of the article
© 2013 Wang et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Wang et al. BMC Bioinformatics 2013, 14(Suppl 3):S13 http://www.biomedcentral.com/1471-2105/14/S3/S13

Background

There are currently over 11,000 structural genomics (SG) protein structures in the Protein Data Bank (PDB) [1] and most of them are of unknown or uncertain function, as the inference of function from structure has proved to be more difficult than anticipated. Furthermore, when new structures of unknown function are determined, it is common practice to make a tentative functional assignment from the closest sequence match or the best 3D structure match to an annotated protein. Such tentative functional assignments are often incorrect [2]. Furthermore, one annotation error can propagate or “percolate” [2-4] in databases as additional proteins are annotated by automated or semi-automated means.

Overviews of current methods for the functional annotation of proteins from their sequence and/or structure have been given in recent reviews [5-8]. The simplest and most commonly employed [6] methods seek the closest sequence matches using a search program such as BLAST [9], or alternatively the closest 3D structure match obtained from e.g. Dali [10], Combinatorial Extension (CE) [11], or Topofit [12], and then just transfer the function from the closest match to the query protein. However, even relatively high sequence similarity does not necessarily imply similar function [13]. Other types of sequence-based methods employ motif searching, phylogenetic profiling, or genome context. The Critical Assessment of Function Annotation (CAFA) experiment (http://biofunctionprediction.org/) seeks to assess the state of the current art of function prediction, chiefly from sequence. The aim of this work is to exploit structural information, together with computed chemical properties, to enhance function prediction capabilities.

It was hoped that SG would provide functional annotations for the protein products of newly-sequenced coding genes, as indeed the 3D structure can sometimes be indicative of function. Simple protein fold comparison does work in some cases, as domains having a common fold sometimes do have the same function. However, many folds have multiple functions. For instance, the Rossmann fold and the TIM barrel each represent more than 50 different functions. The use of local 3D structural motifs or templates, a feature of the present method, is now emerging as a more promising path for correct functional annotation from structure [14-19].

In spite of recent advances in protein function prediction, inference of biochemical function from the structure is difficult [20,21]. Hundreds of SG structures have no functional assignment at all and, for thousands of other SG proteins, functional hypotheses for SG proteins are putative and uncertain. Not all such hypotheses will prove in time to be correct, as examples below will illustrate. The ability to determine function from the 3D structure would add great value to this growing volume of SG data.

A different approach to functional annotation from 3D structure is presented here and is based on the combination of functional site prediction with local 3D structural alignment. Functional site predictions are obtained from Partial Order Optimum Likelihood (POOL) [22,23], a monotonicity-constrained maximum likelihood method, using computed chemical, electrostatic, and geometric properties, as well as phylogenetic information (if available), as input features. POOL places all of the residues in the input protein structure into an ordered list, ranked according to probability of participation in the active site. The top-ranked residues constitute the active site prediction. Structural alignments are obtained for sets of these local sites. Characteristic spatial patterns of predicted residues at the structurally aligned local sites of activity (SALSAs) are then used to identify specific types of biochemical function. The quality of the match of the predicted functional site in the query protein to functional sites in proteins of known function is measured using a scoring function. The present method can determine whether a putative functional assignment is likely to be correct or incorrect. In some cases where a protein is shown to be misannotated, a probable functional assignment is made.

Methods

Functional residue predictions were made using POOL [22,23]. Input features for each residue in a given structure include: electrostatics information, as contained in THEMATICS metrics [24,25]; phylogenetic information from INTREPID [26,27]; and geometric information from ConCavity (structure only version) [28]. The top-ranked residues in the POOL output constitute the functional site prediction. Cut-off limits are specified for each case.

Multiple structure alignments are made for each set of proteins. The structural alignment of multiple structures of diverse function can be difficult and therefore multiple alignment methods [11,12,29] may be needed for some cases. In the examples shown here, T-Coffee [29] is used. For present purposes, a full alignment is not necessary. A quality alignment is only required in the local spatial region of the predicted active site.

SALSA tables are constructed for the locally aligned residues in the predicted active site. In a SALSA table, the rows represent individual protein structures and the columns represent spatially aligned positions.

Consensus signatures for a given functional subclass are established using POOL predictions on a set of previously characterized proteins with the same biochemical function, usually with common fold. To maximize sequence diversity in this reference set, sets of structures are sought with the lowest possible sequence identity among them. POOL-predicted residues of the same amino acid type in the same spatial position for the majority of the previously characterized proteins of common biochemical function then constitute the consensus signature for that functional group. The consensus signature for a given biochemical function thus consists of a series of amino acid types in specified spatial positions.

SG proteins of unknown or uncertain function are analyzed by POOL and the predictions are aligned with those of proteins of known function, or with the consensus signature.

Scoring the match between the predicted active site for the query protein and that of the consensus signature is performed using the BLOSUM62 matrix [30]. Scores are reported as a percentage of the maximum value (i.e. the score for the perfect match, the consensus signature with itself).

Results and discussion
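The match scores reported for the OMPDC, ECH, and ABDH comparisons follow the scoring rule given in Methods: sum the BLOSUM62 entries over the aligned signature positions, then express the sum as a percentage of the self-score of the reference signature. The Python sketch below illustrates only that arithmetic, under stated assumptions. The function names (`pair_score`, `site_score`, `percent_of_max`) are hypothetical and are not taken from the SALSA implementation, and only the BLOSUM62 entries needed for the examples are hand-copied into the script. The text does not specify how gaps or residues not predicted by POOL are scored, so the sketch does not model them.

```python
# Minimal sketch of percent-of-maximum scoring with BLOSUM62.
# Assumption: one-letter residue strings stand in for rows of a SALSA table;
# letter case (POOL-predicted or not) is ignored here.

# BLOSUM62 diagonal (self-substitution) scores for the 20 standard residues.
BLOSUM62_DIAGONAL = {
    "A": 4, "R": 5, "N": 6, "D": 6, "C": 9, "Q": 5, "E": 5, "G": 6,
    "H": 8, "I": 4, "L": 4, "K": 5, "M": 5, "F": 6, "P": 7, "S": 4,
    "T": 5, "W": 11, "Y": 7, "V": 4,
}

# Only the off-diagonal BLOSUM62 entries used in the example below.
# A full implementation would load the complete matrix instead.
BLOSUM62_OFF_DIAGONAL = {
    frozenset("CL"): -1,
    frozenset("GH"): -2,
}


def pair_score(a: str, b: str) -> int:
    """BLOSUM62 score for one aligned pair of residues."""
    a, b = a.upper(), b.upper()
    if a == b:
        return BLOSUM62_DIAGONAL[a]
    # Raises KeyError for pairs not copied into this sketch.
    return BLOSUM62_OFF_DIAGONAL[frozenset((a, b))]


def site_score(reference: str, other: str) -> int:
    """Sum of pair scores over spatially aligned positions."""
    if len(reference) != len(other):
        raise ValueError("aligned sites must have the same number of positions")
    return sum(pair_score(a, b) for a, b in zip(reference, other))


def percent_of_max(reference: str, other: str) -> float:
    """Score as a percentage of the reference scored against itself."""
    return 100.0 * site_score(reference, other) / site_score(reference, reference)


if __name__ == "__main__":
    # Self-scores of the signatures quoted in the text.
    print(site_score("DKDKDHPR", "DKDKDHPR"))              # OMPDC, 8 positions
    print(site_score("AGECDERKN", "AGECDERKN"))            # ECH, 9 positions
    print(site_score("FYWDEDELQNYYWW", "FYWDEDELQNYYWW"))  # GH16, 14 positions
    # 3q1t residues against the aligned ABDH residues, 7 positions.
    print(site_score("DHEDHCH", "DHEDHLG"), percent_of_max("DHEDHCH", "DHEDHLG"))
```

With these entries, the self-scores of the OMPDC, ECH, and GH16 signatures come out as 48, 51, and 97, in agreement with the maximum values quoted in the text, and the seven aligned positions of 3q1t and ABDH give 30, or 60% of the self-score of the 3q1t residues.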
Confirmation of annotation for PY01515, a putative OMPDC

Orotidine 5’-monophosphate decarboxylase (OMPDC) catalyzes one step in the pyrimidine biosynthesis pathway. It catalyzes the metal ion dependent decarboxylation of orotidine monophosphate (OMP) to uridine monophosphate (UMP) and CO2 [31,32]. OMPDC is a member of the ribulose phosphate binding barrel (RPBB) superfamily and has a TIM barrel [33] structure, with the active site located inside the beta barrel, spanning the eight beta strands. The structural genomics protein PY01515 (PDB ID 2aqw) is a putative OMPDC from Plasmodium yoelii [34].

The POOL-predicted functional site for PY01515 was aligned with eight different functional site types predicted by POOL for structures in the RPBB superfamily and a strong match was found with that of the OMPDCs and not with the other seven functional types. Five previously characterized OMPDC structures, those from Bacillus subtilis (PDB ID 1dbt), Methanothermobacter thermautotrophicus (PDB ID 1dvj), Saccharomyces cerevisiae (PDB ID 1dqw), Escherichia coli (PDB ID 1l2u), and Plasmodium falciparum (PDB ID 2za1), were used to establish the consensus signature of an OMPDC active site. These five previously characterized OMPDCs represent considerable sequence diversity, as shown in Table 1. With the exception of structures 1 and 4, which share sequence identity of 60%, all other pairs of structures have sequence identities in the 6% - 30% range.

Table 1 Sequence identity matrix for five previously characterized OMPDCs (structures 1-5) and the SG protein PY01515 (PDB ID 2aqw).

       PDB ID   1dbt    1dvj    1dqw    1l2u    2za1    2aqw
1      1dbt     -       0.240   0.260   0.600   0.060   0.240
2      1dvj     0.240   -       0.280   0.280   0.120   0.220
3      1dqw     0.260   0.280   -       0.300   0.080   0.280
4      1l2u     0.600   0.280   0.300   -       0.060   0.200
5      2za1     0.060   0.120   0.080   0.060   -       0.020
6 (SG) 2aqw     0.240   0.220   0.280   0.200   0.020   -

For the five previously characterized OMPDCs, the important residues are predicted using the top 9% of the residues, as ranked by POOL, for each protein structure. When these five predicted active sites are structurally aligned, eight spatial positions are found to have common predicted residues across the five diverse, previously characterized OMPDCs. Table 2 shows this local structural alignment. The rows in Table 2 represent individual protein structures, with the five previously characterized OMPDCs listed first; the last row is the query protein from SG. The columns represent spatially coincident positions in the local structural alignment. The residues predicted by POOL are shown in uppercase; residues in lowercase are not in the top 9% of the POOL rankings. The previously reported catalytic residues [35,36] are shown in boldface. Positions 1-8 are positions in the consensus prediction, i.e. similar residues are predicted by POOL for the majority of the previously characterized OMPDCs. The row above each position gives the beta strand on which that position is located. For positions 1-5, 7, and 8, an identical residue is predicted by POOL for all five previously characterized OMPDCs. At position 6, a histidine is predicted for four out of the five previously characterized OMPDCs. For the Plasmodium falciparum structure, there is an asparagine, not predicted by POOL, at position 6. The consensus signature may be abbreviated as (D, K, D, K, D, H, P, R). The combination of residue types at the eight positions shown in Table 2 is unique to OMPDC within the RPBB superfamily. For instance, the lysine in position 2 and the proline in position 7 are not observed in the equivalent positions for any of the seven other functional subclasses of the RPBB superfamily.

Table 2 Local structural alignment of the consensus signature residues for the OMPDCs.

Structurally aligned signature active site residues for OMPDC
Beta strand     β1     β2     [------ β3 ------]    β4     β7     β8
Position        1      2      3      4      5       6      7      8
Protein  1dbt   D11    K33    D60    K62    D65     H88    P182   R215
         1dvj   D20    K42    D70    K72    D75     H98    P180   R203
         1dqw   D37    K59    D91    K93    D96     H122   P202   R235
         1l2u   D22    K44    D71    K73    D76     H99    P189   R222
         2za1   D23    K102   D136   K138   D141    n165   P264   R294
SG       2aqw   D23    K105   D139   K141   D144    n168   P267   R297

The first five rows represent previously-characterized OMPDCs. The sixth row is a putative OMPDC from Structural Genomics. The columns represent spatially coincident positions in the structural alignment; these positions are numbered 1-8. Known catalytic residues are shown in boldface. POOL-predicted residues are shown in uppercase; residues not predicted by POOL are shown in lowercase. The beta strand on which each position is located is given at the top of the column, above the position number. The good match between the SG protein and the known OMPDCs suggests common function.

The quality of a match with the consensus signature may be measured using a scoring matrix. Using the BLOSUM62 [30] matrix, the first four proteins listed in Table 2 have a score of 48 with the consensus signature; this score is 100% of the maximum value. The Plasmodium falciparum structure has a score of 39 (81% of the maximum value) against the consensus signature.

The structurally aligned residues for the SG protein PY01515 from Plasmodium yoelii are shown in the last row of Table 2. For seven out of the eight positions, POOL predicts residues that are identical to the consensus signature residues of the previously characterized OMPDCs. The only variation is in position 6, where there is an asparagine that is not predicted by POOL, just as in the Plasmodium falciparum OMPDC. PY01515 has a score of 39 (81% of the maximum value) against the consensus signature, using the BLOSUM62 scoring matrix. The strong match between the predicted active site for PY01515 and those of the previously characterized OMPDCs indicates that the putative OMPDC functional assignment is correct.

YP_001304206.1 - a probable new functional type in the Concanavalin A-like lectins/glucanases superfamily

YP_001304206.1 (PDB ID 3h3l) is a putative sugar hydrolase from Parabacteroides distasonis, a commensal bacterium of the human intestinal tract. YP_001304206.1 is a member of the Concanavalin A-like lectins/glucanases superfamily.

The POOL-predicted functionally important residues for YP_001304206.1 show poor spatial overlap with those of all of the enzymes of known function within the Concanavalin A-like lectins/glucanases superfamily. Figure 1 shows a structural alignment of the predicted residues for YP_001304206.1 with those of its closest Dali [10,37] structural match, endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase (PDB ID 2ayh), a representative member of the glycoside hydrolases family 16 (GH16). The residues for the query protein YP_001304206.1 are shown in gray and those for endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase are shown in pink. Table 3 shows an alignment at the 14 consensus signature positions of GH16 for the representative GH16, endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase, with the SG protein YP_001304206.1. Previously reported active site residues [38] are shown in boldface. POOL-predicted residues (top 8%) are shown in uppercase; residues not predicted are shown in lowercase. Note that the SG protein has a gap (no residue well aligned) at three of the consensus signature positions. For the alignment shown in Table 3, a negative BLOSUM62 score of -5 is obtained, corresponding to -5% of the maximum value of +97. The three catalytic residues for endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase, E105, D107, and E109 [38], form an EXDXE motif on a common beta sheet and are seen forming a vertical line through the center of Figure 1. Note that these three residues overlap spatially in the alignment with S140, E142, and Q144 in YP_001304206.1. The very poor match score (negative) suggests that the function of endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase cannot be transferred to YP_001304206.1.

Figure 1 Structural alignment of predicted residues for YP_001304206.1 (gray) with those of endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase (pink).

Table 3 Local structural alignment of the residues in the GH16 consensus signature positions for the known representative GH16, endo-1,3-1,4-beta-D-glucan 4-glucanohydrolase, with the SG protein YP_001304206.1.

Spatial position   1     2     3     4     5     6     7     8     9     10    11    12    13    14
2ayh (GH16)        F92   Y94   W103  d104  E105  D107  E109  L111  Q119  N121  Y123  Y147  W158  W184
3h3l (SG)          F125  -     Y138  -     s140  E142  q144  L146  A166  Y168  -     t187  H198  H256

Previously reported active site residues are shown in boldface. POOL-predicted residues (top 8%) are shown in uppercase; residues not predicted are shown in lowercase. The poor match suggests different functions.

While the predicted active residues for YP_001304206.1 have low scores with those of the previously characterized members of the superfamily, one interesting comparison does emerge. The superposition of the predicted residues for the query protein with those of thermophilic beta-1,4-xylanase from Nonomuraea flexuosa (PDB ID 1m4w), a member of the xylanase/endoglucanase 11/12 family, shows some similarity in the catalytic residues. The reported active site residues [39] for thermophilic beta-1,4-xylanase from Nonomuraea flexuosa are Y78, E87, and E176. YP_001304206.1 possesses a spatially coincident triad in the local structural alignment consisting of the residues Y168, D153, and E232. This is illustrated in Figure 2, where the predicted residues for YP_001304206.1 (shown in gray) are structurally aligned with the predicted residues of the thermophilic beta-1,4-xylanase (shown in blue) from Nonomuraea flexuosa. The overlap of three of the predicted residues in the query protein, Y168, D153, and E232, with those of the catalytic residues of the xylanase, Y78, E87, and E176 is shown in the boxed region of Figure 2; a close-up of this region is shown in the large box on the right side of Figure 2. This suggests that the catalytic mechanism of the query protein may have similarities with that of the xylanase. However, as Figure 2 shows, the other residues, those involved in substrate recognition in the xylanase, are not very well conserved in YP_001304206.1. Furthermore, the predicted residues D98, D255, and H256 of YP_001304206.1, observed as a cluster in the center of Figure 2, appear to form a metal-binding motif that is not present in the xylanase. This suggests that YP_001304206.1 is a novel functional type in the Concanavalin A-like lectins/glucanases superfamily.

Figure 2 Structural alignment of the POOL-predicted residues for the structural genomics protein YP_001304206.1 (gray) with those of a beta-1,4-xylanase from Nonomuraea flexuosa (blue). The overlap of the three catalytic residues, E87, Y89, and E176 of the xylanase with the aligned, predicted residues from YP_001304206.1 is highlighted in the blue box and shown in close-up in the large box on the right.

An enoyl-CoA hydratase reported for Mycobacterium avium is incorrectly annotated

A structural genomics protein from Mycobacterium avium (PDB ID 3q1t), a potential target for the treatment of infectious disease, has been reported to be an enoyl-CoA hydratase (ECH). This SG protein and the ECHs are members of the ClpP/crotonase superfamily. The consensus signature residues for previously characterized ECHs were established using POOL predictions and SALSA. These residues, the spatial signature of an ECH catalytic site, are located in nine positions in the structural alignment. Then, the residues in the consensus signature were structurally aligned with residues in the SG M. avium structure. An alignment of the consensus signature residues, represented by enoyl-CoA hydratase from Rattus norvegicus (PDB ID 1ey3), with the corresponding spatially overlapping residues of the query protein, is shown in Table 4. Again, the rows represent individual protein structures and the columns represent spatial positions in the alignment. The known catalytic residues, A98, G141, E144, and E164 [40,41], are shown in boldface. Residues predicted by POOL are shown in uppercase and residues not predicted are shown in lowercase. The BLOSUM62 score between the SG protein and the known ECH is only 11, or 22% of the maximum value of 51, for these nine positions. Note further that the SG protein is missing the catalytic residues that correspond to E144 and E164 in the Rattus norvegicus ECH structure. These results strongly suggest that the reported enoyl-CoA hydratase annotation is incorrect.

Table 4 Local structural alignment of the predicted active site residues by SALSA for a known ECH from Rattus norvegicus (PDB ID 1ey3) with predicted residues for a Structural Genomics protein from Mycobacterium avium (PDB ID 3q1t), reported to be an ECH.

Spatial position          1     2     3     4     5     6     7     8     9
Known ECH (1ey3)          A98   g141  E144  C149  D150  E164  R178  k241  N245
SG protein “ECH” (3q1t)   G76   A123  V126  a131  D132  H146  C160  k223  n227

Known catalytic residues are shown in boldface. Residues predicted by POOL are in uppercase; residues not predicted are in lowercase. Note the poor match between the residues of the SG protein with those of the representative ECH.

Comparison of the local site prediction for the SG protein with those of other members of the ClpP/crotonase superfamily reveals a much better match with ABDH (Anabaena beta diketone hydrolase), a known β-diketone hydrolase from Cyanobacterium anabaena (PDB ID 2j5s). The local alignment of the top POOL-predicted residues for the M. avium structure with residues from ABDH is shown in Table 5. The known catalytic residues for ABDH [42] are shown in boldface. Again, the columns represent overlapping spatial positions, but in Table 5 they are listed in order of the POOL rank for the M. avium structure (D155 is ranked first, H146 second, E244 third, ...). Thus all of the residues listed for the SG protein in Table 5 are predicted by POOL. Residues not predicted by POOL for ABDH are shown in lowercase. Notice that four of the top-ranked POOL residues for the SG protein are aligned with the known catalytic residues of ABDH: D153, H144, E243, and H43. The BLOSUM62 score between the SG protein and the known ABDH for these seven positions is 30, or 60% of the maximum value. These results suggest that the M. avium structure may be a β-diketone hydrolase, but perhaps with a native substrate different from that of the Cyanobacterium anabaena protein. Figure 3 illustrates the structural alignment of the top POOL-predicted residues for the SG M. avium structure (purple) with the corresponding residues from ABDH (green), showing that the known catalytic residues of ABDH have strong overlap with the top POOL-predicted residues for the SG protein.

Table 5 Local structural alignment of the predicted residues for the SG protein from Mycobacterium avium (PDB ID 3q1t) with the corresponding residues of ABDH from Cyanobacterium anabaena.

POOL ranking              1     2     3     4     5     6     7
SG protein “ECH” (3q1t)   D155  H146  E244  D144  H44   C160  H156
Known ABDH (2j5s)         D153  H144  E243  D141  H43   l158  g154

The spatial positions 1 through 7 correspond to the ordinal values for the top seven residues in the POOL rank order for 3q1t. Known catalytic residues for ABDH are shown in boldface. Residues predicted by POOL are in uppercase; residues not predicted are in lowercase. Note that the match between the residues of the SG protein and of ABDH is better than that of Table 4.

Figure 3 Structural alignment of the top POOL-predicted residues for the SG protein (purple; PDB 3q1t), reported to be an enoyl-CoA hydratase, with those of ABDH (green). H43, H144, D153, and E243 are known catalytic residues in ABDH.

Conclusions

Local structural matching, as implemented by the SALSA method, provides a more compelling prediction of biochemical function than a simple, global 3D structure match. SALSA can confirm putative annotations, identify misannotations, suggest correct annotations, and, in some cases of misannotation, predict a more probable functional annotation.

For any given protein structure of previously characterized function, the list of residues reported in the literature to be important for the biochemical function is a subset of the list of residues predicted by POOL. This longer list is a key advantage of the present method, as it enables better discrimination between the functional subclasses.

To date, one prediction made by local site matching using our electrostatics-based functional site prediction has been verified experimentally by direct biochemical assays [43]. Further experimental testing of SALSA function predictions is in progress.

The BLOSUM62 scoring matrix has been used to measure the quality of the match between two predicted active sites. Whether there exists a better scoring matrix for this purpose is currently under investigation. At the present time, there are too few SG proteins with experimentally verified biochemical function to be able to translate the match score into a confidence metric, but as experimental testing progresses, this will become possible.

The SALSA method is amenable to automation and could be used to complement sequence-based function annotation methods, such as those evaluated in the CAFA experiments.

Author information
ZW, PY, JSL, and RP are doctoral candidates in the Department of Chemistry and Chemical Biology at Northeastern University. SS earned the Ph.D. degree in Chemistry from Northeastern University in 2011 and is currently engaged in postdoctoral research at Yale University. MJO is Professor of Chemistry and Chemical Biology and is Principal Investigator of the Computational Biology Research Group at Northeastern University.

References
3. Gilks WR, Audit B, de Angelis D, Tsoka S, Ouzounis CA: Percolation of annotation errors through hierarchically structured protein sequence databases. Math Biosci 2005, 193:223-234.
4. Llewellyn R, Eisenberg DS: Annotating proteins with generalized functional linkages. Proc Natl Acad Sci USA 2008, 105:17700-17705.
5. Lee D, Redfern O, Orengo C: Predicting protein function from sequence and structure. Nat Rev Mol Cell Biol 2007, 8:995-1005.
6. Loewenstein Y, Raimondo D, Redfern O, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A: Protein function annotation by homology-based inference. Genome Biology 2009, 207.
7. Sleator RD, Walsh P: An overview of in silico protein function prediction. Arch Microbiol 2010, 192:151-155.
8. Chi X, Hou J, Erdin S, Lisewski AM, Lichtarge O: An Iterative Approach of Protein Function Prediction: towards integration of similarity metrics. BMC Bioinformatics 2011, 12:437.
9. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
10. Holm L, Kaariainen S, Wilton C, Plewczynski D: Using Dali for structural comparison of proteins. Curr Protoc Bioinformatics 2006, Chapter 5: Unit 5.5.
11. Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 1998, 11:739-747.
12. Ilyin VA, Abyzov A, Leslin CM: Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci 2004, 13:1865-1874.
13. Rost B: Enzyme function less conserved than anticipated. J Mol Biol 2002, 318:595-608.
14. Ivanisenko VA, Pintus SS, Grigorovich DA, Kolchanov NA: PDBSiteScan: a program for searching for active, binding and posttranslational modification sites in the 3D structures of proteins. Nucleic Acids Res 2004, 32:W549-W554.
15. Meng EC, Polacco BJ, Babbitt PC: Superfamily active site templates. Proteins 2004, 55:962-976.

Abbreviations
ABDH: Anabaena beta-diketone hydrolase; BLOSUM: BLOcks of amino acid SUbstitution Matrix; CAFA: Critical Assessment of Function Annotation; CE: Combinatorial Extension; ECH: enoyl-CoA hydratase; GH16: glycoside hydrolase family 16; INTREPID: INformation-theoretic TREe traversal for Protein functional site Identification; OMP: orotidine monophosphate; OMPDC: orotidine 5’-monophosphate decarboxylase; PDB: Protein Data Bank; POOL: Partial Order Optimum Likelihood; RPBB: Ribulose Phosphate Binding Barrel; SALSA: Structurally Aligned Local Sites of Activity; SG: Structural Genomics; THEMATICS: THEoretical Microscopic Anomalous
TItration Curve Shapes; UMP: uridine monophosphate. 16. Binkowski T, Joachimiak A, Liang J: Protein surface analysis for function annotation in high-throughput structural genomics pipeline. Protein Authors’ contributions Science 2005, 14:2972-2981. All six authors performed the calculations, participated in the development 17. Shulman-Peleg A, Nussinov R, Wolfson H: SiteEngines: recognition and of the methodology, and contributed to the writing of the manuscript. ZW comparison of binding sites and protein-protein interfaces. Nucleic Acids had primary responsibility for the analysis of the Concanavalin A-like lectins/ Res 2005, 33:W337-W341. glucanases, PY for the Mycobacterium avium SG protein, and JSL for the 18. Laskowski RA, Watson JD, Thornton JM: ProFunc: a server for predicting OMPDCs. ZW, PY, and JSL contributed equally to this work. protein function from 3D structure. Nucl Acids Res 2005, 33:W89-W93. 19. Parasuram R, Lee JS, Yin P, Somarowthu S, Ondrechen MJ: Functional Competing interests classification of protein 3D structures from predicted local interaction The authors declare that they have no competing interests. sites. J Bioinform Comput Biol 2010, 8(Suppl 1):1-15. 20. Goldsmith-Fischman S, Honig B: Structural genomics: computational Acknowledgements methods for structure analysis. Protein Sci 2003, 12:1813-1821. The support of the NSF under grants number MCB-0843603 and MCB- 21. Laskowski RA, Watson JD, Thornton JM: From protein structure to 1158176 is gratefully acknowledged. JSL is an NSF Graduate Research Fellow. biochemical function. J Struct Funct Genomics 2003, 4:167-177. 22. Tong W, Wei Y, Murga LF, Ondrechen MJ, Williams RJ: Partial Order Declarations Optimum Likelihood (POOL): Maximum Likelihood Prediction of Protein This article has been published as part of BMC Bioinformatics Volume 14 Active Site Residues Using 3D Structure and Sequence Properties. 
PLoS Supplement 3, 2013: Proceedings of Automated Function Prediction SIG Comp Biol 2009, 5:e1000266. 2011 featuring the CAFA Challenge: Critical Assessment of Function 23. Somarowthu S, Yang H, Hildebrand DGC, Ondrechen MJ: High- Annotations. The full contents of the supplement are available online at performance prediction of functional residues in proteins with http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S3. machine learning and computed input features. Biopolymers 2011, 95:390-400. Author details 24. Ko J, Murga LF, André P, Yang H, Ondrechen MJ, Williams RJ, Department of Chemistry and Chemical Biology, Northeastern University, Agunwamba A, Budil DE: Statistical criteria for the identification of Boston, MA 02115 USA. Current address: Department of Molecular, Cellular, protein active sites using theoretical microscopic titration curves. and Developmental Biology, 219 Prospect Street, Kline Biology Tower Room Proteins 2005, 59:183-195. 826, Yale University, New Haven, CT 06520-8103 USA. 25. Wei Y, Ko J, Murga LF, Ondrechen MJ: Selective Prediction of Interaction Sites in Protein Structures with THEMATICS. BMC Bioinformatics 2007, Published: 28 February 2013 8:119. 26. Sankararaman S, Sjolander K: INTREPID: INformation-theoretic TREe References traversal for Protein functional site IDentification. Bioinformatics 2008, 1. Westbrook J, Feng Z, Chen L, Yang H, Berman HM: The Protein Data Bank 24:2445-2452. and structural genomics. Nucleic Acids Res 2003, 31:489-491. 27. Sankararaman S, Kolaczkowski B, Sjolander K: INTREPID: a web server for 2. Schnoes AM, Brown SD, Dodevski I, Babbitt PC: Annotation Error in Public prediction of functionally important residues by evolutionary analysis. Databases: Misannotation of Molecular Function in Enzyme Nucleic Acids Res 2009, 37:W390-W395. Superfamilies. PLoS Comp Biol 2009, 5:e1000605. Wang et al. BMC Bioinformatics 2013, 14(Suppl 3):S13 Page 9 of 9 http://www.biomedcentral.com/1471-2105/14/S3/S13 28. 
Capra JA, Laskowski RA, Thornton JM, Singh M, Funkhouser TA: Predicting Protein Ligand Binding Sites by Combining Evolutionary Sequence Conservation and 3D Structure. PLoS Comput Biol 2009, 5:e1000585. 29. Notredame C, Higgins DG, Heringa J: T-Coffee: A novel method for multiple sequence alignments. J Mol Biol 2000, 302:205-207. 30. Eddy SR: Where did the BLOSUM62 alignment score matrix come from? Nature Biotechnology 2004, 22:1035-1036. 31. Harris P, Poulsen JN, Jensen K, Larsen S: Substrate binding induces domain movements in orotidine 5’-monophosphate decarboxylase. J Mol Biol 2002, 318:1019-1029. 32. Wu N, Mo Y, Gao J, Pai E: Structure and mechanism of the enzyme orotidine monophosphate decarboxylase. Proc Natl Acad Sci (USA) 2000, 97:2017-2022. 33. Wierenga RK: The TIM-barrel fold: A versatile framework for efficient enzymes. FEBS Lett 2001, 492:193-198. 34. Vedadi M, Lew J, Arz J, Amani M, Zhao Y, Dong A, Wasney G, Gao M, Hills T, Brokx S, et al: Genome-scale protein expression and structural biology of Plasmodium falciparum and related Apicomplexan organisms. Molecular and Biochemical Parasitology 2007, 151:100-110. 35. Appleby TC, Kinsland C, Begley TP, Ealick SE: The crystal structure and mechanism of orotidine 5’-monophosphate decarboxylase. Proc Natl Acad Sci USA 2000, 97:2005-2010. 36. Harris P, Navarro Poulsen JC, Jensen KF, Larsen S: Structural basis for the catalytic mechanism of a proficient enzyme: orotidine 5’- monophosphate decarboxylase. Biochemistry 2000, 39:4217-4224. 37. Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics 2000, 16:566-567. 38. Hahn M, Keitel T, Heinemann U: Crystal and molecular structure at 0.16- nm resolution of the hybrid Bacillus endo-1,3-1,4-beta-D-glucan 4- glucanohydrolase H(A16-M). Eur J Biochem 1995, 232:849-858. 39. 
Hakulinen N, Turunen O, Janis J, Leisola M, Rouvinen J: Three-dimensional structures of thermophilic beta-1,4-xylanases from Chaetomium thermophilum and Nonomuraea flexuosa. Eur J Biochem 2003, 270:1399-1412. 40. Muller-Newen G, Janssen U, Stoffel W: Enoyl-CoA hydratase and isomerase form a superfamily with a common active-site glutamate residue. Eur J Biochem 1995, 228:68-73. 41. Bell AF, Feng Y, Hofstein HA, Parikh S, Wu J, Rudolph MJ, Kisker C, Whitty A, Tonge PJ: Stereoselectivity of enoyl-CoA hydratase results from preferential activation of one of two bound substrate conformers. Chem Biol 2002, 9:1247-1255. 42. Bennett JP, Whittingham JL, Brzozowski AM, Leonard PM, Grogan G: Structural characterization of a beta-diketone hydrolase from the cyanobacterium Anabaena sp. PCC 7120 in native and product-bound forms, a coenzyme A-independent member of the crotonase suprafamily. Biochemistry 2007, 46:137-144. 43. Han GW, Ko J, Farr CL, Deller MC, Xu Q, Chiu H-J, Miller MD, Sefcikova J, Somarowthu S, Beuning PJ, et al: Crystal structure of a metal-dependent phosphoesterase (YP_910028.1) from Bifidobacterium adolescentis: Computational prediction and experimental validation of phosphoesterase activity. Proteins 2011, 79:2146-2160. doi:10.1186/1471-2105-14-S3-S13 Cite this article as: Wang et al.: Protein function annotation with Structurally Aligned Local Sites of Activity (SALSAs). BMC Bioinformatics 2013 14(Suppl 3):S13. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color figure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit

Journal

BMC Bioinformatics, Springer Journals

Published: Feb 28, 2013
