BIOINFORMATICS ORIGINAL PAPER, Vol. 23 no. 21 2007, pages 2843–2850
doi:10.1093/bioinformatics/btm475

Structural bioinformatics

PFRES: protein fold classification by using evolutionary information and predicted secondary structure

Ke Chen and Lukasz Kurgan*
Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada

Received on May 28, 2007; revised on August 27, 2007; accepted on September 17, 2007
Advance Access publication October 17, 2007
Associate Editor: Alfonso Valencia
*To whom correspondence should be addressed.

ABSTRACT

Motivation: The number of protein families has been estimated to be as small as 1000. A recent study shows that the growth in discovery of novel structures that are deposited into the PDB and the related rate of increase of SCOP categories are slowing down. This indicates that the protein structure space will soon be covered and thus we may be able to derive most of the remaining structures by using the known folding patterns. Present tertiary structure prediction methods behave well when a homologous structure is available, but give poorer results when no homologous templates are available. At the same time, some proteins that share twilight-zone sequence identity can form similar folds. Therefore, determination of structural similarity without sequence similarity would be beneficial for the prediction of tertiary structures.

Results: The proposed PFRES method for automated protein fold classification from low-identity (<35%) sequences obtains 66.4% and 68.4% accuracy for two test sets, respectively. PFRES obtains 6.3–12.4% higher accuracy than the existing methods, and its prediction accuracy is shown to be statistically significantly better than the accuracy of competing methods. Our method adopts a carefully designed, ensemble-based classifier and a novel, compact and custom-designed feature representation that includes nearly 90% fewer features than the representation of the most accurate competing method (36 versus 283). The proposed representation combines evolutionary information, by using the PSI-BLAST profile-based composition vector, with information extracted from the secondary structure predicted with PSI-PRED.

Availability: The method is freely available from the authors upon request.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
1 INTRODUCTION

Protein structures are being solved to answer key biological questions related to protein function, regulation and interactions. Outside of their biological context, the solved structures are increasingly useful for structure modeling/prediction for unsolved protein sequences that have a closely related (similar) sequence with known structure (Tress et al., 2005; Wang et al., 2005). Based on Chothia's estimate, which states that the number of different protein families is finite and perhaps as small as 1000 (Chothia, 1992), it seems feasible to derive most of the unsolved structures by homology modeling based on only a relatively small portion of the protein structures that are determined experimentally. This explains why novel structures are especially valuable. This fact also served as the basis of the Protein Structure Initiative that was initiated by the NIH in 1999 (Chandonia and Brenner, 2006); one of the aims of this project is to cover the structure space of proteins. These early findings are supported by a recent computational analysis of the Protein Data Bank, which showed that the growth of the structural data has slowed down and that the rate of increase of the related SCOP categories (including the number of families, superfamilies and folds) is also slowing down (Levitt, 2007). Homology modeling is based on the assumption that homologous sequences share similar folding patterns (Ruan et al., 2006; Zhang and Skolnick, 2005). At the same time, sequences with low sequence identity can also share similar folding patterns (Paiardini et al., 2004) and can be used to predict tertiary structure (Bujnicki, 2006). Sequence alignment software is an important tool for finding homologous sequences among the known structures (Altschul et al., 1997; Yu et al., 2006), but it is of no help when no homologous sequences are available. Research also shows that finding similar folding patterns among low-identity sequences is beneficial for reconstruction of the tertiary structure (Reinhardt and Eisenberg, 2004; Tomii et al., 2005).

A comprehensive and detailed description of the structural relationships between all solved proteins is provided in the SCOP (Structural Classification of Proteins) database (Andreeva et al., 2004; Murzin et al., 1995). This database implements a hierarchy of relations between known protein and protein domain structures. The classification on the first level of the hierarchy is commonly known as the protein structural class, while the second level classifies proteins into folds, which are the classification target in this article. Several machine-learning methods have been applied to detect structurally similar proteins (protein folds) from sequences that share low identity. Ding and Dubchak investigated support vector machines (SVM) and neural networks for protein fold classification (Ding and Dubchak, 2001). Shen and Chou studied ensemble models based on nearest neighbor classifiers (Shen and Chou, 2006). The prediction of the type of the protein fold for a given sequence is performed with an intermediate step that converts the sequence into a feature space representation. Several other ensemble models that applied the same feature space representation as the one proposed by Ding and Dubchak were also proposed (Bologna and Appel, 2002; Nanni, 2006; Okun, 2004). In these studies protein sequences were represented by the composition vector (CV), predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, polarizability and pseudo-amino acid composition. The fold classification success rate ranged between 56% and 62%. The dimensionality of the feature space was relatively high, i.e. 125 features were proposed by Ding and Dubchak and 283 features by Chou and Shen, when compared with the size of the dataset used in the experimental evaluation, i.e. 313 training and 385 test proteins.

To this end, we propose a novel fold classification method, called PFRES, that provides significantly better prediction accuracy when compared with the existing methods and that uses a small number of new and more effective features. The main source of the achieved improvement is attributed to the application of the PSI-BLAST profile (Altschul et al., 1997) based composition vector, which considers evolutionary information (Jones, 1999, 2007; Kim and Park, 2004), instead of the composition and pseudo-composition vectors that were used in the prior works. We also applied features generated from the secondary structure predicted with PSI-PRED (Jones, 1999), which are also shown to be beneficial in the context of fold classification. Finally, we note that PFRES, as well as all other relevant competing methods, addresses a simplified fold classification problem, i.e. it predicts 27 folds, due to the low counts of proteins that belong to the remaining folds.
2 MATERIALS AND METHODS

2.1 Datasets

The proposed method was designed on a training dataset with 313 domains proposed by Ding and Dubchak (Ding and Dubchak, 2001). The tests were performed on two datasets: test set 1, with 385 domains, was also taken from Ding and Dubchak (2001) and was used to perform the comparison with other existing methods; test set 2, with 908 domains, was included to provide a larger scale evaluation on more recently deposited domains and to assure that the proposed method does not overfit the first, small test set.

The follow-up study by Shen and Chou excluded two training domains (2SCMC and 2GPS) and two domains from test set 1 (2YHX_1 and 2YHX_2) due to lack of sequence records (Shen and Chou, 2006). We follow Shen and Chou's study and adopt the two datasets without these four sequences. The sequence identity for any pair of sequences in the training set is <35%. According to the dataset authors, the sequences in test set 1 share no more than 35% sequence identity with the sequences in the training set. We nevertheless found seven duplicates between these two sets, i.e. 1APLC, 3RUB2, 2REB1, 1DSBA2, 1GLCG1, 1GLCG2 and 1SLTA from the training set correspond to 1YRNB, 3RUBL2, 2REB_1, 1DSBA2, 1GLAG1, 1GLAG2 and 1SLAA from test set 1, respectively. We also found another 12 pairs that share over 50% identity. This redundancy may result in overestimated test results on test set 1, but at the same time it should not impact the ability to compare the relative differences between the prediction accuracies achieved by various methods on this test set.

The training and test set 1 sequences belong to the following 27 folds: (1) globin-like, (3) cytochrome c, (4) DNA-binding 3-helical bundle, (7) 4-helical up-and-down bundle, (9) 4-helical cytokines, (11) EF-hand, (20) immunoglobulin-like, (23) cupredoxins, (26) viral coat and capsid proteins, (30) conA-like lectin/glucanases, (31) SH3-like barrel, (32) OB-fold, (33) beta-trefoil, (35) trypsin-like serine proteases, (39) lipocalins, (46) (TIM)-barrel, (47) FAD (also NAD)-binding motif, (48) flavodoxin-like, (51) NAD(P)-binding Rossmann-fold, (54) P-loop, (57) thioredoxin-like, (59) ribonuclease H-like motif, (62) hydrolases, (69) periplasmic binding protein-like, (72) b-grasp, (87) ferredoxin-like and (110) small inhibitors, toxins and lectins. These 27 folds are the most populated in SCOP; each of them contains at least seven proteins. Based on the concept of the protein structural class proposed by Levitt and Chothia (Levitt and Chothia, 1976), folds 1–11 belong to the all-α structural class, folds 20–39 to the all-β class, folds 46–69 to the α/β class and folds 72–87 to the α+β class. The fold distribution can be found in Ding and Dubchak (2001) and these two datasets can be downloaded from the Supplementary Material in Shen and Chou (2006).

Test set 2 includes sequences that belong to the same 27 folds and that were deposited into the PDB between 2002 and 2004. The selected timeframe is a result of two factors: the newest version of SCOP assigned folds only for sequences deposited until January 2005, while the training set and test set 1 were generated before 2001 and we aimed to avoid overlap between these datasets. The sequences in test set 2 were filtered with CD-HIT (Li and Godzik, 2006) at 40% sequence identity. Next, the remaining sequences were aligned with the sequences in both the training set and test set 1 using the Smith–Waterman algorithm (Smith and Waterman, 1981). Only sequences that have <35% sequence identity with any sequence in these two sets were selected to form test set 2. The resulting 908 sequences are available from the authors upon request.
2.2 Feature space representation

2.2.1 PSI-BLAST profile-based composition vector

The composition vector (CV) is computed directly from the amino acid (AA) sequence (Chen et al., 2007; Chou, 2005). Given that the 20 AAs, which are ordered alphabetically (A, C, ..., W, Y), are represented as $AA_1, AA_2, \dots, AA_{19}, AA_{20}$, and the number of occurrences of $AA_i$ in the entire sequence is denoted as $n_i$, the composition vector is defined as:

$(n_1/L,\; n_2/L,\; \dots,\; n_{19}/L,\; n_{20}/L)$

where $L$ is the length of the sequence. This representation was used by the majority of the existing fold classification methods (Bologna and Appel, 2002; Ding and Dubchak, 2001; Nanni, 2006; Okun, 2004).

The new representation, which combines the PSI-BLAST profile and the concept of the composition vector, was developed for the proposed prediction method. The prior successful applications of the PSI-BLAST profile illustrate that the evolutionary information is more informative than the query sequence itself (Jones, 1999, 2007; Kim and Park, 2004). PSI-BLAST aligns a given query sequence to a database of sequences. Using multiple sequence alignment, PSI-BLAST counts the frequency of each AA at each position of the query sequence and generates a 20-dimensional vector of AA frequencies for each position in the query sequence. The generated PSI-BLAST profile can be used to identify key positions of conserved AAs and positions that undergo mutations. Our approach combines the composition vector of the entire sequence and the PSI-BLAST profile into the so-called PSI-BLAST profile-based composition vector (PCV). The PSI-BLAST profile is an $L \times 20$ matrix, denoted as $[a_{i,j}]$, where $i = 1, 2, \dots, L$ denotes the position in the query sequence and $j = 1, 2, \dots, 20$ denotes a given AA. After applying the substitution matrix and the log function, the $a_{i,j}$ values range between -9 and 11. The proposed representation is related to the calculation of the composition vector based on binary coding. The binary coding uses a 20-dimensional vector to encode each AA: $AA_i$ is encoded as $(0, 0, \dots, 0, 1, 0, \dots, 0, 0)$, where only the $i$th value is greater than 0. The binary coding matrix is denoted as $[b_{i,j}]$. The binary coding and PSI-BLAST profile matrices have the same dimensionality ($L \times 20$).

CV can be computed from the binary coding matrix in a straightforward way. For a given protein sequence $A_1 A_2 \dots A_L$:

$CV_i = \frac{1}{L}\sum_{k=1}^{L} b_{k,i}, \quad i = 1, 2, \dots, 20$

where $\{CV_i,\; i = 1, 2, \dots, 20\}$ is the 20-dimensional composition vector. PCV is calculated in a similar way. The only difference is that the binary coding matrix $[b_{i,j}]$ is replaced by the PSI-BLAST profile $[a_{i,j}]$. Therefore, PCV is defined as:

$PCV_i = \frac{1}{L}\sum_{k=1}^{L} a_{k,i}, \quad i = 1, 2, \dots, 20$

Since PSI-BLAST profile values can be negative, while the frequencies of AAs should not be negative, we redefine PCV as follows:

$PCV_i = \frac{1}{L}\sum_{k=1}^{L} \max(a_{k,i}, 0), \quad i = 1, 2, \dots, 20$

where the negative $a_{k,i}$ values are replaced by 0 and the 20-dimensional $\{PCV_i,\; i = 1, 2, \dots, 20\}$ vector corresponds to the PSI-BLAST profile-based composition vector.
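Both vectors reduce to column-wise averages of an L x 20 matrix. The following minimal sketch (not the authors' implementation) illustrates the two definitions, assuming the PSI-BLAST profile has already been parsed into an L x 20 NumPy array of log-odds scores; the function and variable names are ours.

```python
import numpy as np

AA_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 AAs in alphabetical order

def composition_vector(sequence: str) -> np.ndarray:
    """Classic CV: CV_i = n_i / L, the frequency of each AA in the sequence."""
    counts = np.array([sequence.count(aa) for aa in AA_ALPHABET], dtype=float)
    return counts / len(sequence)

def pcv(profile: np.ndarray) -> np.ndarray:
    """PSI-BLAST profile-based composition vector (PCV).

    `profile` is the L x 20 matrix [a_ij] of PSI-BLAST log-odds scores.
    Negative entries are clipped to 0 before the column-wise average,
    implementing PCV_i = (1/L) * sum_k max(a_ki, 0).
    """
    length = profile.shape[0]
    return np.clip(profile, 0.0, None).sum(axis=0) / length
```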
2.2.2 Secondary structure predicted with PSI-PRED

Predicted secondary structure is proven to be helpful in fold classification. The recently proposed fold classification studies (Ding and Dubchak, 2001; Shen and Chou, 2006) used secondary structure predicted with relatively older methods (Holley and Karplus, 1989; Qian and Sejnowski, 1988). In contrast, we use the more recent PSI-PRED method (Jones, 1999), which is shown to provide superior accuracy when compared with other state-of-the-art secondary structure prediction methods (Birzele and Kramer, 2006; Lin et al., 2005). We used PSI-PRED 2.5 with default parameters to predict the secondary structure from the protein sequences. The 3-state predictions (helix, strand and coil) are used to generate the features.

Secondary structure content (SSC). SSC is shown to improve classification accuracy for the related problem of structural class prediction (Kurgan and Chen, 2007). To this end, we introduce SSC that is calculated from the secondary structure predicted with PSI-PRED. Let us denote the AA sequence as $\{A_i,\; i = 1, 2, \dots, L\}$ and the predicted secondary structure as $\{S_i,\; i = 1, 2, \dots, L\}$. We count the occurrences of 'H', 'E' and 'C' predictions and denote the corresponding counts as $\mathrm{COUNT}_H$, $\mathrm{COUNT}_E$ and $\mathrm{COUNT}_C$, respectively. The SSC is defined as:

$\mathrm{Content}_{class} = \mathrm{COUNT}_{class} / L, \quad class \in \{\mathrm{H}, \mathrm{E}, \mathrm{C}\}$

Number of distinct secondary structure segments (DSSS). Although the secondary structure content reflects information about the secondary structure of the entire sequence, it does not provide information concerning individual secondary structure segments. At the same time, the size (length) of secondary structure segments is one of the deciding factors when it comes to the classification of structural classes and folds. To this end, we designed features that count the number of occurrences of distinct helix, strand and coil segments whose length (number of the corresponding AAs) is above a certain threshold. In this way short secondary structure segments, which possibly can be incorrectly predicted, are filtered out. We varied the threshold values between 2 and 9 for the strand and coil segments and between 3 and 9 for the helix segments and ran predictions using the SVM classifier. The corresponding results are shown in Figure 1. Based on the graph, the threshold to count helical segments equals 7; the thresholds for strand and coil segments equal 5 and 6, respectively. We note that the accuracies resulting from using different thresholds are relatively similar, i.e. within 1%, and thus the quality of the proposed method should not be sensitive to this parameter.

[Fig. 1. Optimization of segment length thresholds to define DSSS features. The plot shows the accuracy on test set 1 (y axis, between 62.5% and 65.0%) as a function of the threshold (x axis, 2 to 9) for the helix, strand and coil segments.]

Arrangement of DSSS. In some cases, structural folds cannot be distinguished based on the SSC and DSSS features. For instance, the α/β and α+β classes contain both α-helices and β-strands; the α/β class includes mainly parallel β-strands, while the α+β class mainly includes anti-parallel strands, which is related to the arrangement of the secondary structure segments, but not to the SSC or DSSS values. Therefore, we also designed another set of features that encode the arrangement of three neighboring secondary structure segments that meet the minimum threshold criteria set for the DSSS features. There are 27 possible segment arrangements, i.e. class-class-class where class = 'H', 'E' or 'C'. We count the corresponding number of occurrences for each arrangement. Finally, we also include the length of the sequence (L) as a feature.

Table 1 summarizes the features used in this article.

Table 1. Summary of the feature selection results

Feature set            Total number of features   Selected features
PCV                    20                         20
SSC                    3                          3
Number of DSSS         3                          2
Arrangement of DSSS    27                         10
Length                 1                          1
Total                  54                         36
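To make the three secondary structure feature groups concrete, the sketch below derives SSC, the DSSS counts and the DSSS arrangement counts from a 3-state prediction encoded as a string over {H, E, C}. Only the feature definitions and the thresholds (helix 7, strand 5, coil 6, from Figure 1) come from the text; the helper names are illustrative.

```python
from itertools import groupby

THRESHOLDS = {"H": 7, "E": 5, "C": 6}  # segment-length thresholds from Fig. 1

def ssc(ss: str) -> dict:
    """Secondary structure content: fraction of H/E/C predictions."""
    return {state: ss.count(state) / len(ss) for state in "HEC"}

def dsss_segments(ss: str) -> list:
    """Distinct secondary structure segments whose length meets the threshold."""
    segments = [(state, sum(1 for _ in run)) for state, run in groupby(ss)]
    return [(state, length) for state, length in segments
            if length >= THRESHOLDS[state]]

def dsss_counts(ss: str) -> dict:
    """Number of DSSS per state (3 features, 2 of which survive selection)."""
    kept = dsss_segments(ss)
    return {s: sum(1 for state, _ in kept if state == s) for s in "HEC"}

def arrangement_counts(ss: str) -> dict:
    """Counts of the 27 possible arrangements of three neighboring DSSS."""
    states = [state for state, _ in dsss_segments(ss)]
    counts = {a + b + c: 0 for a in "HEC" for b in "HEC" for c in "HEC"}
    for i in range(len(states) - 2):
        counts["".join(states[i:i + 3])] += 1
    return counts
```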
2.3 Feature selection

A feature selection method was used to reduce the dimensionality and potentially improve the prediction accuracy. An entropy-based feature selection method (Yu and Liu, 2003), which evaluates each feature by measuring the information gain with respect to the class (protein fold), was applied.

The entropy of a feature X is defined as:

$H(X) = -\sum_i P(x_i) \log_2 P(x_i)$

where $\{x_i\}$ is the set of values of X and $P(x_i)$ is the prior probability of $x_i$. The conditional entropy of X, given another feature Y (in our case the protein fold), is defined as:

$H(X|Y) = -\sum_j P(y_j) \sum_i P(x_i|y_j) \log_2 P(x_i|y_j)$

where $P(x_i|y_j)$ is the posterior probability of X given the value $y_j$ of Y. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called the information gain:

$IG(X|Y) = H(X) - H(X|Y)$

According to this measure, Y has a stronger correlation with X than with Z if $IG(X|Y) > IG(Z|Y)$. The feature selection was performed using 10-fold cross-validation on the training set. Among the original set of 54 features, the 36 with the best information gain values were selected; see Table 1.
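The information-gain ranking can be computed as sketched below; this is a minimal illustration, assuming feature values have already been discretized into bins (a step the paper does not detail), with illustrative function names.

```python
import math
from collections import Counter

def entropy(values) -> float:
    """H(X) = -sum_i P(x_i) log2 P(x_i) over discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(feature, folds) -> float:
    """IG(X|Y) = H(X) - H(X|Y), where Y is the protein fold label."""
    n = len(folds)
    h_x_given_y = 0.0
    for fold in set(folds):
        subset = [x for x, y in zip(feature, folds) if y == fold]
        h_x_given_y += (len(subset) / n) * entropy(subset)
    return entropy(feature) - h_x_given_y

def select_features(columns, folds, k=36):
    """Rank the candidate feature columns by IG and keep the top k."""
    ranked = sorted(range(len(columns)),
                    key=lambda i: information_gain(columns[i], folds),
                    reverse=True)
    return ranked[:k]
```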
2.4 Proposed prediction method

The proposed prediction method was designed and tested in two steps. First, we selected a set of best-performing classifiers among six state-of-the-art methods that include SVM (Keerthi et al., 2001), Multiple Logistic Regression (le Cessie and van Houwelingen, 1992), the instance learning-based Kstar (Cleary and Trigg, 1995) and IB1 (Aha and Kibler, 1991) algorithms, Naïve Bayes (John and Langley, 1995) and Random Forest (Breiman, 2001), when using the selected 36 features to represent sequences. Second, three different ensembles of the selected classifiers, including voting, grading and stacking (Seewald, 2002; Seewald and Fuernkranz, 2001), were tried and the best performing ensemble was used to implement our fold classification method. As a result, the voting-based ensemble, which combines the predictions from the three classifiers based on an unweighted average of the corresponding classification probability estimates, was selected. The architecture of the proposed PFRES method is shown in Figure 2. The classification algorithms used to develop and compare the proposed method were implemented in Weka (Witten and Frank, 2005).

[Fig. 2. Architecture of the proposed fold classification method. The sequence is processed with PSI-BLAST and PSI-PRED; the feature generation module produces the 36-feature representation; the fold types predicted by the SVM, Kstar and Random Forest classifiers are combined in a voting module that outputs the predicted fold type.]
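The voting module simply averages the per-fold probability estimates of the three base classifiers and predicts the fold with the highest mean. The original experiments used Weka; the sketch below is only an illustration of the unweighted-average voting logic, assuming fitted base classifiers that expose a scikit-learn-style predict_proba method.

```python
import numpy as np

class VotingEnsemble:
    """Unweighted soft voting over the probability estimates of base classifiers."""

    def __init__(self, classifiers):
        # e.g., fitted SVM, Kstar-like and Random Forest models
        self.classifiers = classifiers

    def predict(self, X):
        # Average the class-probability matrices and take the argmax per sample.
        probs = np.mean([clf.predict_proba(X) for clf in self.classifiers], axis=0)
        return np.argmax(probs, axis=1)
```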
3 RESULTS AND DISCUSSION

The experiments first report results related to the design of the proposed fold classification method. We also test and discuss the effectiveness of the individual feature sets from the proposed sequence representation. Finally, the results of our ensemble method are compared with the results of five competing methods.

For test set 1, the fold classification accuracies of the six classifiers, i.e. SVM, Multiple Logistic Regression, Kstar, IB1, Naïve Bayes and Random Forest, when using the selected 36 features to represent sequences, are shown in Table 2. Random Forest (with 250 trees) gives the highest accuracy, i.e. 66.8%, among the six classifiers. The two runner-up classifiers, SVM (with the RBF kernel, γ = 0.8, and complexity parameter C = 5.0) and Kstar (with global blend = 96), obtained 66.1% and 65.0% accuracy, respectively. The same classifiers were also evaluated on test set 2 by applying the same group of features and the same parameters. Random Forest again gives the highest accuracy, i.e. 63.3%, with the same two runner-up classifiers, SVM and Kstar, which obtained 62.4% and 62.7% accuracy, respectively. The accuracies on test set 2 are slightly lower than the accuracies on test set 1. The remaining three classifiers obtained accuracy that is 3–10% lower than the accuracy of the three best classifiers, and thus were not used to implement the proposed fold classification method.

Table 2. Comparison of prediction accuracies (%) between different classifiers for the proposed sequence representation that includes the selected 36 features

Fold     SVM    Kstar  RForest  IB1    NBayes  LogRegr  |  Grading  Voting  Stacking-C
1        100    100    100      100    100     100         100      100     100
3        100    100    100      88.9   100     100         100      100     100
4        45     45     70       40     65      50          55       60      60
7        100    62.5   100      87.5   100     75          100      75      100
9        100    88.9   88.9     88.9   100     100         88.9     88.9    88.9
11       77.8   66.7   66.7     66.7   66.7    66.7        66.7     66.7    66.7
20       75     84.1   77.3     65.9   52.3    65.9        86.4     81.8    79.5
23       33.3   16.7   33.3     25     33.3    41.7        25       33.3    33.3
26       84.6   100    92.3     84.6   92.3    84.6        100      92.3    92.3
30       66.7   66.7   83.3     66.7   50      66.7        66.7     66.7    83.3
31       50     62.5   37.5     37.5   50      62.5        50       62.5    37.5
32       52.6   47.4   52.6     47.4   63.2    21.1        52.6     52.6    63.2
33       100    75     50       100    100     100         75       75      75
35       25     50     50       50     25      25          50       50      50
39       100    100    100      100    100     100         100      100     100
46       64.6   66.7   66.7     62.5   39.6    47.9        68.8     68.8    66.7
47       83.3   91.7   83.3     83.3   83.3    75          83.3     91.7    91.7
48       30.8   30.8   46.2     38.5   53.8    30.8        38.5     46.2    38.5
51       59.3   70.4   59.3     70.4   48.1    29.6        63       66.7    66.7
54       50     41.7   33.3     41.7   41.7    41.7        41.7     33.3    33.3
57       37.5   50     37.5     37.5   37.5    62.5        50       50      37.5
59       66.7   58.3   66.7     66.7   58.3    58.3        58.3     66.7    66.7
62       57.1   57.1   42.9     85.7   42.9    57.1        42.9     57.1    57.1
69       50     25     50       50     50      25          50       50      50
72       25     25     25       25     25      25          25       25      25
87       55.6   40.7   51.9     33.3   51.9    33.3        48.1     51.9    51.9
110      96.3   88.9   96.3     88.9   92.6    66.7        96.3     96.3    96.3
Overall  66.1   65     66.8     62.1   60.3    55.1        67.6     68.4    68.1

Among the 27 folds, fold 1 and fold 39 are the easiest to classify, i.e. all six classifiers achieved 100% accuracy for these two folds. Folds 3, 7, 9, 26, 33, 47 and 110 are also relatively easy to classify, i.e. the average accuracy of the six classifiers for these folds is above 80%. The average prediction accuracy for the all-α structural class (folds 1–11) is 77.1%, for the all-β class (folds 20–39) 64.3%, for the α/β class (folds 46–69) 55.6% and for the α+β class (folds 72–87) 40%. The folds that belong to the all-α and all-β structural classes are easier to classify, while folds that belong to the α/β and α+β classes are more difficult to correctly recognize. This is expected, as the proposed features, and especially those based on the predicted secondary structure, should be able to successfully represent proteins that contain mainly α-helices and β-strands. At the same time, although still well performing, the proposed features are less efficient in capturing the long-range interactions that are characteristic for the formation of parallel and anti-parallel β-strands.
3.1 Effectiveness of PCV features

The PSI-BLAST profile-based composition vector (PCV), which is proposed in this article, was directly compared with the corresponding sequence-based composition vector (CV) representation that was used in Bologna and Appel (2002), Ding and Dubchak (2001), Nanni (2006) and Okun (2004). PCV and CV were compared based on the fold classification performed with the six classifiers. The prediction results are shown in Figure 3. The comparison shows the consistently superior quality of the PCV features, i.e. the results based on PCV features are at least 13% higher than the results from CV features for all six classifiers. For test set 1, the average accuracy when using PCV features is 54.8%, while for CV features it drops to 39.1%. For test set 2, the average accuracy when using PCV features is 46.3%, while for CV features it drops to 27.3%. This illustrates that sequential evolutionary information is critical for the successful classification of protein folds, even for sequences that share low sequence identity. The results also indicate that test set 2 is more challenging.

[Fig. 3. Comparison of prediction accuracies (y axis) between the PSI-BLAST profile-based composition vector and the sequence-based composition vector. The two feature sets were tested with the six classifiers (x axis) on both test sets.]

3.2 Effectiveness of features based on the predicted secondary structure

The features generated from the predicted secondary structure that were proposed in this article, which include SSC, the number of DSSS and the arrangement of DSSS, are also shown to contribute to the improved fold classification. We compared the prediction accuracy of the three best classifiers when using the PCV features with the accuracy when the features computed from the predicted secondary structure are added; see Figure 4. For test set 1, 55.4%, 59.5% and 59.3% accuracies were obtained for the Kstar, Random Forest and SVM classifiers, respectively, when using only PCV to represent sequences. After adding the SSC features, the accuracies increased to 57.7%, 62.4% and 60.1%. By adding the number of DSSS, the accuracies again increased to 61.4%, 65.8% and 64.8%. Finally, adding the features related to the arrangement of DSSS results in accuracies of 63.4%, 65.8% and 65.5%. Similar results were observed for test set 2. The accuracies of the Kstar, Random Forest and SVM classifiers equal 43.6%, 50.9% and 50% when using only PCV, 49.7%, 57.3% and 57.9% after adding the SSC features, 55.7%, 63% and 61.5% after adding the number of DSSS, and finally 61%, 63% and 62.6% after adding the features related to the arrangement of DSSS, respectively. These consistent improvements show that each of the proposed feature sets contributes to the overall accuracy and illustrate the importance of the secondary structure information with respect to the classification of protein folds.

[Fig. 4. Comparison of classification accuracy (y axis) obtained by using PCV features only; PCV and SSC features; PCV, SSC and number of DSSS features; and PCV, SSC, number of DSSS and arrangement of DSSS features. Results of the three best classifiers on both test sets (x axis) are shown.]
3.3 Comparison of ensemble models

Several prior works on protein fold classification applied ensemble models to improve prediction accuracy (Bologna and Appel, 2002; Nanni, 2006; Shen and Chou, 2006). The method by Shen and Chou (Shen and Chou, 2006) ensembles nine evidence-theoretic k-nearest neighbor classifiers that use different input feature sets. The ensemble proposed in Bologna and Appel (2002) applies four specialized neural networks that use different subsets of protein sequences from the training set. Finally, the ensemble developed by Nanni (Nanni, 2006) uses 27 k-local hyperplane-based nearest neighbor classifiers, each of which uses a different subset of the features among these proposed in Ding and Dubchak (2001). In contrast to the above methods, which ensemble the same type of classifier, our method ensembles three different classifiers that provide complementary predictions, i.e. SVM provides superior predictions for folds 9, 11, 33, 54 and 87; Kstar for folds 20, 26, 31, 47, 51 and 57; and Random Forest for folds 4, 30 and 48; see Table 2. Three methods for combining multiple classifiers, which include voting, grading and stacking, were compared on test set 1; see Table 2. All three ensembles are shown to provide better accuracies than the best single classifier, Random Forest. The proposed method adopts the best performing voting-based ensemble, which achieves 68.4% accuracy on test set 1. For test set 2, the same voting-based ensemble achieves 66.4% accuracy.

In the case of both test sets, folds 1, 3, 9, 20 and 110 were predicted with accuracy above 80%, while accuracy below 50% was recorded for folds 23, 35, 48 and 72. Results on both test sets show that the application of the ensemble model results in a 2–3% improvement in prediction accuracy over the prediction based on a single classifier. The lower prediction accuracy on test set 2 could be explained by the strict separation (up to 35% sequence similarity) between this test set and the training set. In contrast, test set 1 is shown to share some redundant and similar sequences with the training set. When these 19 sequences were removed from test set 1, PFRES obtains 67% accuracy on this set, which is only 0.6% higher than the accuracy on test set 2.

3.4 Comparison with competing prediction methods

The proposed PFRES method was compared on test set 1 with five recent methods that address the same task; see Table 3. Ding and Dubchak's method uses a representation with 125 features and SVM and neural networks as the classifiers (Ding and Dubchak, 2001). Okun's method uses the features proposed in Ding and Dubchak (2001) and the k-local hyperplane nearest neighbor classifier (Okun, 2004). Bologna and Appel's and Nanni's methods again use the same features and ensemble-based classifiers (Bologna and Appel, 2002; Nanni, 2006). Finally, the method by Shen and Chou uses a new representation that includes 283 features and an ensemble-based classifier; they substituted the composition vector from the feature set proposed by Ding and Dubchak with 178 features that implement the pseudo-amino acid composition (Shen and Chou, 2006). When compared with the competing methods, PFRES uses only 36 features, which is 70% fewer features than the representation applied in Ding and Dubchak (2001), Bologna and Appel (2002), Nanni (2006) and Okun (2004), and nearly 90% fewer features than the representation proposed in Shen and Chou (2006). Table 3 shows that PFRES provides 6.3–12.4% higher accuracy than the prior methods. When compared with the best performing competing method by Shen and Chou, prediction with PFRES results in a substantial 6.3/37.9 = 17% error rate reduction. PFRES provides superior accuracy for 13 out of 27 folds, while the method by Shen and Chou provides the best predictions for nine folds.

Table 3. Comparison of per-fold accuracies (%) between PFRES and the competing fold classification methods on test set 1

Fold     SVM(a)  HKNN(b)  DIMLP(c)  SE(d)  PFP(e)  PFRES
1        83.3    83.3     85.0      83.3   83.3    100
3        77.8    77.8     97.8      88.9   55.6    100
4        35.0    50.0     66.0      70.0   85.0    60.0
7        50.0    87.5     41.3      50.0   75.0    75.0
9        100     88.9     91.1      100    100     88.9
11       66.7    44.4     22.2      33.3   33.3    66.7
20       71.6    56.8     75.7      79.6   70.5    81.8
23       16.7    25.0     40.0      25.0   16.7    33.3
26       50.0    84.6     80.8      69.2   100     92.3
30       33.3    50.0     46.7      33.3   33.3    66.7
31       50.0    50.0     75.0      62.5   37.5    62.5
32       26.3    42.1     22.6      36.8   15.8    52.6
33       50.0    50.0     45.0      50.0   75.0    75.0
35       25.0    50.0     50.0      25.0   50.0    50.0
39       57.1    42.9     74.3      28.6   71.4    100
46       77.1    79.2     83.8      87.5   97.9    68.8
47       58.3    58.3     55.0      58.3   66.7    91.7
48       48.7    53.9     52.3      61.5   15.4    46.2
51       61.1    40.7     39.3      37.0   44.4    66.7
54       36.1    33.3     41.7      50.0   33.3    33.3
57       50.0    37.5     46.3      50.0   62.5    50.0
59       35.7    71.4     55.0      64.3   66.7    66.7
62       71.4    71.4     44.3      71.4   57.1    57.1
69       25.0    25.0     25.0      25.0   50.0    50.0
72       12.5    25.0     23.8      25.0   37.5    25.0
87       37.0    25.9     41.1      33.3   29.6    51.9
110      83.3    85.2     100       85.2   96.3    96.3
Overall  56.0    57.1     61.1      61.1   62.1    68.4

(a) Ding and Dubchak (2001); (b) Okun (2004); (c) Bologna and Appel (2002); (d) Nanni (2006); (e) Shen and Chou (2006).
The statistical significance of the differences between the accuracies obtained by the proposed and the competing methods over the 27 protein folds was investigated using the paired t-test. The corresponding t-values for the differences between PFRES and the PFP (Shen and Chou, 2006), SE (Nanni, 2006), DIMLP (Bologna and Appel, 2002), HKNN (Okun, 2004) and SVM (Ding and Dubchak, 2001) methods equal 2.44, 3.18, 2.82, 3.12 and 4.12, respectively. As the critical t-value for the standard 0.05 significance level equals 1.71, the test shows that the proposed method provides statistically significantly better predictions than the predictions of the five competing methods. We also note that the critical t-values for the stronger 0.01 and 0.005 significance levels equal 2.48 and 2.78, respectively.
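For reproducibility, the paired t-test over the 27 folds can be computed as sketched below. The per-fold accuracies are transcribed from Table 3 (PFRES versus PFP as an example), and scipy's two-sided p-value is halved because the 1.71 critical value corresponds to a one-sided test with 26 degrees of freedom; the variable names are ours.

```python
import numpy as np
from scipy import stats

# Per-fold accuracies from Table 3, folds ordered 1, 3, 4, ..., 110.
pfres = np.array([100, 100, 60.0, 75.0, 88.9, 66.7, 81.8, 33.3, 92.3, 66.7,
                  62.5, 52.6, 75.0, 50.0, 100, 68.8, 91.7, 46.2, 66.7, 33.3,
                  50.0, 66.7, 57.1, 50.0, 25.0, 51.9, 96.3])
pfp = np.array([83.3, 55.6, 85.0, 75.0, 100, 33.3, 70.5, 16.7, 100, 33.3,
                37.5, 15.8, 75.0, 50.0, 71.4, 97.9, 66.7, 15.4, 44.4, 33.3,
                62.5, 66.7, 57.1, 50.0, 37.5, 29.6, 96.3])

t_value, p_two_sided = stats.ttest_rel(pfres, pfp)  # paired t-test, 26 df
print(round(t_value, 2))   # ~2.44, matching the reported value
print(p_two_sided / 2)     # one-sided p-value (PFRES better, not just different)
```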
3.5 Impact of the quality of the secondary structure predicted by PSI-PRED

Since 15 features proposed in this article were generated from the secondary structure predicted by PSI-PRED, we further analyzed the impact of the quality of the predicted secondary structure on the accuracy of the fold classification. For test set 2, the average accuracy of the predicted secondary structure was 75.4%. We divided test set 2 into two subsets with sequences for which the secondary structure was predicted with accuracy below and above the average, correspondingly. PFRES was evaluated on each of these subsets independently; see Table 4. The prediction accuracy for the second subset was 67.3%, while for the first subset it was slightly lower and equal to 65.2%. As expected, higher quality of the predicted secondary structure results in higher accuracy of the fold classification. At the same time, this difference is relatively small, i.e. 2%, while the difference in the accuracy of the predicted secondary structure between these two subsets was much larger (over 13%, see Table 4). This shows that the proposed method provides relatively stable quality of predictions with respect to the quality of the predicted secondary structure. We also note that current secondary structure prediction methods achieve average accuracy close to 80%, e.g. the EVA server reports that PSI-PRED provides an average accuracy of 77.9% for 224 proteins (tested between April 2001 and September 2005), and Porter provides an average accuracy of 79.8% for 77 proteins (February 2005 to March 2006) (Eyrich et al., 2001). Since the average accuracy of the predicted secondary structure for sequences in test set 2 was 75.4%, we believe that the presented test results provide a reliable estimate of the future performance of the proposed method.

Table 4. Average accuracy of the predicted secondary structure and accuracy of the fold classification for two subsets of test set 2; subset 1 includes sequences for which the secondary structure was predicted with accuracy below 75.4%; subset 2 includes the remaining sequences

          Number of     Average accuracy of predicted   Accuracy of fold classification
          sequences     secondary structure (%)         with PFRES (%)
Subset 1  379           67.6                            65.2
Subset 2  529           81.1                            67.3
Total     908           75.4                            66.4

4 CONCLUSIONS

A high-quality predictor for protein fold classification would be beneficial for in silico prediction of the tertiary structure of proteins with low sequence identity, since it would allow for the determination of structural similarity without sequence similarity. To this end, we propose the PFRES method, which uses a novel protein sequence representation that consists of a small set of 36 features and applies a carefully designed ensemble classifier. The feature representation utilized by PFRES includes the PSI-BLAST profile-based composition vector, features based on the secondary structure predicted with PSI-PRED, and the sequence length. The experimental evaluation of the proposed fold classification method was performed with a standard benchmark dataset and another large set of over 900 sequences, both with chains with identity below 35% with respect to the training sequences. Using the benchmark set, PFRES is shown to predict the protein folds with 68.4% accuracy, which is over 6% higher than the accuracy of the best existing method. The results also show that the fold classification accuracy of the proposed method is statistically significantly better than the accuracy of all competing methods. Similar performance, i.e. 66.4%, was achieved by the proposed method on the second test set. At the same time, PFRES uses 70–90% fewer features to represent sequences when compared with the existing methods. The proposed PSI-BLAST profile-based composition vector, which embeds evolutionary information, was compared with the commonly used sequence-based composition vector. Our empirical tests with six machine-learning classifiers have shown that the PSI-BLAST profile-based composition vector is superior to the composition vector. The new representation can be extended to other protein prediction tasks that currently apply the AA composition, e.g. prediction of the structural class, secondary structure content, membrane protein type, enzyme family, etc., to improve their accuracy.

ACKNOWLEDGEMENTS

K.C.'s research was supported by the Alberta Ingenuity Scholarship and NSERC Canada. L.K. acknowledges support from NSERC Canada.

Conflict of Interest: none declared.
REFERENCES

Aha,D. and Kibler,D. (1991) Instance-based learning algorithms. Mach. Learn., 6, 37–66.
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Andreeva,A. et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res., 32, D226–D229.
Birzele,F. and Kramer,S. (2006) A new representation for protein secondary structure prediction based on frequent patterns. Bioinformatics, 22, 2628–2634.
Bologna,G. and Appel,R.D. (2002) A comparison study on protein fold recognition. In Proceedings of the 9th International Conference on Neural Information Processing, Vol. 5, pp. 2492–2496.
Breiman,L. (2001) Random forests. Mach. Learn., 45, 5–32.
Bujnicki,J.M. (2006) Protein structure prediction by recombination of fragments. ChemBioChem, 7, 19–27.
Chandonia,J.M. and Brenner,S.E. (2006) The impact of structural genomics: expectations and outcomes. Science, 311, 347–351.
Chen,K. et al. (2007) Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Struct. Biol., 7, 25.
Chou,K.C. (2005) Progress in protein structural class prediction and its impact to bioinformatics and proteomics. Curr. Protein Pept. Sci., 6, 423–436.
Chothia,C. (1992) Proteins. One thousand families for the molecular biologist. Nature, 357, 543–544.
Cleary,J.G. and Trigg,L.E. (1995) K*: an instance-based learner using an entropic distance measure. In Proceedings of the 12th International Conference on Machine Learning, pp. 108–114.
Ding,C.H. and Dubchak,I. (2001) Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics, 17, 349–358.
Eyrich,V.A. et al. (2001) EVA: continuous automatic evaluation of protein structure prediction servers. Bioinformatics, 17, 1242–1243.
Holley,L.H. and Karplus,M. (1989) Protein secondary structure prediction with a neural network. Proc. Natl Acad. Sci. USA, 86, 152–156.
John,G.H. and Langley,P. (1995) Estimating continuous distributions in Bayesian classifiers. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence, pp. 338–345.
Jones,D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195–202.
Jones,D.T. (2007) Improving the accuracy of transmembrane protein topology prediction using evolutionary information. Bioinformatics, 23, 538–544.
Keerthi,S.S. et al. (2001) Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput., 13, 637–649.
Kim,H. and Park,H. (2004) Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3D local descriptor. Proteins, 54, 557–562.
Kurgan,L. and Chen,K. (2007) Prediction of protein structural class for the twilight zone sequences. Biochem. Biophys. Res. Commun., 357, 453–460.
le Cessie,S. and van Houwelingen,J.C. (1992) Ridge estimators in logistic regression. Appl. Stat., 41, 191–201.
Levitt,M. (2007) Growth of novel protein structural data. Proc. Natl Acad. Sci. USA, 104, 3183–3188.
Levitt,M. and Chothia,C. (1976) Structural patterns in globular proteins. Nature, 261, 552–558.
Li,W. and Godzik,A. (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22, 1658–1659.
Lin,K. et al. (2005) A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics, 21, 152–159.
Murzin,A.G. et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.
Nanni,L. (2006) A novel ensemble of classifiers for protein fold recognition. Neurocomputing, 69, 2434–2437.
Okun,O. (2004) Protein fold recognition with K-local hyperplane distance nearest neighbor algorithm. In Proceedings of the 2nd European Workshop on Data Mining and Text Mining in Bioinformatics, Vol. 1, pp. 51–57.
Paiardini,A. et al. (2004) Evolutionarily conserved regions and hydrophobic contacts at the superfamily level: the case of the fold-type I, pyridoxal-5′-phosphate-dependent enzymes. Protein Sci., 13, 2992–3005.
Qian,N. and Sejnowski,T.J. (1988) Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol., 202, 865–884.
Reinhardt,A. and Eisenberg,D. (2004) DPANN: improved sequence to structure alignments following fold recognition. Proteins, 56, 528–538.
Ruan,J. et al. (2006) Quantitative analysis of the conservation of the tertiary structure of protein segments. Protein J., 25, 301–315.
Seewald,A.K. (2002) How to make stacking better and faster while also taking care of an unknown weakness. In Proceedings of the 19th International Conference on Machine Learning, pp. 554–561.
Seewald,A.K. and Fuernkranz,J. (2001) An evaluation of grading classifiers. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis, pp. 115–124.
Shen,H.B. and Chou,K.C. (2006) Ensemble classifier for protein fold pattern recognition. Bioinformatics, 22, 1717–1722.
Smith,T.F. and Waterman,M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
Tomii,K. et al. (2005) Protein structure prediction using a variety of profile libraries and 3D verification. Proteins, 61 (Suppl. 7), 114–121.
Tress,M. et al. (2005) Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins, 61 (Suppl. 7), 27–45.
Wang,G. et al. (2005) Assessment of fold recognition predictions in CASP6. Proteins, 61 (Suppl. 7), 46–66.
Witten,I. and Frank,E. (2005) Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco.
Yu,L. and Liu,H. (2003) Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proceedings of the 20th International Conference on Machine Learning, pp. 856–863.
Yu,Y.K. et al. (2006) Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches. Nucleic Acids Res., 34, 5966–5973.
Zhang,Y. and Skolnick,J. (2005) The protein structure prediction problem could be solved using the current PDB library. Proc. Natl Acad. Sci. USA, 102, 1029–1034.