REFERENCES
A. Krogh, B. Larsson, G. von Heijne and E. Sonnhammer (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305(3).
A. Reinhardt and T. Hubbard (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Research, 26(9).
R. Kohavi and G. John (1997) Wrappers for feature subset selection. Artificial Intelligence, 97.
P. Horton and K. Nakai (1997) Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5.
O. Emanuelsson, H. Nielsen, S. Brunak and G. von Heijne (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology, 300(4).
D. Altman and J. Bland (1994) Statistics Notes: Diagnostic tests 1: sensitivity and specificity. BMJ, 308.
K. Nakai (2000) PSORT II Users' Manual. http://psort.nibb.ac.jp/helpwww2.html
C.J. van Rijsbergen (1979) Information Retrieval. Butterworths, London (UK). http://www.dcs.gla.ac.uk/Keith/Preface.html
R. Mott, J. Schultz, P. Bork and C. Ponting (2002) Predicting protein cellular localization using a domain projection method. Genome Research, 12(8).
K. Nakai and M. Kanehisa (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14.
S. Hua and Z. Sun (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17(8).
EBI (2003) European Bioinformatics Institute. http://www.ebi.ac.uk/genomes/
J. Gardy, C. Spencer, K. Wang, M. Ester, G. Tusnády, I. Simon, S. Hua, K. de Fays, C. Lambert, K. Nakai and F. Brinkman (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Research, 31(13).
O. Emanuelsson (2002) Predicting protein subcellular localisation from amino acid sequence information. Briefings in Bioinformatics, 3(4).
D. Szafron, R. Greiner, P. Lu, D. Wishart, C. Macdonell, J. Anvik, B. Poulin, Z. Lu and R. Eisner (2003) Explaining Naive Bayes Classifications.
J. Schultz, R. Copley, T. Doerks, C. Ponting and P. Bork (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Research, 28(1).
R. Duda and P. Hart (1973) Pattern Classification and Scene Analysis.
M. Sahami (1998) Using Machine Learning to Improve Information Access.
D. Szafron, P. Lu, R. Greiner, D. Wishart, Z. Lu, B. Poulin, R. Eisner, J. Anvik, C. Macdonell and B. Habibi-Nazhad (2003) Proteome Analyst - Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors.
R. Nair and B. Rost (2002) Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18(Suppl. 1).
D. Jurafsky and J. Martin (2000) Speech and Language Processing.
Vol. 20 no. 4 2004, pages 547-556
BIOINFORMATICS
DOI: 10.1093/bioinformatics/btg447

Predicting subcellular localization of proteins using machine-learned classifiers

Z. Lu, D. Szafron, R. Greiner, P. Lu, D.S. Wishart, B. Poulin, J. Anvik, C. Macdonell and R. Eisner
Department of Computing Science, University of Alberta, Edmonton, AB, Canada, T6G 2E8

Received on August 26, 2003; accepted on September 25, 2003
Advance Access publication January 22, 2004

ABSTRACT
Motivation: Identifying the destination or localization of proteins is key to understanding their function and facilitating their purification. A number of existing computational prediction methods are based on sequence analysis. However, these methods are limited in scope, accuracy and most particularly breadth of coverage. Rather than using sequence information alone, we have explored the use of database text annotations from homologs and machine learning to substantially improve the prediction of subcellular location.
Results: We have constructed five machine-learning classifiers for predicting subcellular localization of proteins from animals, plants, fungi, Gram-negative bacteria and Gram-positive bacteria, which are 81% accurate for fungi and 92-94% accurate for the other four categories. These are the most accurate subcellular predictors across the widest set of organisms ever published. Our predictors are part of the Proteome Analyst web-service.
Availability: http://www.cs.ualberta.ca/~bioinfo/PA/Sub, http://www.cs.ualberta.ca/~bioinfo/PA
Contact: [email protected]
Supplementary information: http://www.cs.ualberta.ca/~bioinfo/PA/Subcellular

INTRODUCTION
High-throughput sequencing technology has made it possible for many laboratories to sequence the genomes of new organisms. There are more than 1200 genome sequences deposited in public databases (EBI, 2003, http://www.ebi.ac.uk/genomes/). Given the size and complexity of these datasets, most researchers are compelled to use automated annotation systems to identify or classify individual genes and proteins. As part of this annotation process, a number of systems have been developed that support automated prediction of subcellular localization, based on amino acid sequence information. There are three basic approaches. One approach is based on amino acid composition, using artificial neural nets (ANN) such as NNPSL (Reinhardt and Hubbard, 1998), or support vector machines (SVM) like SubLoc (Hua and Sun, 2001, http://www.bioinfo.tsinghua.edu.cn/SubLoc/). A second approach uses the existence of peptide signals, which are short sub-sequences of ~3-70 amino acids, to predict specific cell locations, such as TargetP (Emanuelsson et al., 2000). A third approach, such as the one used in LOCkey (Nair and Rost, 2002), is to do a similarity search on the sequence, extract text from homologs and use a classifier on the text features. Some tools, like PSORT (Nakai and Kanehisa, 1992; Horton and Nakai, 1997, http://psort.nibb.ac.jp/), combine a variety of individual predictors. Many tools, like SubLoc, PSORT and TMHMM (Krogh et al., 2001, http://www.cbs.dtu.dk/services/TMHMM/), are available for public use on the web. Unfortunately, most tools accept only a single sequence at a time, with TMHMM being a notable exception. Emanuelsson (2002) provides a good survey of these tools.

Better accuracy and coverage are needed
There are two limitations to current techniques. The first is the limited accuracy of the predictors, especially for some organelles. The second is limited coverage. The term coverage can be used in three ways: location coverage, sequence coverage and taxonomic coverage. All three kinds of coverage are limited in current tools.

First, location coverage defines the sub-regions (nuclear, cytoplasmic, extracellular, etc.) in the cell that are supported by a predictor. Most existing tools limit the location coverage to just membranes or just a few organelles.
Second, given a training/test set, sequence coverage is defined as the ratio of sequences for which a prediction is made to the total number of sequences of interest. For example, the LOCkey dataset consists of 3146 labeled sequences from Swiss-Prot and the predictor obtained an accuracy of 0.87 on a subset of 1161 sequences (coverage = 0.37). Sequence coverage can be measured on one organism (1-organism sequence coverage) or multiple organisms. The 1-organism measure is important for high-throughput prediction of newly sequenced organisms.

Third, taxonomic coverage measures the range of organisms for the predictor, such as: animal, green plant, Gram-negative bacteria (GN), etc. Most existing predictors have only been evaluated on a limited number of sequences from a specific taxonomic category of organism (e.g. just GN bacteria or just green plants).

Table 1 lists some predictors and gives a measure of accuracy and the kind of technique employed. It also provides an informal indication of combined sequence coverage and taxonomic coverage. Unfortunately, no standardized sequence coverage ratios have been published for these predictors.

Table 1. Accuracies (Acc.) and informal sequence/taxonomic coverage of current subcellular localization predictors

Name              Acc.   Coverage            Technique
PSORT-B           0.75   1443 GN bacterial   Combination
LOCkey            0.87   1161 assorted       Homology
SubLoc            0.91   291 prokaryotic     AA composition
                  0.79   2427 eukaryotic
TargetP           0.85   940 plant           Signal prediction
                  0.90   2738 non-plant
Proteome Analyst  0.93   16 284 animal       Homology and ML
                  0.93   3420 plant
                  0.81   2104 fungal
                  0.92   3218 GN bacterial
                  0.94   1571 GP bacterial

*To whom correspondence should be addressed.
Bioinformatics 20(4) © Oxford University Press 2004; all rights reserved.
Using classifiers for prediction
This paper describes a novel classification technique for predicting subcellular localization (Lu, 2003). This technique is used in our publicly available web-based Proteome Analyst (PA). Two tools are available for subcellular localization: a simple tool (PA-SUB) that only predicts subcellular localization (http://www.cs.ualberta.ca/~bioinfo/PA/Sub) and a more comprehensive tool that predicts subcellular localization along with other annotations, including general function (http://www.cs.ualberta.ca/~bioinfo/PA). The second tool also allows a user to build a custom classifier from custom training data.

A controlled vocabulary or ontology is required for subcellular localization. In fact, since cell structure varies across organisms, several ontologies are required and PA supports five: animal, plant, fungi, GN bacteria and Gram-positive (GP) bacteria, which are based on the PSORT ontologies. Among them, PSORT (bacteria/plants), PSORT II (animals/yeast) (Nakai, 2000, http://psort.nibb.ac.jp/helpwww2.html) and PSORT-B (GN bacteria) provide a set of predictors over the same classes of organisms as PA. However, PSORT and PSORT II are older systems with poor accuracy, whereas PSORT-B is a newer system with much better accuracy (Gardy et al., 2003).

In general, a classifier takes a query instance, described by a set of feature-value pairs, and returns one of a fixed number of labels (Mitchell, 1997). In PA, each query instance is a primary sequence that is BLASTed against the Swiss-Prot database to obtain a set of homologs. Each feature of the query instance is a Boolean value corresponding to the presence or absence of a token (word or phrase) from certain fields of the homologous sequences' Swiss-Prot database entries.

We use a machine-learning (ML) algorithm to learn a mapping from the features of a query instance to the appropriate subcellular localization label for that instance. A common technique is to apply a ML algorithm to a set of labeled training items to produce a classifier. In our case, each training item consists of a primary protein sequence and the ontological label it has been assigned by an expert. Each training instance is first BLASTed against Swiss-Prot to identify its features in the same manner as query instances. Features are not provided in the training set; they are computed automatically from Swiss-Prot data.

In this paper, we use three different sources for labeled training data: Swiss-Prot database entries that have unambiguous subcellular localization annotations (26 458 sequences), a subset of the Swiss-Prot database developed for LOCkey (3146 sequences) and the set of GN bacteria sequences (1443) used in PSORT-B. These three datasets are used to evaluate the PA classifiers. However, a PA user can also create a custom subcellular localization classifier using custom training data, by simply uploading a file of labeled training sequences (Szafron et al., 2003b). No programming is required.

In the context of PA, transparency is the ability to provide formally-sound and intuitively-simple reasons for each prediction (Szafron et al., 2003a). PA bases its predictions on well-understood concepts of conditional probabilities. Its explanations are presented as stacked bar-graphs that clearly display the evidence for each prediction.

Contributions
This paper describes a new subcellular localization prediction technique that makes the following scientific contributions:
(1) This new ML technique makes the most accurate subcellular localization predictions over the broadest range of organisms (animals, plants, fungi, GN bacteria and GP bacteria) of all subcellular localization prediction techniques published to date.
(2) This technique is publicly available as a high-throughput web-based tool in PA.
(3) Proteome Analyst provides the first explanation facility for subcellular localization predictions.
(4) Proteome Analyst can be used to easily create new subcellular classifiers using custom training data, without any programming.

SYSTEMS AND METHODS
The prediction process
Proteome Analyst predicts the subcellular localization of a query protein sequence using its primary sequence and the organism taxonomic category: animal, plant, fungi, GN bacteria, GP bacteria. Here is the five-step prediction process used by PA.
P1. The primary sequence of the query protein is BLASTed against the Swiss-Prot database and a set of homologous sequences is selected.
P2. Potential features are computed by extracting text from the Swiss-Prot records of the best homologs. A feature has the value true if a token representing that feature is extracted and false if no such token is extracted.
P3. The user-provided taxonomic organism category is used to select one of five pre-built Naïve Bayes (NB) classifiers (Duda and Hart, 1973): animal, plant, fungi, GN bacteria, GP bacteria.
P4. The features are used by the appropriate classifier to compute the probability of each label in the ontology of that classifier. The label with the highest probability is considered the primary location for the protein.
P5. The user can view a graphical explanation of the prediction (Szafron et al., 2003a).

[Fig. 1. The Swiss-Prot homologs of EXOA_STRPN from BLAST.]
[Fig. 2. The features for EXOA_STRPN extracted by PA.]
[Fig. 3. Proteome Analyst predicted subcellular locations for EXOA_STRPN.]
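The core of steps P2-P4 can be sketched as follows. This is an illustrative sketch only: the token names, prior probabilities and conditional probabilities are invented for the example, and the function names are ours, not PA's.

```python
import math

# Sketch of steps P2-P4 (BLAST and text extraction omitted): a Naive Bayes
# classifier turns Boolean token features into a probability for each
# location label. All model numbers below are invented for illustration.

def nb_posteriors(features, priors, cond):
    """features: {token: bool}; priors: P(label); cond: P(token=true | label)."""
    log_scores = {}
    for label, prior in priors.items():
        logp = math.log(prior)
        for token, present in features.items():
            p_true = cond[label][token]
            logp += math.log(p_true if present else 1.0 - p_true)
        log_scores[label] = logp
    # convert log-scores into normalized posteriors
    m = max(log_scores.values())
    unnorm = {lab: math.exp(s - m) for lab, s in log_scores.items()}
    z = sum(unnorm.values())
    return {lab: v / z for lab, v in unnorm.items()}

priors = {"cytoplasmic": 0.6, "membrane": 0.4}
cond = {
    "cytoplasmic": {"transmembrane": 0.05, "hydrolase": 0.50},
    "membrane":    {"transmembrane": 0.90, "hydrolase": 0.30},
}
post = nb_posteriors({"transmembrane": True, "hydrolase": False}, priors, cond)
best = max(post, key=post.get)   # the primary location, as in step P4
```

In PA itself, the conditional probabilities come from the sufficient statistics gathered during training (see 'Building a classifier' below), not from hand-set numbers.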
We use the GP bacterial protein Exodeoxyribonuclease from Streptococcus pneumoniae (EXOA_STRPN) as an example. If this organism was newly sequenced, its proteins would not appear in Swiss-Prot. Therefore, we removed all EXOA_STRPN entries from our Swiss-Prot database for this demonstration. We experimented with many variations of steps P1 (homolog selection) and P2 (feature extraction), as described in the Discussion section. In this section, we describe only the best configuration. We select up to three homologs with the lowest BLAST E-values that are less than 0.001.

Figure 1 shows three homologs of our query protein sequence. For feature selection, we obtained the best results using phrases extracted from selected fields of the Swiss-Prot homologs. Specifically, we extracted each semi-colon delimited phrase from the Swiss-Prot KEYWORD field of each selected homolog, as well as all InterPro numbers from the DBSOURCE field. Finally, we checked for the inclusion of a pre-defined set of phrases in the SUBCELLULAR LOCALIZATION sub-field of the COMMENT field. For ease of reference in this paper, we will denote these fields by: KWORD, IPR and SCELL, respectively. This set of phrases forms the potential feature set. The Discussion section describes alternative feature definition strategies that produced less accurate classifiers.

After computing the potential feature set, we remove all ubiquitous phrases, like 'complete proteome', that are contained in a stop-word list (van Rijsbergen, 1979, http://www.dcs.gla.ac.uk/Keith/Preface.html). For example, Figure 2 shows the potential feature set for the demonstration query sequence (EXOA_STRPN) that was extracted from the top three homologs. These features appear under the heading 'Unique Tokens Extracted for Protein #6'.

Our classifiers remove other poorly discriminating features as well. When PA builds a classifier, it actually learns the best set of features to use. This process of feature selection is a standard ML technique for improving accuracy (Kohavi and John, 1997). In fact, the five classifiers (animal, plant, fungi, GN bacteria and GP bacteria) use different machine-learned feature sets. Figure 2 shows the features that were actually used by the GP bacteria classifier to classify the demonstration sequence (EXOA_STRPN). They appear under the heading 'Relevant Tokens for Protein #6'. For example, the features ipr003034 and polymorphism appear in the 'Unique Tokens' list, but are not used by the classifier, so they are not in the 'Relevant Tokens' list.

Proteome Analyst uses a NB classifier, which generates a probability for each label. Figure 3 shows the probabilities of each of the GP bacteria labels for the demonstration sequence (EXOA_STRPN) as shown in PA.

Building a classifier
A classifier must be trained (built) before it can be used. PA uses labeled training data to build a simple NB classifier using these basic steps:
B1. Each labeled training instance consists of a primary sequence and a label from the ontology of the classifier being built.
B2. The primary sequence of each training instance is run through steps P1 and P2 described in the previous section to produce a set of potential features.
B3. A set of sufficient statistics, c+ij and c-ij, is computed for the set of training instances, where c+ij is the number of training sequences with label j in which feature Fi = true and c-ij is the number of training sequences with label j in which Fi = false.
B4. A NB classifier is built using these sufficient statistics.

In fact, as mentioned earlier, we modify this basic process by using feature selection (Kohavi and John, 1997) to improve the accuracy. After building and computing the accuracy using all the potential features, we remove 5% of the features that have the lowest information content. The information content (information gain) of a feature is a measure of the amount that a feature contributes to classifications in general (Mitchell, 1997). For example, if a feature appears in every training instance, it is useless in discriminating between labels and its information content is zero. On the other hand, a feature that appears in all training instances that have a single label and no training instances with any other labels is very good for discriminating the one label. Therefore, it has high information content. After removing this 5% of low information content features, we build a second classifier, and measure its accuracy. We then remove another 5% of low information content features and continue in this way until we have computed the accuracy of 20 different classifiers with 0%, 5%, 10%, ..., 95% of the original features removed. We identify the threshold that produced the classifier with the highest accuracy. The most accurate classifiers for subcellular localization typically had 75-80% of the least discriminating features removed.

Table 2. Confusion matrix for the PA GP classifier, trained on Swiss-Prot data

Actual       Predicted label
             cyt    wal    mem    ext    N.P.    ASum    cov      Sensitivity
cyt          881    0      13     26     10      930     0.989    0.947
wal          1      16     0      1      1       19      0.947    0.842
mem          8      1      291    17     23      340     0.932    0.856
ext          4      2      8      217    21      252     0.917    0.861
PSum         894    19     312    261    55      1541    0.964    R = 0.912
Precision    0.985  0.842  0.933  0.831                           P = 0.945
Specificity  0.979  0.998  0.983  0.966                           S = 0.978

The ontological labels are: cyt(oplasmic), (cell) wal(l), mem(brane) and ext(racellular). N.P. represents no prediction; ASum and PSum are the sums of the actual and predicted labels, respectively; cov is sequence coverage. Relative to the mem(brane) label, 291 is the TP count, the other entries in the mem column (13, 0, 8) are FP counts, the other entries in the mem row (8, 1, 17, 23) are FN counts and the remaining entries are TN counts; these counts are used in the text for illustration. R, P and S denote overall sensitivity (recall), precision and specificity, respectively.
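The feature-selection filter described above can be sketched as follows. The function and variable names are ours, not PA's, and the gain computation is the standard information-gain formula for a Boolean feature.

```python
import math

# Sketch of the iterative filter: rank features by information gain, then
# test classifiers with 0%, 5%, ..., 95% of the lowest-gain features
# removed and keep the best-scoring subset. Illustrative only.

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(label_counts, branch_counts):
    """label_counts: per-label totals; branch_counts: per-label counts for
    the feature=true branch and the feature=false branch."""
    total = sum(label_counts)
    gain = entropy(label_counts)
    for branch in branch_counts:
        weight = sum(branch) / total
        if weight:
            gain -= weight * entropy(branch)
    return gain

def best_subset(ranked, accuracy_of, step=0.05, trials=20):
    """ranked: features sorted by ascending gain; accuracy_of: callback that
    trains and evaluates a classifier on a candidate feature subset."""
    best, best_acc = ranked, float("-inf")
    for i in range(trials):
        kept = ranked[int(len(ranked) * step * i):]
        acc = accuracy_of(kept)
        if acc > best_acc:
            best, best_acc = kept, acc
    return best

# a perfectly discriminating feature has gain equal to the label entropy
g = information_gain([5, 5], [[5, 0], [0, 5]])   # 1.0 bit
# a feature present in every instance carries no information
z = information_gain([5, 5], [[5, 5], [0, 0]])   # 0.0 bits
```

With `step=0.05` and `trials=20`, `best_subset` mirrors the 0%, 5%, ..., 95% removal schedule in the text.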
Classifier evaluation
To compare classifiers, it is important to define the evaluation criteria precisely. Most techniques start with a confusion matrix or contingency table (van Rijsbergen, 1979). Table 2 shows the confusion matrix for the PA GP bacteria classifier trained on Swiss-Prot data.

We will use Table 2 to illustrate our evaluation techniques. Each entry in Table 2 represents the number of sequences in the test set whose actual label is the row label and whose predicted label is the column label. For example, the number of sequences with actual label mem(brane) that were incorrectly predicted as ext(racellular) is 17. The ASum column indicates the number of test sequences whose actual label is specified by the row label. For example, 340 sequences were actually labeled mem(brane). The PSum row indicates the number of test sequences whose predicted label is specified by the column label. For example, 312 sequences had the predicted label membrane.

Various statistics can be computed from a confusion matrix to evaluate a classifier. In this paper we will use four standard statistics: specificity, precision, sensitivity and recall (the last two are identical). Given a confusion matrix M and a set of labels {Li}, the standard definitions (van Rijsbergen, 1979; Altman and Bland, 1994) of these statistics are as follows.

The precision for each label Li is Pi, defined by:

    Pi = TP / (TP + FP) = Mii / (Σ_{k=1..n} Mki) = Mii / PSumi

Here, n is the number of labels, true positives (TP) is the number of sequences correctly predicted as Li that were actually labeled Li, and false positives (FP) is the number of sequences incorrectly predicted as Li that were actually not labeled Li. For example, consider the label mem(brane) in Table 2, where the FP entries in the mem column must be summed. From Table 2, we have TP = 291 and FP = 13 + 0 + 8 = 21. Therefore, the precision for membrane is: P(mem) = 291/(291 + 21) = 291/312 = 0.933.

The specificity for each label Li is Si, defined by:

    Si = TN / (TN + FP) = (sum - ASumi - PSumi + Mii) / (sum - ASumi)

Here, true negatives (TN) is the number of sequences correctly predicted as not Li that were actually not labeled Li, and sum is the total number of sequences (1541 in Table 2). For example, for the label mem(brane) in Table 2, the TN entries (those outside both the mem row and the mem column) must be summed. We have TN = 881 + 0 + 26 + 10 + 1 + 16 + 1 + 1 + 4 + 2 + 217 + 21 = 1180 and FP = 13 + 0 + 8 = 21. The specificity of label mem(brane) is: S(mem) = 1180/(1180 + 21) = 1180/1201 = 0.983.

The sensitivity or recall for each label Li is Ri, defined by:

    Ri = TP / (TP + FN) = Mii / (Σ_{j=1..n+1} Mij) = Mii / ASumi

Here, false negatives (FN) is the number of sequences incorrectly predicted as not Li that were actually labeled Li. For example, consider the label mem(brane) in Table 2.
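The worked membrane example can be checked mechanically. The snippet below recomputes precision, specificity and sensitivity (recall) from the Table 2 counts; the helper name is ours.

```python
# Recomputing the mem(brane) statistics from the Table 2 confusion matrix
# (rows = actual cyt, wal, mem, ext; columns = predicted cyt, wal, mem,
# ext, N.P.).

M = [
    [881, 0, 13, 26, 10],   # cyt
    [1, 16, 0, 1, 1],       # wal
    [8, 1, 291, 17, 23],    # mem
    [4, 2, 8, 217, 21],     # ext
]

def label_stats(M, i):
    """Precision, specificity and sensitivity for label i (row/column i)."""
    total = sum(sum(row) for row in M)
    tp = M[i][i]
    psum = sum(row[i] for row in M)   # predicted as label i
    asum = sum(M[i])                  # actually label i (includes N.P.)
    fp = psum - tp
    fn = asum - tp                    # the N.P. column counts as FN
    tn = total - asum - psum + tp
    return tp / (tp + fp), tn / (tn + fp), tp / (tp + fn)

prec, spec, sens = label_stats(M, 2)  # mem(brane)
# matches the worked example: 0.933, 0.983 and 0.856 (rounded)
```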
The FN entries in the mem row (including the N.P. column) must be summed. From Table 2, we have TP = 291 and FN = 8 + 1 + 17 + 23 = 49. Note that the FN number includes the no prediction (N.P.) column as well. Therefore, the sensitivity (recall) of label mem(brane) is: R(mem) = 291/(291 + 49) = 291/340 = 0.856.

The precision and specificity statistics favor conservative predictors that make no prediction when there is doubt about the correctness of a prediction, while the sensitivity (recall) statistic favors liberal predictors that make a prediction if there is a chance of success. For example, if two predictions are changed from 'no prediction' to a prediction, where one is correct and the other is incorrect, then TP increases by 1, FP increases by 1, TN decreases by 1 and FN decreases by 1. Therefore, the precision and specificity numbers both decrease, but the sensitivity (recall) increases:

    P' = (TP + 1) / ((TP + 1) + (FP + 1)) < TP / (TP + FP)
    S' = (TN - 1) / ((TN - 1) + (FP + 1)) < TN / (TN + FP)
    R' = (TP + 1) / ((TP + 1) + (FN - 1)) > TP / (TP + FN)

Information retrieval papers report precision and recall, while bioinformatics, medical and ML papers tend to report specificity and sensitivity. We include all of them. However, specificity is not as informative as precision for multi-labeled (non-binary) classifiers. We also include sequence coverage, which is the ratio of sequences for which a prediction was made to the total number of sequences in a specific class. For example, in Table 2, the coverage of mem(brane) is (340 - 23)/340 = 0.932.

An overall version of each statistic is computed as a weighted average. For the overall sensitivity (recall), the weights are the number of sequences with each actual label (ASumi), and we also refer to it as the accuracy, A:

    A = R = (Σ_{i=1..n} ASumi · Ri) / sum = (Σ_{i=1..n} ASumi · (Mii / ASumi)) / sum = (Σ_{i=1..n} Mii) / sum

For example, in Table 2, the overall sensitivity (accuracy) is A = R = (881 + 16 + 291 + 217)/1541 = 0.912.

The overall precision and overall specificity are weighted averages over the predicted labels (columns):

    P = (Σ_{i=1..n} PSumi · Pi) / (sum - PSum_{n+1}) = (Σ_{i=1..n} Mii) / (sum - PSum_{n+1})
    S = (Σ_{i=1..n} PSumi · Si) / (sum - PSum_{n+1})

where PSum_{n+1} is the count of the N.P. column. For example, the overall precision and overall specificity of the classifier in Table 2 are P = (881 + 16 + 291 + 217)/(1541 - 55) = 0.945 and S = 0.978, respectively. The overall coverage is a weighted average of the label coverage, so C = 0.964.

There are many different ways to organize test sets and we compute two different kinds of confusion matrices. Our first technique is a standard ML technique called 5-fold cross-validation (Mitchell, 1997). Each set of labeled training instances is 'randomly' divided into five groups (G1, ..., G5), while keeping the number of training instances with each label approximately the same in each training group. Then, five different classifiers are constructed (C1, ..., C5), where Ci uses all of the training instances from all of the groups except Gi. Next, a confusion matrix is computed for each of the five classifiers, Ci, using the sequences in group Gi (that were not used in its training) as test data. The final confusion matrix is then computed by summing the entries in all five confusion matrices.

In our application, there is one important modification that is necessary to ensure 'fairness' of the evaluation. Our features are obtained by extracting them from Swiss-Prot homologs. Before searching for homologs, we remove the Swiss-Prot entries of each of the test sequences. This simulates the situation where the test sequences correspond to newly sequenced proteins that would not appear in the Swiss-Prot database. We used the 5-fold cross-validation accuracy to build the feature selection filter described in the previous section.

A second technique for computing a confusion matrix is to build a single classifier from all training data except the sequences from one specific organism. This 1-organism classifier is then applied to the specific organism and a confusion matrix is constructed. This simulates the situation in which a classifier is used to predict the subcellular locations of all sequences in a newly sequenced organism. In this case, for fairness, all Swiss-Prot entries for that specific organism are removed from the Swiss-Prot database.

After the evaluation is complete, we build a final classifier using all of the training instances. This final classifier typically has better accuracy than any of the five classifiers built during 5-fold cross-validation.
RESULTS
Proteome Analyst accuracy
Tables 3-7 show the statistics for the five classifiers we built using training instances from the Swiss-Prot database.

Table 3. Statistics for the PA animal classifier: count, spec(ificity), prec(ision) and sens(itivity), as well as the 1-organism statistics for Bos taurus (BOVINE)

Location  5-fold cross-validate             1-organism: BOVINE
          count   spec   prec   sens        count  spec   prec   sens
nuc       2846    0.996  0.979  0.905       47     1.000  1.000  0.894
mit       1194    0.998  0.973  0.970       145    0.993  0.972  0.952
cyt       1845    0.981  0.866  0.919       84     0.983  0.878  0.940
ext       3943    0.991  0.972  0.927       197    0.991  0.974  0.964
gol       167     0.996  0.723  0.892       7      0.996  0.667  0.857
pex       103     0.999  0.909  0.971       4      0.999  0.800  1.000
end       457     0.996  0.868  0.952       14     0.996  0.824  1.000
lys       170     0.998  0.861  0.947       12     0.997  0.857  1.000
mem       4820    0.981  0.957  0.938       218    0.986  0.966  0.917
Overall   15549   0.988  0.946  0.929       728    0.990  0.950  0.941

The ontological labels are: nuc(lear), mit(ochondria), cyt(oplasmic), ext(racellular), gol(gi), pe(ro)x(isomal), end(oplasmic reticulum), lys(osomal) and mem(brane).

Table 4. Statistics for the PA green plant classifier. See Table 3 for abbreviations

Location  5-fold cross-validate             1-organism: MAIZE
          count   spec   prec   sens        count  spec   prec   sens
nuc       168     0.999  0.988  0.964       16     1.000  1.000  1.000
mit       307     0.992  0.926  0.935       19     0.986  0.900  0.947
cyt       447     0.987  0.923  0.960       36     0.992  0.971  0.917
ext       127     0.996  0.887  0.866       6      0.981  0.667  1.000
gol       35      0.998  0.850  0.971       2      1.000  1.000  1.000
chl       1899    0.973  0.980  0.959       69     0.979  0.969  0.913
pex       29      0.999  0.993  0.966       1      0.994  0.500  1.000
end       64      0.998  0.903  0.875       6      1.000  1.000  1.000
vac       82      0.997  0.870  0.817       2      0.994  0.667  1.000
mem       135     0.992  0.805  0.733       9      0.987  0.600  0.333
Overall   3293    0.982  0.951  0.939       728    0.987  0.926  0.904

Additional labels are chl(oroplast) and vac(uole). The 1-organism is Zea mays.

Table 5. Statistics for the PA fungi classifier. See Tables 3 and 4 for abbreviations

Location  5-fold cross-validate             1-organism: NEUCR
          count   spec   prec   sens        count  spec   prec   sens
nuc       621     0.975  0.933  0.833       11     1.000  1.000  1.000
mit       406     0.977  0.888  0.744       45     0.976  0.976  0.889
cyt       395     0.949  0.786  0.808       15     0.958  0.824  0.933
ext       171     0.993  0.914  0.871       2      1.000  1.000  1.000
gol       52      0.991  0.689  0.808       0      1.000  0/0    0/0
pex       64      0.993  0.786  0.859       0      1.000  0/0    0/0
end       64      0.993  0.750  0.656       1      1.000  1.000  1.000
mem       302     0.989  0.932  0.861       12     0.987  0.917  0.917
vac       19      0.996  0.600  0.632       1      1.000  1.000  1.000
Overall   2094    0.975  0.871  0.811       87     0.978  0.940  0.908

The 1-organism is Neurospora crassa.

Table 6. Statistics for the PA GN bacteria classifier. See Table 2 for abbreviations

Location  5-fold cross-validate             1-organism: HAEIN
          count   spec   prec   sens        count  spec   prec   sens
cyt       1861    0.989  0.992  0.955       73     1.000  1.000  1.000
ext       253     0.986  0.838  0.858       15     0.990  0.929  0.867
per       385     0.986  0.898  0.873       7      0.991  0.875  1.000
inn       432     0.993  0.958  0.951       5      1.000  1.000  0.800
wal       46      0.999  0.956  0.935       0      0.983  0/0    0/0
out       197     0.996  0.938  0.919       15     1.000  1.000  0.800
Overall   3174    0.990  0.959  0.934       115    0.990  0.964  0.922

Additional ontological labels are: inn(er membrane), per(iplasmic), (cell) wal(l) and out(er membrane). The 1-organism is Haemophilus influenzae.

Table 7. Statistics for the PA GP bacteria classifier. See Tables 3 and 6 for abbreviations

Location  5-fold cross-validate             1-organism: STRCO
          count   spec   prec   sens        count  spec   prec   sens
cyt       930     0.982  0.988  0.948       37     1.000  1.000  1.000
wal       19      0.997  0.750  0.789       0      0/0    0/0    0/0
ext       252     0.967  0.841  0.881       6      1.000  1.000  1.000
mem       340     0.982  0.929  0.853       9      1.000  1.000  0.889
Overall   1541    0.980  0.946  0.914       52     1.000  1.000  0.981

The 1-organism is Streptomyces coelicolor.
The training sets are publicly available (PA-SUB, ext 252 0.967 0.841 0.881 6 1.000 1.000 1.000 2003, http://www.cs.ualberta.ca/~bioinfo/PA/Subcellular), mem 340 0.982 0.929 0.853 9 1.000 1.000 0.889 along with the confusion matrices that were used to compute Overall 1541 0.980 0.946 0.914 52 1.000 1.000 0.981 these statistics. Each training set contains a set of sequences in FastA format that includes the correct label (from Swiss-Prot), The 1-organism is Streptomyces coelicolor. the organism tag, the organism name, Swiss-Prot taxonomy information and the primary sequence. These classifiers show excellent 5-fold cross-validation and Rost, 2002), we constructed two custom subcellular localiza- 1-organism statistics over all ontological classes. However, tion classifiers using their ontology and training data. The some small training and test sets produce poor results, such LOCkey paper contains a confusion matrix for a Swiss-Prot as the precision (0.600) for the 19 training/test instances of dataset with 1162 training instances. Table 8 shows the 5-fold vacuolar in the fungi classifier (Table 5). cross-validation specificity, precision and recall, computed We performed additional experiments to compare our work from their confusion matrix and from a PA classifier we built with similar systems. To compare PA to LOCkey (Nair and using their training data and ontology. 552 Predicting subcellular localization Table 8. A comparison of the statistics of a PA classifier built using the Table 9. 
Table 8. A comparison of the statistics of a PA classifier built using the LOCkey 1161 sequence training data with the statistics produced by the LOC(Key) classifier on their training data

Location   Count   Specificity       Precision         Sensitivity
                   LOC     PA        LOC     PA        LOC     PA
mit         190    0.945   0.993     0.763   0.964     0.795   0.979
ext         334    0.947   0.975     0.879   0.937     0.953   0.973
nuc         352    0.926   0.985     0.850   0.965     0.971   0.929
chl          94    0.979   0.997     0.718   0.966     0.609   0.894
cyt         136    0.970   0.973     0.656   0.804     0.428   0.846
end          14    0.993   1.000     0.200   1.000     0.154   0.500
lys           7    0.999   0.999     0.000   0.833     0.000   0.714
gol          22    0.998   0.999     0.895   0.944     0.810   0.773
pex           8    1.000   1.000     0.000   1.000     0.000   0.375
vac           4    1.000   1.000     0.000   1.000     0.000   0.250
Overall    1161    0.945   0.983     0.815   0.936     0.815   0.912

See Tables 3 and 4 for the ontological label abbreviations.

Table 9. A comparison of the statistics of a PA classifier built using the PSORT-B training data with the statistics produced by the PSORT-B predictor built from the same training data

Location   Count   Precision           Sensitivity
                   PSORT-B   PA        PSORT-B   PA
cyt         252    0.976     0.947     0.694     0.853
inn         308    0.967     0.965     0.787     0.906
per         264    0.919     0.915     0.576     0.860
out         378    0.988     0.986     0.903     0.947
ext         241    0.944     0.876     0.700     0.880
Overall    1443    0.965     0.943     0.748     0.895

See Table 5 for the ontological label abbreviations.

Our specificity, precision and sensitivity results are consistently better than the LOCkey results, except for sensitivity on the golgi class. Our accuracy (overall sensitivity) is almost 10% better at 0.912 versus 0.815. Even though our approaches are similar, there are two reasons for these accuracy differences. First, we are using a different classifier technology (NB) versus an ad-hoc method. Second, we are using different Swiss-Prot database fields (including the IPR field). Their paper does not include a confusion matrix or accuracy statistics for 100% coverage of a larger 3146 sequence set, other than to indicate that the accuracy is less than the 0.815 accuracy of their 34% coverage classifier. On this larger set (100% coverage), we achieved an accuracy (overall sensitivity) of 0.889 (PA-SUB, 2003).

We also built a custom classifier for GN bacteria using the reliable PSORT-B GN bacteria data (Gardy et al., 2003) as a training set. Table 9 shows the 5-fold cross-validation precision and sensitivity (recall) presented in their paper and the same statistics computed from a PA classifier built using the PSORT-B training data and ontology. They do not report specificity, so it is not in Table 9. Note that our Swiss-Prot GN bacteria ontology has one extra label, (cell) wal(l), which they include in the ext(racellular) class. To compare our technique more directly with theirs, we did not include a (cell) wal(l) label in the classifier we built from their data.

Note that 139 out of 1443 training sequences in the PSORT-B training data have two labels. To accommodate double-labels in our NB classifier, we transformed each training instance that had two labels into two training instances, one with each label. Since we are comparing with the PSORT-B classifier (Gardy et al., 2003), we followed their lead during predictor evaluation and counted a prediction as correct if it predicted either of the two labels.

The PA approach is very different from the PSORT-B approach, since PA uses a simple NB classifier and features extracted from Swiss-Prot homologs, while PSORT-B uses a set of six sequence-based models. Nevertheless, PA produces results that are somewhat better for sensitivity and accuracy, and very close in precision. Furthermore, the PA technique produces excellent results for animals, plants, fungi and GP bacteria (with different classifiers of course). As a final test, we applied our full Swiss-Prot trained GN bacteria classifier to the PSORT-B test set and obtained an accuracy of 0.869 (Lu, 2003; PA-SUB, 2003).

553 Z.Lu et al.

Sequence coverage

If PA is applied to an entire organism, there will be some sequences without homologs, so no features can be extracted and used by the classifier. In some cases, even though homologs are found, there will be no relevant tokens in the FUNCTION, IPR and SCELL fields used by PA to construct features. We call such sequences excluded sequences and PA makes no subcellular localization prediction for excluded sequences. Excluded sequences are the only ones that reduce the coverage of PA classifiers. To gain an appreciation for the PA subcellular localization sequence coverage on various organisms, we used the PA classifiers to classify all the sequences in several organisms as shown in Table 10. A more complete table is online (PA-SUB, 2003).

Before running PA on an organism, we removed all the sequences for that organism from Swiss-Prot, so that no exact sequence matches would be found. Of course, for these tests, we cannot report accuracy, since we do not know the 'correct' subcellular localization for many of them. The organisms use the animal, plant, fungi, GN bacteria and GP bacteria classifiers, respectively. Each was selected since its complete proteome is publicly available. We are currently developing pattern recognition and discovery software that can be used to extract local features from excluded sequences so that the coverage may approach 100%.
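The excluded-sequence bookkeeping described above can be sketched as follows. Here `features_of` is a hypothetical stand-in for the whole BLAST-and-parse pipeline, not a PA function: it returns the set of relevant tokens found in a sequence's top homologs, or an empty set when there are none.

```python
def proteome_coverage(sequences, features_of, classify):
    """Predict a localization for every sequence whose homologs yield at
    least one relevant token; the rest are excluded (no prediction)."""
    predictions, excluded = {}, []
    for seq_id, sequence in sequences.items():
        tokens = features_of(sequence)
        if tokens:
            predictions[seq_id] = classify(tokens)
        else:
            excluded.append(seq_id)   # no homolog with a relevant feature
    coverage = len(predictions) / len(sequences)
    return predictions, excluded, coverage
```

Coverage is then the fraction of non-excluded sequences, which is how the Cov column of Table 10 is defined.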
Table 10. Sequence coverage of the PA classifiers on some fully sequenced organisms

Organism       Class         Count     Exclude   Cov
M.musculus     Animal        27 754    7099      0.745
A.thaliana     Plant         26 032    10 043    0.600
S.pombe        Fungi         5007      1023      0.787
B.subtilis     GP bacteria   4098      1346      0.672
P.aeruginosa   GN bacteria   5557      1355      0.756

The count is the total number of genetic sequences for the organism. An exclude(d) sequence is one for which PA was unable to find at least one homolog whose E-value was less than 0.001 that contained at least one relevant feature. The cov(erage) is the ratio of all non-excluded sequences from an organism to the total number of sequences from that organism.

Table 11. The accuracy of PA GN classifiers that use different homolog selection techniques, different Swiss-Prot fields and different feature extraction techniques

PSI-BLAST    Top        KWORD    IPR   SCELL    acc
iterations   homologs
1            3          phrase   yes   phrase   0.934
1            3          phrase   yes   no       0.924
1            3          phrase   no    phrase   0.934
1            3          no       no    phrase   0.922
2            3          phrase   yes   phrase   0.932
1            2          phrase   yes   phrase   0.935
1            4          phrase   yes   phrase   0.936
1            3          words    yes   words    0.929

DISCUSSION

Extracting ontological labels for training sequences

We selected all mature sequences (≥40 amino acids) from the Swiss-Prot database and tried to extract their ontological labels. Although the Swiss-Prot database contains a subcellular localization field, this field does not contain just a single ontological label for each sequence. Therefore, we had to construct a parser that extracted a simple ontological label, when possible. Here are the rules that our parser uses to label potential training sequences:

(1) See if the field contains one of the ontological labels. If it does not, the sequence is rejected as a training sequence.

(2) If it contains more than one ontological label, it is also rejected, unless one label is an organelle and the other is membrane.

(3) If it contains an ontological label, but also contains the phrase 'potential' or 'by similarity', it is rejected if the number of training sequences with that label is high. However, if the number of training sequences with that label is small (<1.5% of the total number of training instances), it is accepted.

(4) If the ontological label is 'cell wall' and the phrase contains the word 'attached', it is rejected.

Steps 2, 3 and 4 require some explanation. For step 2, it is common to describe a protein as being in a membrane of a specific organelle. In this case, the correct label is the organelle. In step 3, we want to reject any annotations that contain words like 'potential' or 'by similarity'. However, for ontological labels with low numbers of training instances, we found that accepting 'higher risk' annotations is necessary to obtain enough training data so that the classifiers have good accuracy. Note that we followed the PSORT-B lead of including any sequences that contain the phrase 'cell wall' in the extracellular class for the plant and fungi ontologies, since the Swiss-Prot data is not very accurate in these cases. For step 4, we found many Swiss-Prot SCELL annotations for proteins that are not in the cell wall, which contain the phrase 'attached to the cell wall'.

Selecting homologs and extracting features

We experimented with many different implementations of the 5-step prediction process described earlier in this paper. For step 1, we used PSI-BLAST instead of BLAST and varied the number of iterations. Second, we varied the number of homologs whose features were extracted. The highest accuracies were obtained by using one iteration of PSI-BLAST (so we reverted to BLAST). There is not much difference between using the top two, three or four homologs (whose E-values were smaller than 0.001), so we decided to pick three, while we investigate this further.

For step 2, we varied the Swiss-Prot fields that we used to extract features. We used combinations of the KEYWORD field (KWORD), the InterPro numbers from the DBSOURCE field (IPR), and the SUBCELLULAR LOCALIZATION subfield of the COMMENT field (SCELL). We also varied the way we parsed the fields to extract features. For example, we tried stemming (Jurafsky and Martin, 2000) on the KWORD field so that the words 'vacuole' and 'vacuoles' are the same. We also tried treating semi-colon delimited phrases like 'Purine biosynthesis' as a single feature versus two separate features in the KWORD field. The best results were obtained by using semi-colon delimited phrases without stemming. For the SCELL, we tried using all individual words as features and we tried using a fixed set of pre-defined phrases (PA-SUB, 2003). The pre-defined phrase approach worked the best. Table 11 shows accuracy results for some of our experiments.

Notice from Table 11 that using the SCELL, IPR and KWORD fields of the Swiss-Prot database gives the best prediction results, although the IPR field is the least important for predicting subcellular localization. Therefore, the better accuracy of PA compared to LOCkey cannot be attributed only to the inclusion of the IPR field. The NB classifier itself accounts for most of the improvement over LOCkey. However, there is hope that IPR numbers may be useful for some localizations. A domain projection technique based on SMART domains (Schultz et al., 2000), which are included in InterPro, has been somewhat successful in identifying the labels: extracellular, cytoplasmic and nuclear (Mott et al., 2002). In addition, we have found that using the IPR number is very significant for general function prediction and other specialized predictors we have constructed using PA, like K+ ion channel protein classification (Szafron et al., 2003b).

Selecting a classifier technology

For step 3, we varied the kinds of classifiers. Table 12 shows a summary of results (PA-SUB, 2003) for NB, ANN, SVM and three nearest neighbor classifiers (1NN, 3NN and 5NN). For a k-nearest neighbor predictor, after BLASTing for homologs, we ignored all Swiss-Prot fields except for the SCELL field. The k homologs with the smallest E-values (<0.001) that had a non-empty SCELL field voted for a subcellular localization label, based on their own field label. In the case of a tie, the homolog with the smallest BLAST E-value won.

Table 12. A comparison of the accuracy of NB, ANN, SVM and three nearest neighbor classifiers (1NN, 3NN and 5NN) on the five Swiss-Prot datasets, the LOCkey dataset and the PSORT-B dataset

Category   NB      ANN     SVM     1NN     3NN     5NN
Animal     0.929   0.883   0.956   0.910   0.919   0.919
Plant      0.939   0.971   0.947   0.900   0.912   0.911
Fungi      0.811   0.856   0.814   0.726   0.772   0.752
GP bact    0.914   0.949   0.898   0.812   0.845   0.843
GN bact    0.934   0.956   0.939   0.868   0.899   0.892
LOCkey     0.912   0.943   0.924   0.720   0.763   0.768
PSORT-B    0.895   0.927   0.888   0.615   0.652   0.653

As shown in Table 12, the NB accuracy is better than any of the k-nearest neighbor classifiers, but is inferior to the best ANN and SVM classifiers by up to 5%. However, it is very difficult to explain the predictions of ANN and SVM classifiers, so we feel that this small decrease in accuracy is more than compensated by the ability to explain the predictions to users. Explanation is an important factor in getting users to trust predictors (Szafron et al., 2003a).

The explain mechanism of PA allows users to review the evidence used by a classifier to make a prediction. For example, both the NB and ANN classifiers predict that the GN protein OMP1_CHLMU is an outer membrane protein, even though the Swiss-Prot 41 SCELL entry is CELL WALL SURFACE. However, the PA explain mechanism for NB classifiers allows the user to view the evidence, while there is no way to do this in an ANN classifier. Figure 4 shows part of an explain page for the OMP1_CHLMU classification. Each horizontal bar represents the evidence for a particular location on a logarithmic scale. Each sub-bar with different shading indicates the evidence due to the existence of a single feature (porin, outer membrane, ipr000604, integral membrane protein and transmembrane). In PA, these sub-bars are different colors, but have been represented by different shadings in this paper. The long white bar represents the accumulated evidence of the other features that are not currently displayed ('Reduced Residual').

Fig. 4. Part of the PA explain page for protein OMP1_CHLMU.

Proteome Analyst contains a mechanism for changing the five features that are displayed and the remaining features that are combined into the white bar ('Reduced Residual'). Notice that the evidence for label 'outer membrane' over 'cell wall' is overwhelming. Even though a PA-NB classifier and a PA-ANN classifier both predicted outer membrane, the advantage of using an NB classifier instead of an ANN classifier is the existence of this explanation facility. Note that in the revised Swiss-Prot version 42 database that was released in September 2003, the SCELL entry of this protein was changed to outer membrane to match the PA prediction. Although the explanation mechanism of PA was not used to influence this annotation change, it could have been used in this way.

A complete description of the PA explanation facility is beyond the scope of this paper. However, Figure 5 shows one more PA screen that can be used to view prediction evidence. This screen shows relative evidence from the most important features, in selecting between the predicted class (in this case, outer membrane) and any other class of interest (in this case, cell wall). The darker bars indicate evidence for outer membrane and the lighter bars indicate evidence for cell wall. The (P) notation indicates that a token for that feature was present in the query sequence (OMP1_CHLMU) and an (A) indicates that the token for that feature was absent. We believe that the convenience and power of the PA-NB explanation facility is worth the loss of a few percentage points of accuracy that might be gained by using an ANN or SVM classifier.

Fig. 5. Viewing feature contributions to a PA prediction.

It is also possible to use multiple classifier technologies and to report a consensus, although we are not currently using this approach. It is not clear how the explanation facility would fit with such an approach.

ACKNOWLEDGEMENTS

Thanks to the referees for several helpful suggestions. Thanks to Cynthia Luk, Samer Nassar and Kevin McKee for their contributions to the first prototype of PA. Thanks to Warren Gallin and Kathy Magor for their valuable feedback while using early versions of PA. Thanks to Rajesh Nair and Burkhard Rost for providing us with their training data. Finally, a big thanks to Fiona Brinkman and Jennifer Gardy, for not only providing us with their GN bacteria training data, but also for many helpful pointers and ideas. This research was partially funded by research or equipment grants from the Protein Engineering Network of Centres of Excellence, the National Science and Engineering Research Council, Sun Microsystems and the Alberta Ingenuity Centre for Machine Learning.

REFERENCES

Altman,D.G. and Bland,J.M. (1994) Statistics notes: diagnostic tests 1: sensitivity and specificity. BMJ, 308, 1552.
Duda,R.O. and Hart,P.E. (1973) Pattern Classification and Scene Analysis. John Wiley & Sons.
EBI (2003) European Bioinformatics Institute. http://www.ebi.ac.uk/genomes
Emanuelsson,O. (2002) Predicting protein subcellular localization from amino acid sequence information. Brief. Bioinform., 3, 361-376.
Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005-1016.
Gardy,J.L., Spencer,C., Wang,K., Ester,M., Tusnády,G.E., Simon,I., Hua,S., deFays,K., Lambert,C., Nakai,K. and Brinkman,F.S.L. (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res., 31, 3613-3617.
Hua,S. and Sun,Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
Horton,P. and Nakai,K. (1997) Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proc. of the Fifth ISMB, AAAI Press, pp. 298-305.
Jurafsky,D. and Martin,J.H. (2000) Speech and Language Processing. Prentice-Hall, NJ.
Kohavi,R. and John,G.H. (1997) Wrappers for feature subset selection. Artif. Intell., 97, 273-324.
Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305, 567-580.
Lu,Z. (2003) Predicting protein subcellular localization from homologs using machine learning algorithms. MSc thesis, Department of Computing Science, University of Alberta.
Mitchell,T.M. (1997) Machine Learning. McGraw-Hill, NY.
Mott,R., Schultz,J., Bork,P. and Ponting,C.P. (2002) Predicting protein cellular localization using a domain projection method. Genome Res., 12, 1168-1174.
Nair,R. and Rost,B. (2002) Inferring subcellular localization through automated lexical analysis. Bioinformatics, 18, S78-S86.
Nakai,K. and Kanehisa,M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911.
Nakai,K. (2000) PSORT II Users' Manual.
PA-SUB (2003).
Reinhardt,A. and Hubbard,T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230-2236.
van Rijsbergen,K. (1979) Information Retrieval. Butterworths, London (UK).
Sahami,M. (1999) Using machine learning to improve information access. PhD thesis, Computer Science Department, Stanford University, Stanford, CA.
Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231-234.
Szafron,D., Greiner,R., Lu,P., Wishart,D., MacDonell,C., Anvik,J., Poulin,B., Lu,Z. and Eisner,R. (2003a) Explaining Naïve Bayes classifications. TR03-09, Department of Computing Science, University of Alberta.
Szafron,D., Lu,P., Greiner,R., Wishart,D., Lu,Z., Poulin,B., Eisner,R., Anvik,J. and MacDonell,C. (2003b) Proteome Analyst - transparent high-throughput protein annotation: function, localization and custom predictors. International Conference on Machine Learning Workshop on Machine Learning in Bioinformatics (ICML-Bioinformatics), August 2003, Washington, 2-10.
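The evidence bars in Figures 4 and 5 decompose a Naive Bayes score into per-feature log-probability terms. A minimal sketch of that decomposition, with made-up probability tables (`priors`, `likelihoods`) rather than PA's trained parameters:

```python
import math

def nb_evidence(priors, likelihoods, present):
    """Per-feature log evidence for each class. priors[c] = P(c);
    likelihoods[c][f] = P(f present | c). A feature absent from the query
    contributes log(1 - P(f | c)), matching the (P)/(A) notation."""
    evidence = {}
    for c in priors:
        terms = {"<prior>": math.log(priors[c])}
        for f, p in likelihoods[c].items():
            terms[f] = math.log(p if f in present else 1.0 - p)
        evidence[c] = terms
    return evidence

def nb_predict(evidence):
    """The predicted class maximizes the summed log evidence."""
    return max(evidence, key=lambda c: sum(evidence[c].values()))
```

Each `terms[f]` value corresponds to one sub-bar on the explain page, so the gap between two classes' bars is exactly the difference of these sums.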
Bioinformatics – Oxford University Press
Published: Jan 22, 2004