REFERENCES
A. Krogh, B. Larsson, G. von Heijne and E. Sonnhammer (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. Journal of Molecular Biology, 305(3).
A. Reinhardt and T. Hubbard (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Research, 26(9).
R. Kohavi and G. John (1997) Wrappers for feature subset selection. Artificial Intelligence, 97.
P. Horton and K. Nakai (1997) Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5.
O. Emanuelsson, H. Nielsen, S. Brunak and G. von Heijne (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. Journal of Molecular Biology, 300(4).
D. Altman and J. Bland (1994) Statistics Notes: Diagnostic tests 1: sensitivity and specificity. BMJ, 308.
K. Nakai (2000) PSORT II Users' Manual. http://psort.nibb.ac.jp/helpwww2.html
C.J. van Rijsbergen (1979) Information Retrieval. Butterworths, London (UK). http://www.dcs.gla.ac.uk/Keith/Preface.html
R. Mott, J. Schultz, P. Bork and C. Ponting (2002) Predicting protein cellular localization using a domain projection method. Genome Research, 12(8).
K. Nakai and M. Kanehisa (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14.
S. Hua and Z. Sun (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17(8).
EBI (2003) European Bioinformatics Institute. http://www.ebi.ac.uk/genomes/
J. Gardy, C. Spencer, K. Wang, M. Ester, G. Tusnády, I. Simon, S. Hua, K. de Fays, C. Lambert, K. Nakai and F. Brinkman (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Research, 31(13).
O. Emanuelsson (2002) Predicting protein subcellular localisation from amino acid sequence information. Briefings in Bioinformatics, 3(4).
D. Szafron, R. Greiner, P. Lu, D. Wishart, C. Macdonell, J. Anvik, B. Poulin, Z. Lu and R. Eisner (2003) Explaining Naive Bayes Classifications.
J. Schultz, R. Copley, T. Doerks, C. Ponting and P. Bork (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Research, 28(1).
R. Duda and P. Hart (1973) Pattern Classification and Scene Analysis.
M. Sahami (1998) Using Machine Learning to Improve Information Access.
D. Szafron, P. Lu, R. Greiner, D. Wishart, Z. Lu, B. Poulin, R. Eisner, J. Anvik, C. Macdonell and B. Habibi-Nazhad (2003) Proteome Analyst - Transparent High-throughput Protein Annotation: Function, Localization and Custom Predictors.
R. Nair and B. Rost (2002) Inferring sub-cellular localization through automated lexical analysis. Bioinformatics, 18(Suppl. 1).
D. Jurafsky and J. Martin (2000) Speech and Language Processing.
Vol. 20 no. 4 2004, pages 547-556
BIOINFORMATICS
DOI: 10.1093/bioinformatics/btg447

Predicting subcellular localization of proteins using machine-learned classifiers

Z. Lu, D. Szafron, R. Greiner, P. Lu, D.S. Wishart, B. Poulin, J. Anvik, C. Macdonell and R. Eisner
Department of Computing Science, University of Alberta, Edmonton, AB, Canada, T6G 2E8

Received on August 26, 2003; accepted on September 25, 2003
Advance Access publication January 22, 2004

ABSTRACT
Motivation: Identifying the destination or localization of proteins is key to understanding their function and facilitating their purification. A number of existing computational prediction methods are based on sequence analysis. However, these methods are limited in scope, accuracy and most particularly breadth of coverage. Rather than using sequence information alone, we have explored the use of database text annotations from homologs and machine learning to substantially improve the prediction of subcellular location.
Results: We have constructed five machine-learning classifiers for predicting subcellular localization of proteins from animals, plants, fungi, Gram-negative bacteria and Gram-positive bacteria, which are 81% accurate for fungi and 92-94% accurate for the other four categories. These are the most accurate subcellular predictors across the widest set of organisms ever published. Our predictors are part of the Proteome Analyst web-service.
Availability: http://www.cs.ualberta.ca/~bioinfo/PA/Sub, http://www.cs.ualberta.ca/~bioinfo/PA
Contact: [email protected]
Supplementary information: http://www.cs.ualberta.ca/~bioinfo/PA/Subcellular

INTRODUCTION
High-throughput sequencing technology has made it possible for many laboratories to sequence the genomes of new organisms. There are more than 1200 genome sequences deposited in public databases (EBI, 2003, http://www.ebi.ac.uk/genomes/). Given the size and complexity of these datasets, most researchers are compelled to use automated annotation systems to identify or classify individual genes and proteins. As part of this annotation process, a number of systems have been developed that support automated prediction of subcellular localization, based on amino acid sequence information. There are three basic approaches. One approach is based on amino acid composition, using artificial neural nets (ANN) such as NNPSL (Reinhardt and Hubbard, 1998), or support vector machines (SVM) like SubLoc (Hua and Sun, 2001, http://www.bioinfo.tsinghua.edu.cn/SubLoc/). A second approach uses the existence of peptide signals, which are short sub-sequences of ~3-70 amino acids, to predict specific cell locations, such as TargetP (Emanuelsson et al., 2000). A third approach, such as the one used in LOCkey (Nair and Rost, 2002), is to do a similarity search on the sequence, extract text from homologs and use a classifier on the text features. Some tools, like PSORT (Nakai and Kanehisa, 1992; Horton and Nakai, 1997, http://psort.nibb.ac.jp/), combine a variety of individual predictors. Many tools, like SubLoc, PSORT and TMHMM (Krogh et al., 2001, http://www.cbs.dtu.dk/services/TMHMM/), are available for public use on the web. Unfortunately, most tools accept only a single sequence at a time, with TMHMM being a notable exception. Emanuelsson (2002) provides a good survey of these tools.

Better accuracy and coverage are needed
There are two limitations to current techniques. The first is the limited accuracy of the predictors, especially for some organelles. The second is limited coverage. The term coverage can be used in three ways: location coverage, sequence coverage and taxonomic coverage. All three kinds of coverage are limited in current tools.

First, location coverage defines the sub-regions (nuclear, cytoplasmic, extracellular, etc.) in the cell that are supported by a predictor. Most existing tools limit the location coverage to just membranes or just a few organelles.
Second, given a training/test set, sequence coverage is defined as the ratio of sequences for which a prediction is made to the total number of sequences of interest. For example, the LOCkey dataset consists of 3146 labeled sequences from Swiss-Prot and the predictor obtained an accuracy of 0.87 on a subset of 1161 sequences (coverage = 0.37). Sequence coverage can be measured on one organism (1-organism sequence coverage) or multiple organisms. The 1-organism measure is important for high-throughput prediction of newly sequenced organisms.

Third, taxonomic coverage measures the range of organisms for the predictor, such as: animal, green plant, Gram-negative bacteria (GN), etc. Most existing predictors have only been evaluated on a limited number of sequences from a specific taxonomic category of organism (e.g. just GN bacteria or just green plants).

Table 1 lists some predictors and gives a measure of accuracy and the kind of technique employed. It also provides an informal indication of combined sequence coverage and taxonomic coverage. Unfortunately, no standardized sequence coverage ratios have been published for these predictors.

Table 1. Accuracies (Acc.) and informal sequence/taxonomic coverage of current subcellular localization predictors

Name              Acc.   Coverage            Technique
PSORT-B           0.75   1443 GN bacterial   Combination
LOCkey            0.87   1161 assorted       Homology
SubLoc            0.91   291 prokaryotic     AA composition
                  0.79   2427 eukaryotic
TargetP           0.85   940 plant           Signal prediction
                  0.90   2738 non-plant
Proteome Analyst  0.93   16 284 animal       Homology and ML
                  0.93   3420 plant
                  0.81   2104 fungal
                  0.92   3218 GN bacterial
                  0.94   1571 GP bacterial

*To whom correspondence should be addressed.
Bioinformatics 20(4) © Oxford University Press 2004; all rights reserved.
Using classifiers for prediction
This paper describes a novel classification technique for predicting subcellular localization (Lu, 2003). This technique is used in our publicly available web-based Proteome Analyst (PA). Two tools are available for subcellular localization: a simple tool (PA-SUB) that only predicts subcellular localization (http://www.cs.ualberta.ca/~bioinfo/PA/Sub) and a more comprehensive tool that predicts subcellular localization along with other annotations, including general function (http://www.cs.ualberta.ca/~bioinfo/PA). The second tool also allows a user to build a custom classifier from custom training data.

A controlled vocabulary or ontology is required for subcellular localization. In fact, since cell structure varies across organisms, several ontologies are required and PA supports five: animal, plant, fungi, GN bacteria and Gram-positive (GP) bacteria, which are based on the PSORT ontologies. Among them, PSORT (bacteria/plants), PSORT II (animals/yeast) (Nakai, 2000, http://psort.nibb.ac.jp/helpwww2.html) and PSORT-B (GN bacteria) provide a set of predictors over the same classes of organisms as PA. However, PSORT and PSORT II are older systems with poor accuracy, whereas PSORT-B is a newer system with much better accuracy (Gardy et al., 2003).

In general, a classifier takes a query instance, described by a set of feature-value pairs, and returns one of a fixed number of labels (Mitchell, 1997). In PA, each query instance is a primary sequence that is BLASTed against the Swiss-Prot database to obtain a set of homologs. Each feature of the query instance is a Boolean value corresponding to the presence or absence of a token (word or phrase) from certain fields of the homologous sequences' Swiss-Prot database entries.

We use a machine-learning (ML) algorithm to learn a mapping from the features of a query instance to the appropriate subcellular localization label for that instance. A common technique is to apply a ML algorithm to a set of labeled training items to produce a classifier. In our case, each training item consists of a primary protein sequence and the ontological label it has been assigned by an expert. Each training instance is first BLASTed against Swiss-Prot to identify its features in the same manner as query instances. Features are not provided in the training set; they are computed automatically from Swiss-Prot data.

In this paper, we use three different sources for labeled training data: Swiss-Prot database entries that have unambiguous subcellular localization annotations (26 458 sequences), a subset of the Swiss-Prot database developed for LOCkey (3146 sequences) and the set of GN bacteria sequences (1443) used in PSORT-B. These three datasets are used to evaluate the PA classifiers. However, a PA user can also create a custom subcellular localization classifier using custom training data, by simply uploading a file of labeled training sequences (Szafron et al., 2003b). No programming is required.

In the context of PA, transparency is the ability to provide formally-sound and intuitively-simple reasons for each prediction (Szafron et al., 2003a). PA bases its predictions on well-understood concepts of conditional probabilities. Its explanations are presented as stacked bar-graphs that clearly display the evidence for each prediction.

Contributions
This paper describes a new subcellular localization prediction technique that makes the following scientific contributions:
(1) This new ML technique makes the most accurate subcellular localization predictions over the broadest range of organisms (animals, plants, fungi, GN bacteria and GP bacteria) of all subcellular localization prediction techniques published to date.
(2) This technique is publicly available as a high-throughput web-based tool in PA.
(3) Proteome Analyst provides the first explanation facility for subcellular localization predictions.
(4) Proteome Analyst can be used to easily create new subcellular classifiers using custom training data, without any programming.

SYSTEMS AND METHODS
The prediction process
Proteome Analyst predicts the subcellular localization of a query protein sequence using its primary sequence and the organism taxonomic category: animal, plant, fungi, GN bacteria, GP bacteria. Here is the five-step prediction process used by PA.
P1. The primary sequence of the query protein is BLASTed against the Swiss-Prot database and a set of homologous sequences is selected.
P2. Potential features are computed by extracting text from the Swiss-Prot records of the best homologs. A feature has the value true if a token representing that feature is extracted and false if no such token is extracted.
P3. The user-provided taxonomic organism category is used to select one of five pre-built Naïve Bayes (NB) classifiers (Duda and Hart, 1973): animal, plant, fungi, GN bacteria, GP bacteria.
P4. The features are used by the appropriate classifier to compute the probability of each label in the ontology of that classifier. The label with the highest probability is considered the primary location for the protein.
P5. The user can view a graphical explanation of the prediction (Szafron et al., 2003a).

[Fig. 1. The Swiss-Prot homologs of EXOA_STRPN from BLAST.]
[Fig. 2. The features for EXOA_STRPN extracted by PA.]
[Fig. 3. Proteome Analyst predicted subcellular locations for EXOA_STRPN.]
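The core of steps P2-P4 can be sketched as follows. This is an illustrative sketch only: the token names, prior probabilities and conditional probabilities are invented for the example, and the function names are ours, not PA's.

```python
import math

# Sketch of steps P2-P4 (BLAST and text extraction omitted): a Naive Bayes
# classifier turns Boolean token features into a probability for each
# location label. All model numbers below are invented for illustration.

def nb_posteriors(features, priors, cond):
    """features: {token: bool}; priors: P(label); cond: P(token=true | label)."""
    log_scores = {}
    for label, prior in priors.items():
        logp = math.log(prior)
        for token, present in features.items():
            p_true = cond[label][token]
            logp += math.log(p_true if present else 1.0 - p_true)
        log_scores[label] = logp
    # convert log-scores into normalized posteriors
    m = max(log_scores.values())
    unnorm = {lab: math.exp(s - m) for lab, s in log_scores.items()}
    z = sum(unnorm.values())
    return {lab: v / z for lab, v in unnorm.items()}

priors = {"cytoplasmic": 0.6, "membrane": 0.4}
cond = {
    "cytoplasmic": {"transmembrane": 0.05, "hydrolase": 0.50},
    "membrane":    {"transmembrane": 0.90, "hydrolase": 0.30},
}
post = nb_posteriors({"transmembrane": True, "hydrolase": False}, priors, cond)
best = max(post, key=post.get)   # the primary location, as in step P4
```

In PA itself, the conditional probabilities come from the sufficient statistics gathered during training (see 'Building a classifier' below), not from hand-set numbers.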
We use the GP bacterial protein Exodeoxyribonuclease from Streptococcus pneumoniae (EXOA_STRPN) as an example. If this organism was newly sequenced, its proteins would not appear in Swiss-Prot. Therefore, we removed all EXOA_STRPN entries from our Swiss-Prot database for this demonstration. We experimented with many variations of steps P1 (homolog selection) and P2 (feature extraction), as described in the Discussion section. In this section, we describe only the best configuration. We select up to three homologs with the lowest BLAST E-values that are less than 0.001.

Figure 1 shows three homologs of our query protein sequence. For feature selection, we obtained the best results using phrases extracted from selected fields of the Swiss-Prot homologs. Specifically, we extracted each semi-colon delimited phrase from the Swiss-Prot KEYWORD field of each selected homolog, as well as all InterPro numbers from the DBSOURCE field. Finally, we checked for the inclusion of a pre-defined set of phrases in the SUBCELLULAR LOCALIZATION sub-field of the COMMENT field. For ease of reference in this paper, we will denote these fields by: KWORD, IPR and SCELL, respectively. This set of phrases forms the potential feature set. The Discussion section describes alternative feature definition strategies that produced less accurate classifiers.

After computing the potential feature set, we remove all ubiquitous phrases, like 'complete proteome', that are contained in a stop-word list (van Rijsbergen, 1979, http://www.dcs.gla.ac.uk/Keith/Preface.html). For example, Figure 2 shows the potential feature set for the demonstration query sequence (EXOA_STRPN) that was extracted from the top three homologs. These features appear under the heading 'Unique Tokens Extracted for Protein #6'.

Our classifiers remove other poorly discriminating features as well. When PA builds a classifier, it actually learns the best set of features to use. This process of feature selection is a standard ML technique for improving accuracy (Kohavi and John, 1997). In fact, the five classifiers (animal, plant, fungi, GN bacteria and GP bacteria) use different machine-learned feature sets. Figure 2 shows the features that were actually used by the GP bacteria classifier to classify the demonstration sequence (EXOA_STRPN). They appear under the heading 'Relevant Tokens for Protein #6'. For example, the features ipr003034 and polymorphism appear in the 'Unique Tokens' list, but are not used by the classifier, so they are not in the 'Relevant Tokens' list.

Proteome Analyst uses a NB classifier, which generates a probability for each label. Figure 3 shows the probabilities of each of the GP bacteria labels for the demonstration sequence (EXOA_STRPN) as shown in PA.

Building a classifier
A classifier must be trained (built) before it can be used. PA uses labeled training data to build a simple NB classifier using these basic steps:
B1. Each labeled training instance consists of a primary sequence and a label from the ontology of the classifier being built.
B2. The primary sequence of each training instance is run through steps P1 and P2 described in the previous section to produce a set of potential features.
B3. A set of sufficient statistics, c+ij and c-ij, is computed for the set of training instances, where c+ij is the number of training sequences with label j in which feature Fi = true and c-ij is the number of training sequences with label j in which Fi = false.
B4. A NB classifier is built using these sufficient statistics.

In fact, as mentioned earlier, we modify this basic process by using feature selection (Kohavi and John, 1997) to improve the accuracy. After building and computing the accuracy using all the potential features, we remove 5% of the features that have the lowest information content. The information content (information gain) of a feature is a measure of the amount that a feature contributes to classifications in general (Mitchell, 1997). For example, if a feature appears in every training instance, it is useless in discriminating between labels and its information content is zero. On the other hand, a feature that appears in all training instances that have a single label and no training instances with any other labels is very good for discriminating the one label. Therefore, it has high information content. After removing this 5% of low information content features, we build a second classifier, and measure its accuracy. We then remove another 5% of low information content features and continue in this way until we have computed the accuracy of 20 different classifiers with 0%, 5%, 10%, ..., 95% of the original features removed. We identify the threshold that produced the classifier with the highest accuracy. The most accurate classifiers for subcellular localization typically had 75-80% of the least discriminating features removed.

Table 2. Confusion matrix for the PA GP classifier, trained on Swiss-Prot data

Actual       Predicted label
             cyt    wal    mem    ext    N.P.    ASum    cov      Sensitivity
cyt          881    0      13     26     10      930     0.989    0.947
wal          1      16     0      1      1       19      0.947    0.842
mem          8      1      291    17     23      340     0.932    0.856
ext          4      2      8      217    21      252     0.917    0.861
PSum         894    19     312    261    55      1541    0.964    R = 0.912
Precision    0.985  0.842  0.933  0.831                           P = 0.945
Specificity  0.979  0.998  0.983  0.966                           S = 0.978

The ontological labels are: cyt(oplasmic), (cell) wal(l), mem(brane) and ext(racellular). N.P. represents no prediction; ASum and PSum are the sums of the actual and predicted labels, respectively; cov is sequence coverage. Relative to the mem(brane) label, 291 is the TP count, the other entries in the mem column (13, 0, 8) are FP counts, the other entries in the mem row (8, 1, 17, 23) are FN counts and the remaining entries are TN counts; these counts are used in the text for illustration. R, P and S denote overall sensitivity (recall), precision and specificity, respectively.
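The feature-selection filter described above can be sketched as follows. The function and variable names are ours, not PA's, and the gain computation is the standard information-gain formula for a Boolean feature.

```python
import math

# Sketch of the iterative filter: rank features by information gain, then
# test classifiers with 0%, 5%, ..., 95% of the lowest-gain features
# removed and keep the best-scoring subset. Illustrative only.

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(label_counts, branch_counts):
    """label_counts: per-label totals; branch_counts: per-label counts for
    the feature=true branch and the feature=false branch."""
    total = sum(label_counts)
    gain = entropy(label_counts)
    for branch in branch_counts:
        weight = sum(branch) / total
        if weight:
            gain -= weight * entropy(branch)
    return gain

def best_subset(ranked, accuracy_of, step=0.05, trials=20):
    """ranked: features sorted by ascending gain; accuracy_of: callback that
    trains and evaluates a classifier on a candidate feature subset."""
    best, best_acc = ranked, float("-inf")
    for i in range(trials):
        kept = ranked[int(len(ranked) * step * i):]
        acc = accuracy_of(kept)
        if acc > best_acc:
            best, best_acc = kept, acc
    return best

# a perfectly discriminating feature has gain equal to the label entropy
g = information_gain([5, 5], [[5, 0], [0, 5]])   # 1.0 bit
# a feature present in every instance carries no information
z = information_gain([5, 5], [[5, 5], [0, 0]])   # 0.0 bits
```

With `step=0.05` and `trials=20`, `best_subset` mirrors the 0%, 5%, ..., 95% removal schedule in the text.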
Classifier evaluation
To compare classifiers, it is important to define the evaluation criteria precisely. Most techniques start with a confusion matrix or contingency table (van Rijsbergen, 1979). Table 2 shows the confusion matrix for the PA GP bacteria classifier trained on Swiss-Prot data.

We will use Table 2 to illustrate our evaluation techniques. Each entry in Table 2 represents the number of sequences in the test set whose actual label is the row label and whose predicted label is the column label. For example, the number of sequences with actual label mem(brane) that were incorrectly predicted as ext(racellular) is 17. The ASum column indicates the number of test sequences whose actual label is specified by the row label. For example, 340 sequences were actually labeled mem(brane). The PSum row indicates the number of test sequences whose predicted label is specified by the column label. For example, 312 sequences had the predicted label membrane.

Various statistics can be computed from a confusion matrix to evaluate a classifier. In this paper we will use four standard statistics: specificity, precision, sensitivity and recall (the last two are identical). Given a confusion matrix M and a set of labels {Li}, the standard definitions (van Rijsbergen, 1979; Altman and Bland, 1994) of these statistics are as follows.

The precision for each label Li is Pi, defined by:

    Pi = TP / (TP + FP) = Mii / (Σ_{k=1..n} Mki) = Mii / PSumi

Here, n is the number of labels, true positives (TP) is the number of sequences correctly predicted as Li that were actually labeled Li, and false positives (FP) is the number of sequences incorrectly predicted as Li that were actually not labeled Li. For example, consider the label mem(brane) in Table 2, where the FP entries in the mem column must be summed. From Table 2, we have TP = 291 and FP = 13 + 0 + 8 = 21. Therefore, the precision for membrane is: P(mem) = 291/(291 + 21) = 291/312 = 0.933.

The specificity for each label Li is Si, defined by:

    Si = TN / (TN + FP) = (sum - ASumi - PSumi + Mii) / (sum - ASumi)

Here, true negatives (TN) is the number of sequences correctly predicted as not Li that were actually not labeled Li, and sum is the total number of sequences (1541 in Table 2). For example, for the label mem(brane) in Table 2, the TN entries (those outside both the mem row and the mem column) must be summed. We have TN = 881 + 0 + 26 + 10 + 1 + 16 + 1 + 1 + 4 + 2 + 217 + 21 = 1180 and FP = 13 + 0 + 8 = 21. The specificity of label mem(brane) is: S(mem) = 1180/(1180 + 21) = 1180/1201 = 0.983.

The sensitivity or recall for each label Li is Ri, defined by:

    Ri = TP / (TP + FN) = Mii / (Σ_{j=1..n+1} Mij) = Mii / ASumi

Here, false negatives (FN) is the number of sequences incorrectly predicted as not Li that were actually labeled Li. For example, consider the label mem(brane) in Table 2.
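The worked membrane example can be checked mechanically. The snippet below recomputes precision, specificity and sensitivity (recall) from the Table 2 counts; the helper name is ours.

```python
# Recomputing the mem(brane) statistics from the Table 2 confusion matrix
# (rows = actual cyt, wal, mem, ext; columns = predicted cyt, wal, mem,
# ext, N.P.).

M = [
    [881, 0, 13, 26, 10],   # cyt
    [1, 16, 0, 1, 1],       # wal
    [8, 1, 291, 17, 23],    # mem
    [4, 2, 8, 217, 21],     # ext
]

def label_stats(M, i):
    """Precision, specificity and sensitivity for label i (row/column i)."""
    total = sum(sum(row) for row in M)
    tp = M[i][i]
    psum = sum(row[i] for row in M)   # predicted as label i
    asum = sum(M[i])                  # actually label i (includes N.P.)
    fp = psum - tp
    fn = asum - tp                    # the N.P. column counts as FN
    tn = total - asum - psum + tp
    return tp / (tp + fp), tn / (tn + fp), tp / (tp + fn)

prec, spec, sens = label_stats(M, 2)  # mem(brane)
# matches the worked example: 0.933, 0.983 and 0.856 (rounded)
```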
The FN entries in the mem row (including the N.P. column) must be summed. From Table 2, we have TP = 291 and FN = 8 + 1 + 17 + 23 = 49. Note that the FN number includes the no prediction (N.P.) column as well. Therefore, the sensitivity (recall) of label mem(brane) is: R(mem) = 291/(291 + 49) = 291/340 = 0.856.

The precision and specificity statistics favor conservative predictors that make no prediction when there is doubt about the correctness of a prediction, while the sensitivity (recall) statistic favors liberal predictors that make a prediction if there is a chance of success. For example, if two predictions are changed from 'no prediction' to a prediction, where one is correct and the other is incorrect, then TP increases by 1, FP increases by 1, TN decreases by 1 and FN decreases by 1. Therefore, the precision and specificity numbers both decrease, but the sensitivity (recall) increases:

    P' = (TP + 1) / ((TP + 1) + (FP + 1)) < TP / (TP + FP)
    S' = (TN - 1) / ((TN - 1) + (FP + 1)) < TN / (TN + FP)
    R' = (TP + 1) / ((TP + 1) + (FN - 1)) > TP / (TP + FN)

Information retrieval papers report precision and recall, while bioinformatics, medical and ML papers tend to report specificity and sensitivity. We include all of them. However, specificity is not as informative as precision for multi-labeled (non-binary) classifiers. We also include sequence coverage, which is the ratio of sequences for which a prediction was made to the total number of sequences in a specific class. For example, in Table 2, the coverage of mem(brane) is (340 - 23)/340 = 0.932.

An overall version of each statistic is computed as a weighted average. For the overall sensitivity (recall), the weights are the number of sequences with each actual label (ASumi), and we also refer to it as the accuracy, A:

    A = R = (Σ_{i=1..n} ASumi · Ri) / sum = (Σ_{i=1..n} ASumi · (Mii / ASumi)) / sum = (Σ_{i=1..n} Mii) / sum

For example, in Table 2, the overall sensitivity (accuracy) is A = R = (881 + 16 + 291 + 217)/1541 = 0.912.

The overall precision and overall specificity are weighted averages over the predicted labels (columns):

    P = (Σ_{i=1..n} PSumi · Pi) / (sum - PSum_{n+1}) = (Σ_{i=1..n} Mii) / (sum - PSum_{n+1})
    S = (Σ_{i=1..n} PSumi · Si) / (sum - PSum_{n+1})

where PSum_{n+1} is the count of the N.P. column. For example, the overall precision and overall specificity of the classifier in Table 2 are P = (881 + 16 + 291 + 217)/(1541 - 55) = 0.945 and S = 0.978, respectively. The overall coverage is a weighted average of the label coverage, so C = 0.964.

There are many different ways to organize test sets and we compute two different kinds of confusion matrices. Our first technique is a standard ML technique called 5-fold cross-validation (Mitchell, 1997). Each set of labeled training instances is 'randomly' divided into five groups (G1, ..., G5), while keeping the number of training instances with each label approximately the same in each training group. Then, five different classifiers are constructed (C1, ..., C5), where Ci uses all of the training instances from all of the groups except Gi. Next, a confusion matrix is computed for each of the five classifiers, Ci, using the sequences in group Gi (that were not used in its training) as test data. The final confusion matrix is then computed by summing the entries in all five confusion matrices.

In our application, there is one important modification that is necessary to ensure 'fairness' of the evaluation. Our features are obtained by extracting them from Swiss-Prot homologs. Before searching for homologs, we remove the Swiss-Prot entries of each of the test sequences. This simulates the situation where the test sequences correspond to newly sequenced proteins that would not appear in the Swiss-Prot database. We used the 5-fold cross-validation accuracy to build the feature selection filter described in the previous section.

A second technique for computing a confusion matrix is to build a single classifier from all training data except the sequences from one specific organism. This 1-organism classifier is then applied to the specific organism and a confusion matrix is constructed. This simulates the situation in which a classifier is used to predict the subcellular locations of all sequences in a newly sequenced organism. In this case, for fairness, all Swiss-Prot entries for that specific organism are removed from the Swiss-Prot database.

After the evaluation is complete, we build a final classifier using all of the training instances. This final classifier typically has better accuracy than any of the five classifiers built during 5-fold cross-validation.
RESULTS
Proteome Analyst accuracy
Tables 3-7 show the statistics for the five classifiers we built using training instances from the Swiss-Prot database.

Table 3. Statistics for the PA animal classifier: count, spec(ificity), prec(ision) and sens(itivity), as well as the 1-organism statistics for Bos taurus (BOVINE)

Location  5-fold cross-validate             1-organism: BOVINE
          count   spec   prec   sens        count  spec   prec   sens
nuc       2846    0.996  0.979  0.905       47     1.000  1.000  0.894
mit       1194    0.998  0.973  0.970       145    0.993  0.972  0.952
cyt       1845    0.981  0.866  0.919       84     0.983  0.878  0.940
ext       3943    0.991  0.972  0.927       197    0.991  0.974  0.964
gol       167     0.996  0.723  0.892       7      0.996  0.667  0.857
pex       103     0.999  0.909  0.971       4      0.999  0.800  1.000
end       457     0.996  0.868  0.952       14     0.996  0.824  1.000
lys       170     0.998  0.861  0.947       12     0.997  0.857  1.000
mem       4820    0.981  0.957  0.938       218    0.986  0.966  0.917
Overall   15549   0.988  0.946  0.929       728    0.990  0.950  0.941

The ontological labels are: nuc(lear), mit(ochondria), cyt(oplasmic), ext(racellular), gol(gi), pe(ro)x(isomal), end(oplasmic reticulum), lys(osomal) and mem(brane).

Table 4. Statistics for the PA green plant classifier. See Table 3 for abbreviations

Location  5-fold cross-validate             1-organism: MAIZE
          count   spec   prec   sens        count  spec   prec   sens
nuc       168     0.999  0.988  0.964       16     1.000  1.000  1.000
mit       307     0.992  0.926  0.935       19     0.986  0.900  0.947
cyt       447     0.987  0.923  0.960       36     0.992  0.971  0.917
ext       127     0.996  0.887  0.866       6      0.981  0.667  1.000
gol       35      0.998  0.850  0.971       2      1.000  1.000  1.000
chl       1899    0.973  0.980  0.959       69     0.979  0.969  0.913
pex       29      0.999  0.993  0.966       1      0.994  0.500  1.000
end       64      0.998  0.903  0.875       6      1.000  1.000  1.000
vac       82      0.997  0.870  0.817       2      0.994  0.667  1.000
mem       135     0.992  0.805  0.733       9      0.987  0.600  0.333
Overall   3293    0.982  0.951  0.939       728    0.987  0.926  0.904

Additional labels are chl(oroplast) and vac(uole). The 1-organism is Zea mays.

Table 5. Statistics for the PA fungi classifier. See Tables 3 and 4 for abbreviations

Location  5-fold cross-validate             1-organism: NEUCR
          count   spec   prec   sens        count  spec   prec   sens
nuc       621     0.975  0.933  0.833       11     1.000  1.000  1.000
mit       406     0.977  0.888  0.744       45     0.976  0.976  0.889
cyt       395     0.949  0.786  0.808       15     0.958  0.824  0.933
ext       171     0.993  0.914  0.871       2      1.000  1.000  1.000
gol       52      0.991  0.689  0.808       0      1.000  0/0    0/0
pex       64      0.993  0.786  0.859       0      1.000  0/0    0/0
end       64      0.993  0.750  0.656       1      1.000  1.000  1.000
mem       302     0.989  0.932  0.861       12     0.987  0.917  0.917
vac       19      0.996  0.600  0.632       1      1.000  1.000  1.000
Overall   2094    0.975  0.871  0.811       87     0.978  0.940  0.908

The 1-organism is Neurospora crassa.

Table 6. Statistics for the PA GN bacteria classifier. See Table 2 for abbreviations

Location  5-fold cross-validate             1-organism: HAEIN
          count   spec   prec   sens        count  spec   prec   sens
cyt       1861    0.989  0.992  0.955       73     1.000  1.000  1.000
ext       253     0.986  0.838  0.858       15     0.990  0.929  0.867
per       385     0.986  0.898  0.873       7      0.991  0.875  1.000
inn       432     0.993  0.958  0.951       5      1.000  1.000  0.800
wal       46      0.999  0.956  0.935       0      0.983  0/0    0/0
out       197     0.996  0.938  0.919       15     1.000  1.000  0.800
Overall   3174    0.990  0.959  0.934       115    0.990  0.964  0.922

Additional ontological labels are: inn(er membrane), per(iplasmic), (cell) wal(l) and out(er membrane). The 1-organism is Haemophilus influenzae.

Table 7. Statistics for the PA GP bacteria classifier. See Tables 3 and 6 for abbreviations

Location  5-fold cross-validate             1-organism: STRCO
          count   spec   prec   sens        count  spec   prec   sens
cyt       930     0.982  0.988  0.948       37     1.000  1.000  1.000
wal       19      0.997  0.750  0.789       0      0/0    0/0    0/0
ext       252     0.967  0.841  0.881       6      1.000  1.000  1.000
mem       340     0.982  0.929  0.853       9      1.000  1.000  0.889
Overall   1541    0.980  0.946  0.914       52     1.000  1.000  0.981

The 1-organism is Streptomyces coelicolor.
The training sets are publicly available (PA-SUB, ext 252 0.967 0.841 0.881 6 1.000 1.000 1.000 2003, http://www.cs.ualberta.ca/~bioinfo/PA/Subcellular), mem 340 0.982 0.929 0.853 9 1.000 1.000 0.889 along with the confusion matrices that were used to compute Overall 1541 0.980 0.946 0.914 52 1.000 1.000 0.981 these statistics. Each training set contains a set of sequences in FastA format that includes the correct label (from Swiss-Prot), The 1-organism is Streptomyces coelicolor. the organism tag, the organism name, Swiss-Prot taxonomy information and the primary sequence. These classifiers show excellent 5-fold cross-validation and Rost, 2002), we constructed two custom subcellular localiza- 1-organism statistics over all ontological classes. However, tion classifiers using their ontology and training data. The some small training and test sets produce poor results, such LOCkey paper contains a confusion matrix for a Swiss-Prot as the precision (0.600) for the 19 training/test instances of dataset with 1162 training instances. Table 8 shows the 5-fold vacuolar in the fungi classifier (Table 5). cross-validation specificity, precision and recall, computed We performed additional experiments to compare our work from their confusion matrix and from a PA classifier we built with similar systems. To compare PA to LOCkey (Nair and using their training data and ontology. 552 Predicting subcellular localization Table 8. A comparison of the statistics of a PA classifier built using the Table 9. 
Table 8. A comparison of the statistics of a PA classifier built using the LOCkey 1161 sequence training data with the statistics produced by the LOC(Key) classifier on their training data

Location   Count   Specificity       Precision         Sensitivity
                   LOC     PA        LOC     PA        LOC     PA
mit         190    0.945   0.993     0.763   0.964     0.795   0.979
ext         334    0.947   0.975     0.879   0.937     0.953   0.973
nuc         352    0.926   0.985     0.850   0.965     0.971   0.929
chl          94    0.979   0.997     0.718   0.966     0.609   0.894
cyt         136    0.970   0.973     0.656   0.804     0.428   0.846
end          14    0.993   1.000     0.200   1.000     0.154   0.500
lys           7    0.999   0.999     0.000   0.833     0.000   0.714
gol          22    0.998   0.999     0.895   0.944     0.810   0.773
pex           8    1.000   1.000     0.000   1.000     0.000   0.375
vac           4    1.000   1.000     0.000   1.000     0.000   0.250
Overall    1161    0.945   0.983     0.815   0.936     0.815   0.912

See Tables 3 and 4 for the ontological label abbreviations.

Table 9. A comparison of the statistics of a PA classifier built using the PSORT-B training data with the statistics produced by the PSORT-B predictor built from the same training data

Location   Count   Precision           Sensitivity
                   PSORT-B   PA        PSORT-B   PA
cyt         252    0.976     0.947     0.694     0.853
inn         308    0.967     0.965     0.787     0.906
per         264    0.919     0.915     0.576     0.860
out         378    0.988     0.986     0.903     0.947
ext         241    0.944     0.876     0.700     0.880
Overall    1443    0.965     0.943     0.748     0.895

See Table 5 for the ontological label abbreviations.

Our specificity, precision and sensitivity results are consistently better than the LOCkey results, except for sensitivity on the golgi class. Our accuracy (overall sensitivity) is almost 10% better at 0.912 versus 0.815. Even though our approaches are similar, there are two reasons for these accuracy differences. First, we are using a different classifier technology (NB) versus an ad-hoc method. Second, we are using different Swiss-Prot database fields (including the IPR field). Their paper does not include a confusion matrix or accuracy statistics for 100% coverage of a larger 3146 sequence set, other than to indicate that the accuracy is less than the 0.815 accuracy of their 34% coverage classifier. On this larger set (100% coverage), we achieved an accuracy (overall sensitivity) of 0.889 (PA-SUB, 2003).

We also built a custom classifier for GN bacteria using the reliable PSORT-B GN bacteria data (Gardy et al., 2003) as a training set. Table 9 shows the 5-fold cross-validation precision and sensitivity (recall) presented in their paper and the same statistics computed from a PA classifier built using the PSORT-B training data and ontology. They do not report specificity, so it is not in Table 9. Note that our Swiss-Prot GN bacteria ontology has one extra label, (cell) wal(l), which they include in the ext(racellular) class. To compare our technique more directly with theirs, we did not include a (cell) wal(l) label in the classifier we built from their data.

Note that 139 out of 1443 training sequences in the PSORT-B training data have two labels. To accommodate double-labels in our NB classifier, we transformed each training instance that had two labels into two training instances, one with each label. Since we are comparing with the PSORT-B classifier (Gardy et al., 2003), we followed their lead during predictor evaluation and counted a prediction as correct if it predicted either of the two labels.

The PA approach is very different from the PSORT-B approach, since PA uses a simple NB classifier and features extracted from Swiss-Prot homologs, while PSORT-B uses a set of six sequence-based models. Nevertheless, PA produces results that are somewhat better for sensitivity and accuracy, and very close in precision. Furthermore, the PA technique produces excellent results for animals, plants, fungi and GP bacteria (with different classifiers of course). As a final test, we applied our full Swiss-Prot trained GN bacteria classifier to the PSORT-B test set and obtained an accuracy of 0.869 (Lu, 2003; PA-SUB, 2003).

553 Z.Lu et al.

Sequence coverage

If PA is applied to an entire organism, there will be some sequences without homologs, so no features can be extracted and used by the classifier. In some cases, even though homologs are found, there will be no relevant tokens in the FUNCTION, IPR and SCELL fields used by PA to construct features. We call such sequences excluded sequences and PA makes no subcellular localization prediction for excluded sequences. Excluded sequences are the only ones that reduce the coverage of PA classifiers. To gain an appreciation for the PA subcellular localization sequence coverage on various organisms, we used the PA classifiers to classify all the sequences in several organisms as shown in Table 10. A more complete table is online (PA-SUB, 2003).

Before running PA on an organism, we removed all the sequences for that organism from Swiss-Prot, so that no exact sequence matches would be found. Of course, for these tests, we cannot report accuracy, since we do not know the 'correct' subcellular localization for many of them. The organisms use the animal, plant, fungi, GN bacteria and GP bacteria classifiers, respectively. Each was selected since its complete proteome is publicly available. We are currently developing pattern recognition and discovery software that can be used to extract local features from excluded sequences so that the coverage may approach 100%.
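The excluded-sequence bookkeeping described above can be sketched as follows. Here `features_of` is a hypothetical stand-in for the whole BLAST-and-parse pipeline, not a PA function: it returns the set of relevant tokens found in a sequence's top homologs, or an empty set when there are none.

```python
def proteome_coverage(sequences, features_of, classify):
    """Predict a localization for every sequence whose homologs yield at
    least one relevant token; the rest are excluded (no prediction)."""
    predictions, excluded = {}, []
    for seq_id, sequence in sequences.items():
        tokens = features_of(sequence)
        if tokens:
            predictions[seq_id] = classify(tokens)
        else:
            excluded.append(seq_id)   # no homolog with a relevant feature
    coverage = len(predictions) / len(sequences)
    return predictions, excluded, coverage
```

Coverage is then the fraction of non-excluded sequences, which is how the Cov column of Table 10 is defined.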
Table 10. Sequence coverage of the PA classifiers on some fully sequenced organisms

Organism       Class         Count     Exclude   Cov
M.musculus     Animal        27 754    7099      0.745
A.thaliana     Plant         26 032    10 043    0.600
S.pombe        Fungi         5007      1023      0.787
B.subtilis     GP bacteria   4098      1346      0.672
P.aeruginosa   GN bacteria   5557      1355      0.756

The count is the total number of genetic sequences for the organism. An exclude(d) sequence is one for which PA was unable to find at least one homolog whose E-value was less than 0.001 that contained at least one relevant feature. The cov(erage) is the ratio of all non-excluded sequences from an organism to the total number of sequences from that organism.

Table 11. The accuracy of PA GN classifiers that use different homolog selection techniques, different Swiss-Prot fields and different feature extraction techniques

PSI-BLAST    Top        KWORD    IPR   SCELL    acc
iterations   homologs
1            3          phrase   yes   phrase   0.934
1            3          phrase   yes   no       0.924
1            3          phrase   no    phrase   0.934
1            3          no       no    phrase   0.922
2            3          phrase   yes   phrase   0.932
1            2          phrase   yes   phrase   0.935
1            4          phrase   yes   phrase   0.936
1            3          words    yes   words    0.929

DISCUSSION

Extracting ontological labels for training sequences

We selected all mature sequences (≥40 amino acids) from the Swiss-Prot database and tried to extract their ontological labels. Although the Swiss-Prot database contains a subcellular localization field, this field does not contain just a single ontological label for each sequence. Therefore, we had to construct a parser that extracted a simple ontological label, when possible. Here are the rules that our parser uses to label potential training sequences:

(1) See if the field contains one of the ontological labels. If it does not, the sequence is rejected as a training sequence.

(2) If it contains more than one ontological label, it is also rejected, unless one label is an organelle and the other is membrane.

(3) If it contains an ontological label, but also contains the phrase 'potential' or 'by similarity', it is rejected if the number of training sequences with that label is high. However, if the number of training sequences with that label is small (<1.5% of the total number of training instances), it is accepted.

(4) If the ontological label is 'cell wall' and the phrase contains the word 'attached', it is rejected.

Steps 2, 3 and 4 require some explanation. For step 2, it is common to describe a protein as being in a membrane of a specific organelle. In this case, the correct label is the organelle. In step 3, we want to reject any annotations that contain words like 'potential' or 'by similarity'. However, for ontological labels with low numbers of training instances, we found that accepting 'higher risk' annotations is necessary to obtain enough training data so that the classifiers have good accuracy. Note that we followed the PSORT-B lead of including any sequences that contain the phrase 'cell wall' in the extracellular class for the plant and fungi ontologies, since the Swiss-Prot data is not very accurate in these cases. For step 4, we found many Swiss-Prot SCELL annotations for proteins that are not in the cell wall, which contain the phrase 'attached to the cell wall'.

Selecting homologs and extracting features

We experimented with many different implementations of the 5-step prediction process described earlier in this paper. For step 1, we used PSI-BLAST instead of BLAST and varied the number of iterations. Second, we varied the number of homologs whose features were extracted. The highest accuracies were obtained by using one iteration of PSI-BLAST (so we reverted to BLAST). There is not much difference between using the top two, three or four homologs (whose E-values were smaller than 0.001), so we decided to pick three, while we investigate this further.

For step 2, we varied the Swiss-Prot fields that we used to extract features. We used combinations of the KEYWORD field (KWORD), the InterPro numbers from the DBSOURCE field (IPR), and the SUBCELLULAR LOCALIZATION subfield of the COMMENT field (SCELL). We also varied the way we parsed the fields to extract features. For example, we tried stemming (Jurafsky and Martin, 2000) on the KWORD field so that the words 'vacuole' and 'vacuoles' are the same. We also tried treating semi-colon delimited phrases like 'Purine biosynthesis' as a single feature versus two separate features in the KWORD field. The best results were obtained by using semi-colon delimited phrases without stemming. For the SCELL, we tried using all individual words as features and we tried using a fixed set of pre-defined phrases (PA-SUB, 2003). The pre-defined phrase approach worked the best. Table 11 shows accuracy results for some of our experiments.

Notice from Table 11 that using the SCELL, IPR and KWORD fields of the Swiss-Prot database gives the best prediction results, although the IPR field is the least important for predicting subcellular localization. Therefore, the better accuracy of PA compared to LOCkey cannot be attributed only to the inclusion of the IPR field. The NB classifier itself accounts for most of the improvement over LOCkey. However, there is hope that IPR numbers may be useful for some localizations. A domain projection technique based on SMART domains (Schultz et al., 2000), which are included in InterPro, has been somewhat successful in identifying the labels: extracellular, cytoplasmic and nuclear (Mott et al., 2002). In addition, we have found that using the IPR number is very significant for general function prediction and other specialized predictors we have constructed using PA, like K+ ion channel protein classification (Szafron et al., 2003b).

Selecting a classifier technology

For step 3, we varied the kinds of classifiers. Table 12 shows a summary of results (PA-SUB, 2003) for NB, ANN, SVM and three nearest neighbor classifiers (1NN, 3NN and 5NN). For a k-nearest neighbor predictor, after BLASTing for homologs, we ignored all Swiss-Prot fields except for the SCELL field. The k homologs with the smallest E-values (<0.001) that had a non-empty SCELL field voted for a subcellular localization label, based on their own field label. In the case of a tie, the homolog with the smallest BLAST E-value won.

Table 12. A comparison of the accuracy of NB, ANN, SVM and three nearest neighbor classifiers (1NN, 3NN and 5NN) on the five Swiss-Prot datasets, the LOCkey dataset and the PSORT-B dataset

Category   NB      ANN     SVM     1NN     3NN     5NN
Animal     0.929   0.883   0.956   0.910   0.919   0.919
Plant      0.939   0.971   0.947   0.900   0.912   0.911
Fungi      0.811   0.856   0.814   0.726   0.772   0.752
GP bact    0.914   0.949   0.898   0.812   0.845   0.843
GN bact    0.934   0.956   0.939   0.868   0.899   0.892
LOCkey     0.912   0.943   0.924   0.720   0.763   0.768
PSORT-B    0.895   0.927   0.888   0.615   0.652   0.653

As shown in Table 12, the NB accuracy is better than any of the k-nearest neighbor classifiers, but is inferior to the best ANN and SVM classifiers by up to 5%. However, it is very difficult to explain the predictions of ANN and SVM classifiers, so we feel that this small decrease in accuracy is more than compensated by the ability to explain the predictions to users. Explanation is an important factor in getting users to trust predictors (Szafron et al., 2003a).

The explain mechanism of PA allows users to review the evidence used by a classifier to make a prediction. For example, both the NB and ANN classifiers predict that the GN protein OMP1_CHLMU is an outer membrane protein, even though the Swiss-Prot 41 SCELL entry is CELL WALL SURFACE. However, the PA explain mechanism for NB classifiers allows the user to view the evidence, while there is no way to do this in an ANN classifier. Figure 4 shows part of an explain page for the OMP1_CHLMU classification. Each horizontal bar represents the evidence for a particular location on a logarithmic scale. Each sub-bar with different shading indicates the evidence due to the existence of a single feature (porin, outer membrane, ipr000604, integral membrane protein and transmembrane). In PA, these sub-bars are different colors, but have been represented by different shadings in this paper. The long white bar represents the accumulated evidence of the other features that are not currently displayed ('Reduced Residual').

Fig. 4. Part of the PA explain page for protein OMP1_CHLMU.

Proteome Analyst contains a mechanism for changing the five features that are displayed and the remaining features that are combined into the white bar ('Reduced Residual'). Notice that the evidence for label 'outer membrane' over 'cell wall' is overwhelming. Even though a PA-NB classifier and a PA-ANN classifier both predicted outer membrane, the advantage of using an NB classifier instead of an ANN classifier is the existence of this explanation facility. Note that in the revised Swiss-Prot version 42 database that was released in September 2003, the SCELL entry of this protein was changed to outer membrane to match the PA prediction. Although the explanation mechanism of PA was not used to influence this annotation change, it could have been used in this way.

A complete description of the PA explanation facility is beyond the scope of this paper. However, Figure 5 shows one more PA screen that can be used to view prediction evidence. This screen shows relative evidence from the most important features, in selecting between the predicted class (in this case, outer membrane) and any other class of interest (in this case, cell wall). The darker bars indicate evidence for outer membrane and the lighter bars indicate evidence for cell wall. The (P) notation indicates that a token for that feature was present in the query sequence (OMP1_CHLMU) and an (A) indicates that the token for that feature was absent. We believe that the convenience and power of the PA-NB explanation facility is worth the loss of a few percentage points of accuracy that might be gained by using an ANN or SVM classifier.

Fig. 5. Viewing feature contributions to a PA prediction.

It is also possible to use multiple classifier technologies and to report a consensus, although we are not currently using this approach. It is not clear how the explanation facility would fit with such an approach.

ACKNOWLEDGEMENTS

Thanks to the referees for several helpful suggestions. Thanks to Cynthia Luk, Samer Nassar and Kevin McKee for their contributions to the first prototype of PA. Thanks to Warren Gallin and Kathy Magor for their valuable feedback while using early versions of PA. Thanks to Rajesh Nair and Burkhard Rost for providing us with their training data. Finally, a big thanks to Fiona Brinkman and Jennifer Gardy, for not only providing us with their GN bacteria training data, but also for many helpful pointers and ideas. This research was partially funded by research or equipment grants from the Protein Engineering Network of Centres of Excellence, the National Science and Engineering Research Council, Sun Microsystems and the Alberta Ingenuity Centre for Machine Learning.

REFERENCES

Altman,D.G. and Bland,J.M. (1994) Statistics notes: diagnostic tests 1: sensitivity and specificity. BMJ, 308, 1552.
Duda,R.O. and Hart,P.E. (1973) Pattern Classification and Scene Analysis. John Wiley & Sons.
EBI (2003) European Bioinformatics Institute. http://www.ebi.ac.uk/genomes
Emanuelsson,O. (2002) Predicting protein subcellular localization from amino acid sequence information. Brief. Bioinform., 3, 361-376.
Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005-1016.
Gardy,J.L., Spencer,C., Wang,K., Ester,M., Tusnády,G.E., Simon,I., Hua,S., deFays,K., Lambert,C., Nakai,K. and Brinkman,F.S.L. (2003) PSORT-B: improving protein subcellular localization prediction for Gram-negative bacteria. Nucleic Acids Res., 31, 3613-3617.
Hua,S. and Sun,Z. (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics, 17, 721-728.
Horton,P. and Nakai,K. (1997) Better prediction of protein cellular localization sites with the k nearest neighbors classifier. Proc. of the Fifth ISMB, AAAI Press, pp. 298-305.
Jurafsky,D. and Martin,J.H. (2000) Speech and Language Processing. Prentice-Hall, NJ.
Kohavi,R. and John,G.H. (1997) Wrappers for feature subset selection. Artif. Intell., 97, 273-324.
Krogh,A., Larsson,B., von Heijne,G. and Sonnhammer,E. (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol., 305, 567-580.
Lu,Z. (2003) Predicting protein subcellular localization from homologs using machine learning algorithms. MSc thesis, Department of Computing Science, University of Alberta.
Mitchell,T.M. (1997) Machine Learning. McGraw-Hill, NY.
Mott,R., Schultz,J., Bork,P. and Ponting,C.P. (2002) Predicting protein cellular localization using a domain projection method. Genome Res., 12, 1168-1174.
Nair,R. and Rost,B. (2002) Inferring subcellular localization through automated lexical analysis. Bioinformatics, 18, S78-S86.
Nakai,K. and Kanehisa,M. (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics, 14, 897-911.
Nakai,K. (2000) PSORT II Users' Manual.
PA-SUB (2003).
Reinhardt,A. and Hubbard,T. (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res., 26, 2230-2236.
van Rijsbergen,K. (1979) Information Retrieval. Butterworths, London (UK).
Sahami,M. (1999) Using machine learning to improve information access. PhD thesis, Computer Science Department, Stanford University, Stanford, CA.
Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231-234.
Szafron,D., Greiner,R., Lu,P., Wishart,D., MacDonell,C., Anvik,J., Poulin,B., Lu,Z. and Eisner,R. (2003a) Explaining Naïve Bayes classifications. TR03-09, Department of Computing Science, University of Alberta.
Szafron,D., Lu,P., Greiner,R., Wishart,D., Lu,Z., Poulin,B., Eisner,R., Anvik,J. and MacDonell,C. (2003b) Proteome Analyst - transparent high-throughput protein annotation: function, localization and custom predictors. International Conference on Machine Learning Workshop on Machine Learning in Bioinformatics (ICML-Bioinformatics), August 2003, Washington, 2-10.
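The evidence bars in Figures 4 and 5 decompose a Naive Bayes score into per-feature log-probability terms. A minimal sketch of that decomposition, with made-up probability tables (`priors`, `likelihoods`) rather than PA's trained parameters:

```python
import math

def nb_evidence(priors, likelihoods, present):
    """Per-feature log evidence for each class. priors[c] = P(c);
    likelihoods[c][f] = P(f present | c). A feature absent from the query
    contributes log(1 - P(f | c)), matching the (P)/(A) notation."""
    evidence = {}
    for c in priors:
        terms = {"<prior>": math.log(priors[c])}
        for f, p in likelihoods[c].items():
            terms[f] = math.log(p if f in present else 1.0 - p)
        evidence[c] = terms
    return evidence

def nb_predict(evidence):
    """The predicted class maximizes the summed log evidence."""
    return max(evidence, key=lambda c: sum(evidence[c].values()))
```

Each `terms[f]` value corresponds to one sub-bar on the explain page, so the gap between two classes' bars is exactly the difference of these sums.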
Bioinformatics – Oxford University Press
Published: Jan 22, 2004