The protein-protein interaction networks of even well-studied model organisms are sketchy at best, highlighting the continued need for computational methods to help direct experimentalists in the search for novel interactions. This need has prompted the development of a number of methods for predicting protein-protein interactions based on various sources of data and methodologies. The common method for choosing negative examples for training a predictor of protein-protein interactions is based on annotations of cellular localization, and the observation that pairs of proteins that have different localization patterns are unlikely to interact. While this method leads to high quality sets of non-interacting proteins, we find that this choice can lead to biased estimates of prediction accuracy, because the constraints placed on the distribution of the negative examples make the task easier. The effects of this bias are demonstrated in the context of both sequence-based and non-sequence-based features used for predicting protein-protein interactions.

Background

Despite advances in high-throughput experimental methods for detecting protein-protein interactions, the interaction networks for even well studied model organisms are far from complete. In addition, high throughput assays typically have a high rate of false positives [1]. Therefore, there is a continuing need for computational methods to complement existing experimental approaches.

Methods for predicting protein-protein interactions use a variety of data sources. Sequence-based methods are usually based on the domain, motif, or k-mer composition of the sequences. Sprinzak and Margalit [2] have noted that many pairs of structural domains tend to appear in interacting proteins, and have used this intuition to predict interactions according to the over-representation of pairs of domains. Domain and motif composition is also the basis of several Bayesian network models that aim to explain an observed interaction network in terms of interactions between pairs of motifs or domains [3-5]. In the context of kernel methods, similar kernels designed for predicting interactions from sequence were proposed in [6,7]. Other sequence-based methods use co-evolution of interacting proteins by comparing phylogenetic trees [8], correlated mutations [9], or gene fusion [10]. An alternative approach is to combine multiple sources of genomic information — gene expression, Gene Ontology annotations, transcriptional regulation, etc. — to predict co-membership in a complex [11-13].

All the above-mentioned methods require an informed choice of positive examples (interacting pairs of proteins) and negative examples (non-interacting pairs of proteins) for training and assessing the performance of a classifier. In view of the large fraction of false positive interactions generated by high throughput methods, positive examples need to be chosen with care. These are often chosen as interactions generated by reliable methods (small scale experiments), interactions confirmed by several methods, or interactions confirmed by interacting paralogs [1,11,14,15].

Negative examples also need to be chosen with care, and two such selection methods have been described in the literature. Because there are no "gold standard" non-interactions, some authors suggest that high quality non-interactions can be generated by considering pairs of proteins whose cellular localization is different, most likely preventing the proteins from participating in a biologically relevant interaction [11,16]. Other authors use a simpler scheme, selecting non-interacting pairs uniformly at random from the set of all protein pairs that are not known to interact [4,7,12,17].

In this paper, we argue that the first method is not appropriate for assessing classifier accuracy. In particular, we show that restricting negative examples to non-co-localized protein pairs leads to a biased estimate of the accuracy of a predictor of protein-protein interactions. The basic assumption underlying the assessment of the accuracy of a classifier is that the distribution of testing examples reflects the intended use of the method. In the case of predicting protein-protein interactions, a simple uniform random choice of non-interacting protein pairs yields an unbiased estimate of the true distribution. In contrast, imposing the constraint of non-co-localization may induce a different distribution on the features that are used for classification. The resulting biased distribution of negative examples leads to over-optimistic estimates of classifier accuracy. This bias is likely to affect results reported in several papers [5,6,11].

The simpler selection scheme — choosing negative examples uniformly at random — also has potential pitfalls: because the interaction network is not complete, the set of negative examples can be contaminated with interacting proteins. This contamination, however, is likely to be very small: it has been estimated that the number of interactions in yeast is well below 100,000 [1,18], a number which is 0.25 percent of the total number of protein pairs in yeast. This effect is likely to be much smaller than the contamination of even high-quality positive examples; moreover, our results show that a support vector machine classifier is resistant to even higher levels of label contamination.
Results

In this paper we postulate that testing a classifier of protein-protein interactions on negative examples composed of pairs of proteins that are not co-localized results in a biased assessment of classifier accuracy. In order to test this hypothesis we need to define "co-localization." We do this using the subcellular localization component of the Gene Ontology (GO). GO keywords are becoming the standard in annotating gene products [20]. These keywords are arranged in a hierarchical manner in a rooted, directed acyclic graph, where keywords lower in the hierarchy represent more specific terms. Therefore, one cannot say that two proteins are not co-localized simply because they don't share the exact same GO terms. As a similarity measure between two GO terms we use the negative log of the fraction of genes annotated with the lowest common ancestor of the two terms. This similarity was introduced as a similarity measure on a hierarchy in [21], used in the context of GO annotations in [22], and used in a kernel in [7]. Using this measure of similarity allows us to generate parameterized sets of negative examples characterized by a maximum degree of similarity allowed between their GO cellular compartment annotations.
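To make the measure concrete, here is a minimal sketch of this term-level similarity. It is not the authors' implementation: the use of networkx, the annotations mapping, and the choice of the most informative (smallest) common ancestor as the "lowest" one are assumptions layered on the description above.

```python
import math
import networkx as nx  # assumed: the GO DAG with edges pointing child -> parent

def term_similarity(go_dag, annotations, t1, t2, n_genes):
    """Similarity between two GO terms: the negative log of the fraction of
    genes annotated with the lowest common ancestor of the two terms.
    `annotations[t]` is assumed to hold the genes annotated with term t or
    any of its descendants (annotations propagated up the hierarchy)."""
    ancestors1 = nx.descendants(go_dag, t1) | {t1}  # reachable nodes = ancestors here
    ancestors2 = nx.descendants(go_dag, t2) | {t2}
    common = ancestors1 & ancestors2
    if not common:
        return 0.0
    # A DAG may have several common ancestors; take the most specific one,
    # i.e. the one annotating the fewest genes (highest information content).
    lca_genes = min(len(annotations[t]) for t in common)
    return -math.log(lca_genes / n_genes)
```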
Perhaps the simplest way to predict protein-protein interactions is to represent pairs of proteins by a set of genomic features that reflect how likely they are to interact. Examples of features that were used for this task are similarity of GO process and GO function annotations, correlation of gene expression, presence of similar transcription factor binding sites in the upstream region of the genes, participation in common regulatory modules, and so on [11-13].

Table 1: The dependence of ROC scores of several variables on the co-localization threshold for the MIPS/DIP interaction data. The variables are: GO process similarity, GO function similarity, and correlations between microarray data under various environmental conditions [19]. For each threshold we computed the average ROC scores for 10 drawings of the negative examples. The standard deviation is shown in parentheses.

threshold   GO process     GO function    microarray
1.00        0.81 (0.001)   0.64 (0.002)   0.64 (0.005)
0.50        0.82 (0.001)   0.65 (0.004)   0.64 (0.003)
0.20        0.82 (0.002)   0.66 (0.005)   0.65 (0.005)
0.10        0.83 (0.002)   0.66 (0.005)   0.66 (0.003)
0.04        0.83 (0.001)   0.67 (0.004)   0.66 (0.004)

Table 1 illustrates that as we vary the upper bound on the allowed similarity between the cellular compartment annotations of pairs of proteins in the negative examples (called the co-localization threshold in what follows), GO function and process annotations, and microarray data, become more predictive of protein-protein interactions, as measured using the ROC score (the area under the receiver operating characteristic curve). This observation is not surprising. Consider, for example, biological process annotations. Interacting proteins often participate in similar processes. Conversely, negative examples that are not co-localized will be less likely to participate in similar biological processes, making this variable more predictive of interaction. A similar argument holds for the GO function annotations and gene expression correlations. Note that GO function annotations are less predictive than the process annotations because interactions are often required for carrying out a particular process, whereas proteins that carry out the same function can do so in different contexts, not requiring interaction.

Figure 1: The dependence of prediction accuracy, quantified by the area under the ROC/ROC50 curves, on the co-localization threshold used to choose negative examples. Enforcing the condition that no two proteins in the set of negative examples have a GO component similarity that is greater than a given threshold (the co-localization threshold) imposes a constraint on the distribution of negative examples. This constraint makes it easier for the classifier to distinguish between positive and negative examples, and the effect gets stronger as the co-localization threshold becomes smaller. All methods are SVM-based classifiers trained using different kernels on two interaction datasets. Results are computed using five-fold cross-validation, averaged over five drawings of negative examples. The spectrum kernel method uses pairs of k-mers as features; the motif method uses the composition of discrete sequence motifs; and the non-sequence method uses features such as co-expression as measured in microarray experiments, similarity in GO process and function annotations, etc. We performed our experiment on two yeast physical interaction datasets: the BIND data is derived from the BIND database (the experiments using the non-sequence data were performed on a subset of reliable interactions that are found by multiple assays in BIND); DIP/MIPS is a dataset of reliable interactions derived from the DIP and MIPS databases.

Using non-co-localized negative examples can lead to a bias when using sequence-based features as well. In this case the features are pairs of sequence features, e.g., motifs or k-mers that belong to a pair of protein sequences. Such a kernel was used in [6,7] with a support vector machine (SVM) classifier. The dimensionality of the feature space of these kernels is very high, and in fact, the method doesn't use an explicit representation of the features. For the sequence-based features we show the existence of the bias incurred by using non-co-localized negative examples by showing that the accuracy of a classifier depends on the co-localization threshold of the negative examples on which the method was tested. Figure 1 illustrates the increase in classifier accuracy as the co-localization threshold is decreased. This effect is much larger than the variability that results from the randomness in the choice of negative examples and the cross-validation (CV) estimate: the standard deviation of the ROC score on 10 drawings of the negative examples was 0.003, and the variability between different runs of CV is even lower. We can explain the higher accuracy at low co-localization thresholds by the fact that the constraint on localization restricts the negative examples to a sub-space of sequence space, making the learning problem easier than when there is no constraint.

In our experiments we used sets of negative examples characterized by the similarity of the localization annotations of two proteins. To see the relevance of our results to other published work, we need to establish a relationship between our co-localization threshold and criteria used elsewhere. The data of [11] is used in several studies of protein-protein interactions. They considered five very broad cellular compartments (cytoplasm, mitochondrion, nucleus, plasma membrane, and secretory pathway organelles). Four of these have corresponding nodes in the cellular compartment part of GO. The GO similarity between these compartments ranges from 0.002 to 0.36, and is 0.13 on average. At this level of the co-localization threshold our results show a strong effect.
Discussion

There are many pitfalls in designing machine learning experiments (see [23] for an example in the context of feature selection). Design of experiments in the field of bioinformatics, where various sources of data are often correlated, requires special care to make sure no information on the testing example labels leaks to the representation of the training examples. In this paper, we illustrated a phenomenon where, by constraining the distribution of negative examples, the classification problem becomes easier. Although choosing negative examples as pairs of proteins that are localized to different cellular compartments creates high-quality negative examples, it also makes them easier to distinguish from interacting proteins. In the case where the data is characterized by features such as similarity of GO process or function annotations, constraining the distribution of the component similarity has a direct effect on the distribution of the GO process annotation.

In the case of the sequence-based classifiers, the improvement in classifier performance is the result of constraining the negative examples to a smaller region of sequence space. We see a difference between the behavior of the motif/Pfam kernels and the spectrum kernel: the results with the spectrum kernel are more strongly affected by the distribution of negative examples. We believe that this difference is the result of the greater flexibility of the spectrum kernel, which allows it to fit arbitrary training sets. The motif/Pfam kernels, by contrast, use features that are more biologically relevant, so they cannot be biased as much as the spectrum kernel. The gold standard negative examples of [11] were not only constrained by lack of co-localization; they also demanded that both proteins in a pair have GO annotations in both the function and process components. This constraint would likely increase classifier accuracy even further.

The reader may suspect that the improvement in classifier accuracy when constraining the negative examples to be non-co-localized may be the result of higher quality negative examples. To address this concern we performed the following experiment to test the effect of changing the labels of a small fraction of the negative examples. We considered the MIPS/DIP dataset with the spectrum kernel, and negative examples chosen with a co-localization threshold of 0.1. We divided the dataset into two parts, training data (80%) and test data (20%), and flipped the labels of 2% of the negative examples, a fraction likely to be much higher than the level of contamination under unconstrained selection of negative examples. SVMs were trained on both the flipped and unflipped versions of the data. The average ROC (ROC50) scores for 10 draws of the data were 0.874 (0.361) for the unflipped data and 0.871 (0.356) for the flipped data. This experiment illustrates that SVMs can easily handle a larger amount of noise in the negative examples than is expected in the actual data. Thus, the effect shown above is not a result of better quality negative examples.
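A rough sketch of this label-flipping control is shown below. The paper's experiments used PyML; here scikit-learn's SVC with a precomputed kernel stands in, and the kernel matrices and ±1 label vectors (K_train, y_train, K_test, y_test) are assumed to be prepared elsewhere.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def flip_negative_labels(y, fraction=0.02, seed=0):
    """Flip a random fraction of the negative (-1) labels to +1."""
    rng = np.random.default_rng(seed)
    y = y.copy()
    neg = np.where(y == -1)[0]
    flipped = rng.choice(neg, size=int(fraction * len(neg)), replace=False)
    y[flipped] = 1
    return y

def label_flip_control(K_train, y_train, K_test, y_test, fraction=0.02):
    """Compare test ROC scores of SVMs trained on unflipped vs. flipped labels."""
    results = {}
    for name, y in [("unflipped", y_train),
                    ("flipped", flip_negative_labels(y_train, fraction))]:
        clf = SVC(kernel="precomputed").fit(K_train, y)
        results[name] = roc_auc_score(y_test, clf.decision_function(K_test))
    return results
```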
Without being aware of the bias in using gold standard non-interactions, one may think, looking at a couple of papers that describe methods for predicting protein-protein interactions from sequence [5,6], that the problem is well addressed by these methods. However, this is not the case: the good performance is in fact a result of the biased selection of negative examples, and prediction of protein-protein interactions from sequence is a difficult problem that can still be considered unsolved.

Methods

Positive Examples

We focus on prediction of physical interactions in yeast and use interaction data derived from several sources. These interactions are used as positive examples when training our classifiers.

• Data from BIND [24]. BIND includes published interaction data from high-throughput experiments as well as curated entries derived from published papers. Using physical interactions yields a dataset comprised of 10,517 interactions among 4233 yeast proteins (downloaded July 9th, 2004). Selecting interactions that were verified by multiple experimental assays yields a dataset of 750 trusted interactions. We used all the interactions for training, but assessed the performance only on trusted interactions.

• A curated set of high quality interactions from MIPS and DIP [25,26], also used in [5]. This set contains MIPS interactions that were annotated as physical interactions derived from small scale experiments, DIP interactions from small scale experiments, and DIP interactions verified by multiple experiments, for a total of 4838 interactions.

In both cases we avoided using interactions that were validated by interacting paralogs in yeast to define trusted interactions, since those are likely to be easier to predict using the sequence-based methods. We eliminated self-interactions from each dataset, since many of the features we use are based on measures of similarity between the two proteins, e.g., gene expression correlation and similarity of GO annotations.
Negative Examples

We compared two methods for choosing negative examples in this paper:

• Random pairs of proteins that are not known to physically interact.

• Parameterized sets of negative examples, chosen as random pairs of proteins that are not known to physically interact, such that the similarity of their GO cellular compartment annotations is below some threshold.

In each case the number of negative examples was chosen to be equal to the number of positive examples in the dataset.
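A minimal sketch of the second, constrained scheme follows (the unconstrained scheme is the special case threshold=None). The go_cc_similarity function is an assumed callable supplied by the caller, for instance the term-similarity sketch shown earlier applied to the proteins' cellular component annotations.

```python
import random

def sample_negative_pairs(proteins, known_interactions, n, go_cc_similarity,
                          threshold=None, seed=0):
    """Draw n protein pairs that are not known to interact; when a
    co-localization threshold is given, also require that the GO cellular
    component similarity of the pair is below that threshold."""
    rng = random.Random(seed)
    known = {frozenset(pair) for pair in known_interactions}
    negatives = set()
    while len(negatives) < n:
        a, b = rng.sample(proteins, 2)
        pair = frozenset((a, b))
        if pair in known or pair in negatives:
            continue
        if threshold is not None and go_cc_similarity(a, b) >= threshold:
            continue
        negatives.add(pair)
    return [tuple(pair) for pair in negatives]
```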
Support Vector Machines

The support vector machine (SVM) [27] is a classification method that provides state-of-the-art performance in many domains, including bioinformatics [28,29]. SVMs access the data only through the kernel function, which defines the similarity between data objects. This allows the use of SVMs even when an explicit vector-space representation of the data is not available, but a kernel function is provided. This is the case for one of the kernels used in this work, where a kernel between two pairs of sequences is defined (see below and [6,7]).

Figures of merit

In this paper we evaluate the accuracy of a trained classifier using two metrics. Both metrics — the area under the receiver operating characteristic curve (the ROC score), and the normalized area under that curve up to the first 50 false positives (the ROC50 score) — aim to measure both sensitivity and specificity by integrating over a curve that plots the true positive rate as a function of the false positive rate. The motivation for using both metrics is provided, for example, in [7].
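For illustration, one way to compute the ROC50 score from classifier scores and ±1 labels is sketched below; the ROC score itself can be obtained from a standard routine such as sklearn.metrics.roc_auc_score. This is a generic sketch of the ROCn idea, not the exact code used in the paper.

```python
import numpy as np

def roc_n_score(scores, labels, n=50):
    """Normalized area under the ROC curve up to the first n false positives
    (ROC50 for n=50). labels are +1/-1; assumes at least n negative examples."""
    order = np.argsort(-np.asarray(scores))
    labels = np.asarray(labels)[order]
    tp = fp = 0
    area = 0.0
    for y in labels:
        if y > 0:
            tp += 1
        else:
            fp += 1
            area += tp  # height of the unnormalized ROC curve at this false positive
            if fp == n:
                break
    n_pos = int((labels > 0).sum())
    return area / (n * n_pos) if n_pos else 0.0
```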
Pairwise kernels

The kernels proposed in the literature for handling genomic information, e.g., sequence kernels such as the motif and spectrum kernels presented below, provide a similarity between two sequences, or more generally, a similarity between a representation of two proteins. Therefore, such kernels are not directly applicable to the task of predicting protein-protein interactions, which requires a similarity between two pairs of proteins. Thus, we want a function K((X_1, X_2), (X_1', X_2')) that returns the similarity between the proteins X_1 and X_2 compared to the proteins X_1' and X_2'. We call a kernel that operates on individual genes or proteins a genomic kernel, and a kernel that compares pairs of genes or proteins a pairwise kernel. Two recent papers proposed an approach for converting a genomic kernel into a pairwise kernel [6,7]. They define the kernel

K((X_1, X_2), (X_1', X_2')) = K'(X_1, X_1') K'(X_2, X_2') + K'(X_1, X_2') K'(X_2, X_1')     (1)

where K'(·, ·) is any genomic kernel. The intuition behind the kernel is that for the pair (X_1, X_2) to be considered similar to (X_1', X_2'), X_1 needs to be similar to X_1' and X_2 needs to be similar to X_2' (the first term), or X_1 is similar to X_2' and X_2 is similar to X_1' (the second term). The feature space for this kernel is a vector space of (symmetrized) pairs of features from the underlying genomic kernel.
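The sketch below spells out Eq. (1) for precomputed kernel matrices. The interface (protein pairs given as index pairs into a genomic kernel matrix Kp) is an assumption made for the example rather than the interface of the paper's software.

```python
import numpy as np

def pairwise_kernel(Kp, pairs_a, pairs_b):
    """Pairwise kernel of Eq. (1):
    K((X1, X2), (X1', X2')) = K'(X1, X1') K'(X2, X2') + K'(X1, X2') K'(X2, X1').
    Kp is a precomputed genomic kernel matrix over individual proteins;
    pairs_a and pairs_b are lists of (i, j) index pairs into Kp."""
    K = np.zeros((len(pairs_a), len(pairs_b)))
    for a, (i, j) in enumerate(pairs_a):
        for b, (k, l) in enumerate(pairs_b):
            K[a, b] = Kp[i, k] * Kp[j, l] + Kp[i, l] * Kp[j, k]
    return K
```

The resulting matrix can be handed directly to an SVM implementation that accepts precomputed kernels.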
Sequence kernels

We use two sequence kernels in this work: the spectrum kernel [30] and the motif kernel [31]. The spectrum kernel models a sequence in the space of all k-mers, and its feature space is a vector of counts of the number of times each k-mer appears in the sequence. For the motif kernel we use discrete sequence motifs, representing a sequence in terms of a motif composition vector that counts how many times each discrete sequence motif matches the sequence. To compute the motif kernel we used discrete sequence motifs constructed from the eBlocks database [32]. Yeast ORFs contain occurrences of 17,768 motifs out of a set of 42,718 motifs. For both kernels we used a normalized linear kernel in the space of k-mer/motif counts: K(x, y) / sqrt(K(x, x) K(y, y)).
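To make the representation concrete, the sketch below builds spectrum (k-mer count) features and applies the normalized linear kernel K(x, y)/sqrt(K(x, x) K(y, y)). The choice of k and the amino-acid alphabet are illustrative assumptions; the motif kernel would substitute motif-match counts for the k-mer counts.

```python
import numpy as np
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def spectrum_features(seq, k=3, alphabet=AMINO_ACIDS):
    """k-mer count vector of a protein sequence (the spectrum kernel's feature space)."""
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    x = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing non-standard residues
            x[index[kmer]] += 1
    return x

def normalized_linear_kernel(x, y):
    """K(x, y) / sqrt(K(x, x) K(y, y)) with a linear base kernel."""
    return float(x @ y) / np.sqrt(float(x @ x) * float(y @ y))
```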
Availability

Data and code related to this work are available at: http://noble.gs.washington.edu/proj/sppi. All the classification experiments were performed using the PyML machine learning library, available at http://pyml.sourceforge.net.

Acknowledgements

This work is funded by NCRR NIH award P41 RR11823, by NHGRI NIH award R33 HG003070, and by NSF award BDI-0243257. WSN is an Alfred P. Sloan Research Fellow.

References
1. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P: Comparative assessment of large-scale data sets of protein-protein interactions. Nature 2002, 417:399-403.
2. Sprinzak E, Margalit H: Correlated sequence-signatures as markers of protein-protein interaction. Journal of Molecular Biology 2001, 311:681-692.
3. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions from protein-protein interactions. Genome Research 2002, 12(10):1540-1548.
4. Gomez SM, Noble WS, Rzhetsky A: Learning to predict protein-protein interactions. Bioinformatics 2003, 19:1875-1881.
5. Wang H, Segal E, Ben-Hur A, Koller D, Brutlag DL: Identifying protein-protein interaction sites on a genome-wide scale. In Advances in Neural Information Processing Systems 17. Edited by Saul LK, Weiss Y, Bottou L. Cambridge, MA: MIT Press; 2005:1465-1472.
6. Martin S, Roe D, Faulon JL: Predicting protein-protein interactions using signature products. Bioinformatics 2005, 21(2):218-226.
7. Ben-Hur A, Noble WS: Kernel methods for predicting protein-protein interactions. Bioinformatics 2005, 21(Suppl 1):i38-i46.
8. Ramani A, Marcotte E: Exploiting the co-evolution of interacting proteins to discover interaction specificity. Journal of Molecular Biology 2003, 327:273-284.
9. Pazos F, Valencia A: In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins: Structure, Function and Genetics 2002, 47(2):219-227.
10. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285:751-753.
11. Jansen R, Yu H, Greenbaum D, Kluger Y, Krogan NJ, Chung S, Emili A, Snyder M, Greenblatt JF, Gerstein M: A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 2003, 302:449-453.
12. Zhang LV, Wong S, King O, Roth F: Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 2004, 5:38-53.
13. Lin N, Wu B, Jansen R, Gerstein M, Zhao H: Information assessment on predicting protein-protein interactions. BMC Bioinformatics 2004, 5:154.
14. Sprinzak E, Sattath S, Margalit H: How reliable are experimental protein-protein interaction data? Journal of Molecular Biology 2003, 327(5):919-923.
15. Deane C, Salwinski L, Xenarios I, Eisenberg D: Two methods for assessment of the reliability of high throughput observations. Molecular & Cellular Proteomics 2002, 1:349-356.
16. Jansen R, Gerstein M: Analyzing protein function on a genomic scale: the importance of gold-standard positives and negatives for network prediction. Current Opinion in Microbiology 2004, 7:535-545.
17. Qi Y, Klein-Seetharaman J, Bar-Joseph Z: Random forest similarity for protein-protein interaction prediction from multiple sources. Proceedings of the Pacific Symposium on Biocomputing 2005.
18. Grigoriev A: On the number of protein-protein interactions in the yeast proteome. Nucleic Acids Research 2003, 31(14):4157-4161.
19. Gasch A, Spellman P, Kao C, Carmel-Harel O, Eisen M, Storz G, Botstein D, Brown P: Genomic expression programs in the response of yeast cells to environmental changes. Molecular Biology of the Cell 2000, 11:4241-4257.
20. Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nature Genetics 2000, 25:25-29.
21. Resnik P: Using information content to evaluate semantic similarity in a taxonomy. IJCAI 1995:448-453.
22. Lord P, Stevens R, Brass A, Goble C: Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 2003, 19(10):1275-1283.
23. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences of the United States of America 2002, 99(10):6562-6566.
24. Bader GD, Donaldson I, Wolting C, Ouellette BF, Pawson T, Hogue CW: BIND: the Biomolecular Interaction Network Database. Nucleic Acids Research 2001, 29:242-245.
25. Mewes HW, Frishman D, Gruber C, Geier B, Haase D, Kaps A, Lemcke K, Mannhaupt G, Pfeiffer F, Schüller C, Stocker S, Weil B: MIPS: a database for genomes and protein sequences. Nucleic Acids Research 2000, 28:37-40.
26. Xenarios I, Salwinski L, Duan XQJ, Higney P, Kim SM, Eisenberg D: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Research 2002, 30:303-305.
27. Boser BE, Guyon IM, Vapnik VN: A training algorithm for optimal margin classifiers. 5th Annual ACM Workshop on COLT 1992:144-152. Pittsburgh, PA: ACM Press [http://www.clopinet.com/isabelle/Papers/].
28. Schölkopf B, Smola A: Learning with Kernels. Cambridge, MA: MIT Press.
29. Noble WS: Support vector machine applications in computational biology. In Kernel Methods in Computational Biology. Edited by Schölkopf B, Tsuda K, Vert JP. Cambridge, MA: MIT Press; 2004:71-92.
30. Leslie C, Eskin E, Noble WS: The spectrum kernel: a string kernel for SVM protein classification. In Proceedings of the Pacific Symposium on Biocomputing. Edited by Altman RB, Dunker AK, Hunter L, Lauderdale K, Klein TE. New Jersey: World Scientific; 2002:564-575.
31. Ben-Hur A, Brutlag D: Remote homology detection: a motif based approach. Bioinformatics 2003, 19(Suppl 1):i26-i33.
32. Su Q, Lu L, Saxonov S, Brutlag D: eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity. Nucleic Acids Research 2005, 33:178-182.

BMC Bioinformatics 2006, 7:S2. Published: 20 March 2006.