Computational prediction of the human-microbial oral interactome

Edgar Coelho; Joel Arrais; Sérgio Matos; Carlos Pereira; Nuno Rosa; Maria Correia; Marlene Barros; José Oliveira

doi:10.1186/1752-0509-8-24

2014-02-27

Background: The oral cavity is a complex ecosystem where human chemical compounds coexist with a particular microbiota. However, shifts in the normal composition of this microbiota may result in the onset of oral ailments, such as periodontitis and dental caries. In addition, it is known that the microbial colonization of the oral cavity is mediated by protein-protein interactions (PPIs) between the host and microorganisms. Nevertheless, PPIs of this kind are still largely undisclosed. To elucidate these interactions, we have created a computational prediction method that allows us to obtain a first model of the Human-Microbial oral interactome.

Results: We collected high-quality experimental PPIs from five major human databases. The obtained PPIs were used to create our positive dataset and, indirectly, our negative dataset. The positive and negative datasets were merged and used for training and validation of a naïve Bayes classifier. For the final prediction model, we used an ensemble methodology combining five distinct PPI prediction techniques, namely: literature mining, primary protein sequences, orthologous profiles, biological process similarity, and domain interactions. Performance evaluation of our method revealed an area under the ROC-curve (AUC) value greater than 0.926, supporting our primary hypothesis, as no single set of features reached an AUC greater than 0.877. After subjecting our dataset to the prediction model, the classified result was filtered for very high confidence PPIs (probability ≥ 1 − 10⁻⁷), leading to a set of 46,579 PPIs to be further explored.

Conclusions: We believe this dataset holds not only important pathways involved in the onset of infectious oral diseases, but also potential drug-targets and biomarkers.
The dataset used for training and validation, the predictions obtained and the final network are available at http://bioinformatics.ua.pt/software/oralint.

Keywords: Protein-protein interactions, Oral interactome, Bayesian classification

Corresponding author affiliations: Department of Informatics Engineering (DEI), University of Coimbra, Coimbra, Portugal; Centre for Informatics and Systems of the University at Coimbra (CISUC), University of Coimbra, Coimbra, Portugal

Background
The majority of gene products that crowd a living cell interact, at least transiently, with other protein molecules. Virtually all cellular events, such as signal transduction, intracellular transport, DNA replication, transcription, translation, splicing, secretion, cell cycle control and intermediary metabolism, are mediated by protein-protein interactions (PPIs) [1]. The same applies to host-pathogen systems, where PPIs are essential in the establishment of infection [2]. The binding domains of interacting proteins reveal high structural and physical-chemical affinity with an associated degree of conservation. This is further evidenced by the fact that close protein homologs frequently interact in the same way [3-7]. With this in mind, we can expect understanding of the human interactome to provide insight into physiopathological mechanisms [8].

Numerous experimental techniques have been explored to attain the human interactome: two-hybrid screening [9,10], affinity purification mass spectrometry [11], DNA microarrays [12], protein microarrays [13-15], synthetic lethality [16], phage display [17], X-ray crystallography and nuclear magnetic resonance spectroscopy [18], fluorescence resonance energy transfer [19], surface plasmon resonance [20], atomic force microscopy [21], and electron microscopy [22].
These methods have major drawbacks that render them non-applicable in large-scale PPI prediction, namely the amount of time, associated cost and minimal protein interaction network coverage per run. Additionally, high-throughput approaches are also often associated with low specificity and large numbers of both false negatives and false positives [23]. Moreover, these techniques were developed to detect intra-species PPIs, which renders them sub-optimal in inter-species PPI identification. Still, experimental methods remain the only viable methodology to validate PPIs.

As an alternative to experimental methods, a wide range of computational approaches for the prediction of intra-species PPIs have been proposed. Computational methods can be categorized according to the types of information they analyze. One common approach consists of using text mining to extract known PPIs from the biomedical literature [24]. Additionally, there are methods based on genomic data (gene neighborhood [25-28], gene fusion [29,30], phylogenetic profiles [31-33], codon usage similarity [34]), protein structure (homology-based method [35], threading-based method [36]), domain information (single domain pairs [37-41], multi-domain pairs [42,43]), protein sequence [44-56], and Gene Ontology (GO) [57] annotation semantic similarity [58-61]. In contrast, computational efforts to predict inter-species PPIs have been very limited. Dyer et al. [2] combined domain information with a maximum likelihood estimator algorithm [37], while Davis et al. [62] adapted an approach following the threading-based method [36]. To provide a better prediction, Tastan et al. [63] applied a method combining multiple data sources, and used a random forest classifier to predict interactions between HIV-1 virus and human proteins. Despite these advances, the interactomes of several species are still far from complete. Nonetheless, the results of some of these studies provide great working knowledge of the characteristics of protein and gene interaction networks. For instance, the topological characteristics of protein interaction networks (PINs) have been proven to reflect the functionality of the interacting genes. This was demonstrated in yeast, where essential genes were more likely to be well connected and globally centered in the PIN [64,65].

Here we present a computational model to predict inter-species PPIs within the human oral cavity, an environment particularly prone to bacterial colonization. This is mostly due to the fact that human, microbial and environmental factors interact in a dynamic equilibrium within the human oral cavity [66]. Determination of the salivary interactome will clarify the role of saliva in oral biology and enable the identification of disease biomarkers. The presence of blood exudate proteins and exfoliated epithelial cells in saliva suggests it may be an alternative to blood as a diagnostic fluid in many instances. Additionally, if we consider the systemic nature of saliva, the ease and low cost associated with its handling, and the minimal risk linked to its collection for both medical staff and the patient, the reason for studying the oral cavity becomes clear [67].

As a result of this work, analysis of the resulting PPI network revealed some interesting features. Some of the PPIs involving the Rothia mucilaginosa microorganism are very specific and relevant. Moreover, our method not only predicted new PPIs between periodontal pathogens and the host, but also PPIs between different periodontal pathogens, suggesting a synergistic course of action.

© 2014 Coelho et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. Coelho et al. BMC Systems Biology 2014, 8:24, http://www.biomedcentral.com/1752-0509/8/24. Full list of author information is available at the end of the article.

Results
We conducted a series of pre-test analyses to assess the performance of our model. Then, we proceeded to test our approach on high-quality experimental protein-protein interaction (PPI) data collected from five databases. The selected databases exclusively contain manually curated PPI data.

Computational model for predicting the human-microbial interactome
Figure 1 summarizes the procedure used to achieve the model of the human-microbial interactome. The starting point of this work is a set of 4,707 proteins identified by proteomic studies as being present in the oral cavity and available on the OralCard database [66,67].

Since there is no well-established gold standard for PPIs, we collected data from five databases containing high-quality experimentally determined interactions as described further on in Methods. Extracted PPIs from the five databases were merged, creating our gold standard of positive interactions. The gold standard of negative interactions was obtained by randomly pairing the protein list on the premise that all protein pairs produced must differ from those on the positive dataset. A total of 18,371 positive and a similar number of negative pairs were obtained.

Simultaneously, for each possible pair of proteins, we constructed five clusters of features based on: (1) literature; (2) primary protein sequence information; (3) orthologous profiles; (4) biological process similarity; and (5) enriched conserved domain pairs. This was performed by accessing public databases, extracting, and then processing the collected data.

The gold standard dataset was used to train a Naïve Bayes classifier and to perform further validations on the final model. The classifier was then applied to the set of all possible pairs of protein interactions. Finally, by aggregating all individual pairs of predicted interactions, the final network was obtained.

Figure 1 Workflow applied to the construction of the human-microbial oral interactome. (a) The proteins identified in the oral proteome are obtained from the OralCard database; (b) the gold standard used for training and validation is obtained by combining the five most relevant curated protein interaction databases; (c) for each interacting protein pair, five clusters of features are constructed; (d) the previously trained classifier is applied to each pair; and (e) finally, the interactome network is obtained by combining the individual pairs of proteins.

Evaluating the reconstruction of the human interactome
In this section, we evaluate the performance of the proposed method when applied to the set of human proteins from the gold standard. We performed a 5-fold cross-validation to assess the combined and individual contributions of the clusters of features. Table 1 shows the results for the performance of each individual cluster while Table 2 presents the contribution of each cluster to the final classifier by iteratively removing each cluster.

The best performance is achieved through the ensemble of the five clusters, returning an area under the receiver operating characteristic (ROC) curve (AUC) of 0.926, a precision of 0.848 and a recall of 0.854. This result is above the performance of any individual feature
and can only be achieved with the participation of all, meaning that all features are required and have a complementary contribution.

Table 1 Analysis of the prediction performance of individual features

Feature        AUC    CA     F1-score  Precision  Recall
+ Literature   0.781  0.722  0.723     0.721      0.726
+ Sequence     0.877  0.784  0.790     0.768      0.813
+ GO           0.817  0.742  0.748     0.735      0.760
+ COGs         0.663  0.652  0.537     0.806      0.402
+ DDIs         0.620  0.617  0.424     0.861      0.281
Final Model    0.926  0.850  0.851     0.848      0.854

For each line the metrics are obtained by considering only that cluster of features on the classifier. AUC, area under the receiver operating characteristic (ROC) curve; CA, classification accuracy.

Table 2 Analysis of the contribution to the overall performance of individual cluster of features

Feature        AUC    CA     F1-score  Precision  Recall
- Literature   0.919  0.841  0.841     0.841      0.841
- Sequence     0.891  0.794  0.774     0.855      0.708
- GO           0.916  0.838  0.839     0.835      0.842
- COGs         0.923  0.846  0.847     0.842      0.852
- DDIs         0.911  0.831  0.834     0.819      0.850
Final Model    0.926  0.850  0.851     0.848      0.854

For each line the metrics are obtained by removing that cluster of features from the classifier. AUC, area under the receiver operating characteristic (ROC) curve; CA, classification accuracy.

The Sequence is simultaneously the feature with the best overall performance (AUC = 0.877) and the one that causes the most negative impact when removed from the classifier, making the AUC drop to 0.891. It also has a very interesting recall of 0.813, partially due to the fact that all protein sequences are recognized and therefore the feature has full coverage.

In contrast, the clusters of orthologous groups (COGs, with AUC = 0.663) and domain-domain interactions (DDI, with AUC = 0.620) have the lowest individual AUCs, mainly due to the low coverage of their features. Despite that, they benefit from a considerably high precision that contributes positively to the final classifier. This is especially true for the COGs which, when removed, cause the major drop in precision.

The Literature and the Gene Ontology (GO) features, while not outstanding in any particular metric, have consistent performance on almost all metrics. Nevertheless, they make a very relevant contribution to the final classifier while the removal of the Literature causes a drop of the AUC to 0.919 and the GOs to 0.916.

Global characterization of the human-microbial interactome
The classifier returned a set of 1.9 million possible interactions with a probability higher than 0.5. This corresponds to an average degree of 404 interactions per protein, which is much above the range of 3 to 30 documented in previous studies [68]. Additionally, there are reports of yeast two-hybrid screenings, the most commonly used high-throughput experimental method, reaching false-positive rates of 70%. With this in mind, and in order to minimize the presence of false-positives in our predicted interactome, we filtered our prediction results to consider only very high confidence PPIs (probability ≥ 1 − 10⁻⁷). We neglected the recall for the sake of precision. As can be observed in Figure 2, the cutoff of 1 − 10⁻⁷ is the lowest probability value where an increment does not imply a decrease in the number of interactions. This cut-off resulted in a total of 46,579 PPIs, with 37,407 being between human proteins, 6,394 between human and microbial proteins, and 2,778 between microbial proteins. The average number of protein interactions per protein after the cutoff was 8. Figure 3 is a visual representation of the interactions between the various organisms found in the oral cavity and the human host. Intra-species interactions are not shown. The thickness of the ribbons between each organism is correlated with the number of PPIs between both organisms, meaning that the organisms sharing the highest number of PPIs with the human host are Rothia mucilaginosa, Leptotrichia buccalis, and Actinomyces odontolyticus (strain independent).

Figure 2 Plot with the relation of the number of interactions (y-axis) by classifier probability (x-axis).

Figure 3 Representation of the Human-microbial inter-species protein interactions. Each section represents an organism. The ribbons connecting any two sections symbolize the PPIs between two organisms. The thickness of each ribbon correlates with the number of PPIs between both organisms.

With the exception of Homo sapiens with 3,030 proteins, the most represented organisms in the human oral cavity are Rothia mucilaginosa (strain DY-18) (Stomatococcus mucilaginosus), Actinomyces odontolyticus (strain ATCC 17982), and Streptococcus salivarius (strain SK126), with 68, 54, and 26 proteins, respectively. These organisms are opportunistic pathogens known to be associated with periodontitis [69] and caries [70].

The most frequent biological processes are related to host-microbial interactions: GO:0044281 (small molecule metabolic process, involved in 173 PPIs), GO:0019048 (viral interaction with host, involved in 161 PPIs), and GO:0045087 (innate immune response, involved in 145 PPIs).

We also identified the top three human hub-proteins present in our data: epidermal growth factor receptor (EGFR) (UniProt AC P00533, involved in 3247 PPIs), fibronectin (UniProt AC P02751, involved in 3143 PPIs), and cullin-associated NEDD8-dissociated protein 1 (CAND1) (UniProt AC Q86VP6, involved in 2911 PPIs). In terms of non-human original hub-proteins, the most common are a
serine/threonine protein kinase from Leptotrichia buccalis (UniProt AC C7NEK0, involved in 258 PPIs), a kinase domain protein from Parvimonas micra ATCC 33270 (UniProt AC A8SM03, involved in 194 PPIs), and Ras-related protein SEC4 from Saccharomyces cerevisiae (UniProt AC P07560, involved in 163 PPIs).

Discussion
Functional analysis of the human-microbial interactome
Unsurprisingly, the most frequent GO biological processes in our final PPI dataset are associated with host-pathogen interactions. The preeminence of innate immune response and viral interaction with host as the most frequent biological processes is self-explanatory. However, the association between small molecule metabolism and host-microbial interactions is not so direct.

When faced with an infection, the body will respond by initiating two major cellular signaling pathways with opposing functions: the nuclear factor (NF)-κB and glucocorticoid-mediated signal transduction cascades. While the NF-κB pathway promotes the immune response and inflammation, the glucocorticoid-mediated signal transduction cascade suppresses it. In order to explain the association between small molecule metabolism and host-pathogen interactions we must focus on the NF-κB cascade, as it is known to mediate the transcriptional activation of several cytokines (cell-signaling molecules) involved in immunity [71]. Tumor necrosis factor (TNF)-α and TNF-β, two of these cytokines, play key roles in immune regulation and inflammation [72]. However, these cytokines are mainly responsible for the metabolic instabilities that occur during the infection, as they increase the metabolism of triglycerides inducing hyperlipidemia (escalation of blood lipid levels), stimulate lipolysis (degradation of lipids), accelerate glycogen breakdown and glucose consumption and uptake, and increase the serum levels of hormones that regulate glucose metabolism. These metabolic changes possibly
explain the great number of “small molecule metabolic process” biological processes.

Analysis of hub proteins
The top three hub proteins identified share a common trait: these are exploited by pathogens in an attempt to gain entry to the host and survive inside it.

EGFR is a transmembrane protein mainly produced in the salivary glands and the kidneys [73]. Its association with microbial invasion has already been reported for Salmonella typhimurium [74], Candida albicans [75], Reovirus [76], and Vaccinia virus [77]. Apparently, all these pathogens initiate cellular invasion, at least to some extent, by binding to EGFR. This suggests the possibility that several other pathogens are using the EGFR to start host colonization, as supported by Buret et al. [78].

Similarly to EGFR, fibronectin appears to also play the role of a “microbial-anchor”. This glycoprotein is found bound to the β integrins in the cell surface, and is generally seen as a key protein for bacterial adhesion within the oral cavity [79,80].

The CAND1 protein, formerly TIP120A, was found to interact with most of the proteins in the Cullin family [81]. The Cullin protein family plays a key role in the ubiquitination of cellular proteins, i.e. performing a post-translational modification in order to label the target protein with ubiquitin molecules. This labeling frequently results in the commitment of the ubiquitin-linked protein to proteasomal degradation [82]. Consequently, CAND1 was suggested to function as a global regulator of cullin-containing ubiquitin ligases [81,83]. Because CAND1 is one of the top hub-proteins, we investigated the relationship between the ubiquitination pathway and pathogen colonization of the host cells. As expected, we found that certain bacteria corrupt the ubiquitination machinery as a means of regulating their virulence factors, or to trigger internalization of bacteria into host cells [84]. Such a mechanism improves the survival and replication chances of bacteria inside the host.

Study of the microbiome role in periodontitis
When the data analysis is focused on a particular disease such as periodontal disease, three main features can be observed: i) Rothia mucilaginosa, a microorganism present in the normal human oral microbiome but considered an opportunistic pathogen [85], is the species with the most interactions, with some of them revealing important and specific interactions; ii) new interactions are predicted between periodonto-pathogens and the host; and iii) interactions between periodonto-pathogens are also predicted, most likely explaining a synergistic course of action, as has been previously proposed [86].

Regarding the first observation, the analysis of the sub-network pertaining to Rothia mucilaginosa shares the characteristics previously described for the hub proteins, with 37/638 interactions with the EGFR protein, 40/638 interactions with fibronectin and 34/638 interactions with CAND1. Furthermore, this sub-network presents two predicted interactions which have not been described before: R. mucilaginosa proteins D2NSF5 and C6R5R8, which are predicted to interact with human immunoglobulin chains (P01719 and P01781), and could be related to the immune response specific for this species, explaining why these interactions are worth investigating.

If we consider the bacteria most associated with periodontal disease, our model predicts few interactions between A. actinomycetemcomitans, P. gingivalis, and the host proteins. As mentioned before, this is due to the fact that these organisms are not well represented in the original protein data set. However, besides the interactions predicted between these bacteria and the human hub proteins described above, in the case of Porphyromonas gingivalis it is possible to identify at least two potentially interesting new types of interactions between bacterial ribosomal proteins and a major histocompatibility complex protein (P30461). Furthermore, we also identified a possible interaction between the bacterial enolase (Q7MTV8) and a host aquaporin which could interfere with the homeostasis mechanisms of the host. Additionally, when we consider the interactions of P. gingivalis with other bacteria, we find that the same enolase might interact with outer membrane proteins of Haemophilus influenzae and Pasteurella multocida. The role of bacterial enolase as a multitask protein involved not only in carbohydrate metabolism but also in virulence has been recognized recently [87].

This suggests that previously unknown and important PPIs for oral colonization and biofilm formation may be present in this dataset. Finally, the fact that there are possible interactions between P. gingivalis proteases and those of other periodonto-pathogens such as Kingella oralis and Treponema denticola is interesting. This may even shed some light on the synergistic aspects of oral biofilm in periodontal disease [86].

Conclusion
The continuous yield of large-scale data mainly from microarrays and yeast two-hybrid studies has made the study of PPIs very appealing. The main issue associated with PPI study is the high prevalence of false positives and negatives in experimental PPI data. Being the only “reliable” source of PPIs, inaccurate experimental PPI data will contaminate training datasets and therefore compromise the performance of computational PPI prediction methods. For this reason, we believe that an improvement in the quality of experimental PPI data will greatly impact the performance of new computational PPI prediction approaches.
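One source of training-set contamination can be quantified directly: a randomly drawn "negative" pair may in fact be an interacting pair. Methods gives this chance as P(e) = K/(N − 1), with N proteins and an average degree K, and builds the negative set by random pairing against a veto list of known positives. The following is a minimal Python sketch of both ideas under stated assumptions; the function names `error_probability` and `sample_negatives` and the toy identifiers are hypothetical and are not taken from the authors' implementation.

```python
import random


def error_probability(n_proteins, avg_degree):
    """P(e) = N*K / (N*(N-1)) = K / (N-1): chance that a random pair truly interacts."""
    return avg_degree / (n_proteins - 1)


def sample_negatives(proteins, positive_pairs, n_pairs, seed=0):
    """Draw n_pairs distinct unordered pairs that are absent from the veto list.

    Assumption for illustration: pairs are unordered, so (A, B) and (B, A)
    count as the same pair, and self-pairs are never proposed.
    """
    proteins = list(proteins)
    rng = random.Random(seed)
    veto = {frozenset(pair) for pair in positive_pairs}  # known positives
    negatives = set()
    attempts, max_attempts = 0, 100 * n_pairs  # guard against endless loops
    while len(negatives) < n_pairs and attempts < max_attempts:
        attempts += 1
        candidate = frozenset(rng.sample(proteins, 2))  # two distinct proteins
        if candidate not in veto:
            negatives.add(candidate)
    if len(negatives) < n_pairs:
        raise ValueError("could not sample enough negative pairs")
    return sorted(tuple(sorted(pair)) for pair in negatives)


if __name__ == "__main__":
    # N and the K range are the values quoted in Methods.
    for k in (6, 16):
        print(f"K={k}: P(e) = {error_probability(4707, k):.4%}")
    toy_proteins = ["P1", "P2", "P3", "P4", "P5"]  # hypothetical identifiers
    toy_positives = [("P1", "P2"), ("P3", "P4")]
    print(sample_negatives(toy_proteins, toy_positives, n_pairs=3))
```

With N = 4,707 and K between 6 and 16, the estimate stays between roughly 0.13% and 0.34%, which is why random pairing is treated as an acceptable source of negative examples.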
While experimental data of that quality are not yet available, we must consider how to avoid the effects of false positives and false negatives in the final PPI prediction model.

We proposed a probabilistic Bayesian-based method to integrate several data sources, to obtain more robust and reliable PPI predictions. By applying naïve Bayes, we automatically up-weigh the most informative features and down-weigh the less informative ones, allowing for automatic error-correction.

Our individual feature analysis results show a great relevance of the selected features. When applied on a naïve Bayes classifier, the individual features synergize, boosting the AUC up to 0.926. This suggests that the reliability of prediction improves with the increase of significant features, meaning that the ensemble final model actually reduces the disadvantages of the individual methods.

Cytoscape was successfully used to validate the network when tested with real pathway examples, discovering new potentially interesting interactions in oral biology, both between the host and the periodontal pathogens and between different periodontal pathogens.

We believe our work may be applied in several scientific areas, and even in other PPI related studies. An example is biomedical PPI screening, to assess if interactions of particular interest might occur and what the related interaction probability is. Another example is pharmacologic research, as a well-established PPI network can provide insights on potential drug targets, but also new uses for existent in-market drugs. Finally, and based on the fact that protein interaction networks are dynamic [88], our work can support researchers in identifying evolutionary patterns.

Methods
Oral proteome
As a starting point for our study we used 4,707 proteins, 3,500 human and 1,207 microbial, available on the OralCard database [66,67].

These proteins were identified via proteomic analysis of the saliva, frequently by using 2D electrophoresis/mass spectrometry or 2D liquid chromatography/mass spectrometry. By the end of 2012 the salivary proteome was determined to contain 3,500 proteins of human origin and 1,207 from microbial sources.

Predictive dataset construction
The use of positive (interacting pairs of proteins) and negative (non-interacting pairs of proteins) examples is required for training and assessing the performance of the classifier. All data used in the construction of the positive data set (PDS) and the negative data set (NDS) was downloaded in March 2013.

Positive dataset
We collected experimental oral protein-protein interaction (PPI) data from five databases: 14,139 PPIs from BIOGRID [89], 254 PPIs from DIP [90], 3,555 PPIs from HPRD [91], 4,135 PPIs from IntAct [92], and 1,481 PPIs from MINT [93], totaling 23,564 protein interactions (Figure 4).

Figure 4 Venn diagram representing the intersections between the five high-quality experimentally determined protein-protein interaction databases.

All the interacting protein pairs were identified by their UniProtKB [94] Accession IDs for normalization purposes. In some instances it was necessary to convert the databases' own identifiers to UniProtKB Accession IDs. The BioGRID database represents interacting protein pairs using its own identifiers and Entrez Gene IDs. To match them to UniProtKB Accession IDs we extracted the Gene IDs from the protein pairs and downloaded the list of respective gene products in the UniProtKB Accession ID format. UniProtKB allows direct mapping from the MINT and DIP databases to another identifier. A list of PPI pairs from both databases was uploaded to the UniProtKB mapping feature, resulting in two different sets of UniProtKB Accession ID pairs. HPRD uses its own identification system coupled with NCBI Reference Sequence Accession IDs (RefSeq) to classify PPI pairs. All the RefSeq Protein IDs were converted to UniProtKB Accession IDs and paired accordingly. IntAct PPI pairs are identified with UniProtKB Accession IDs and were directly extracted.

PPI pairs from the five databases were merged and repeated entries were removed. From a total of 23,564 PPIs, 5,193 duplicated entries were removed, resulting in a PDS of 18,371 protein pairs.

Negative dataset
The selection of negative examples to integrate the negative data was based on two methods described in the literature [95]. These methods consist of randomly selecting protein pairs that are not present in a veto list containing all PPIs from the positive data set. The use of this strategy was considered acceptable because the probability of committing an error while picking a random pair is low:

$$P(e) = \frac{NK}{N(N-1)} = \frac{K}{N-1}, \qquad (K \ll N) \Rightarrow P(e) \cong 0,$$

where N is the number of proteins and K is the average degree for the final PPI network. In this study N is 4,707 and for PPIs the typical value of K is between 6 and 16.

With this strategy we generated an NDS of a size similar to that of the PDS (18,348 “negative” protein pairs), and combined it with the PDS to obtain a training data set with 36,719 PPIs.

Feature construction
In this section we describe the procedure for construction of the five clusters of features. The final results are summarized in Table 3.

Table 3 Relative coverage of protein-protein interactions present in the training and test data by individual feature clusters

                 Training data               Classification
Feature cluster  #Interactions  % of total   #Interactions  % of total
Literature       22,720         61.9%        4,698,390      69.9%
Sequence         35,379         96.4%        6,703,945      99.8%
GO               23,769         64.8%        5,130,103      76.4%
COGs             9,636          26.3%        1,324,230      19.7%
DDIs             5,994          16.3%        516,609        7.7%
Total            36,698         100.0%       6,716,792      100.0%

GO, gene ontology; COGs, clusters of orthologous groups; DDIs, domain-domain interactions.

Literature
The literature-based protein-protein interaction scores were calculated by the method described in van Haagen et al. [96]. This method is based on comparing the semantic contexts in which two proteins are mentioned in the published literature. The rationale for the method is that two proteins occurring in similar contexts will have a higher similarity score and are therefore more likely to interact. The semantic context for a given protein is defined by the concepts, from a pre-defined vocabulary, that are frequently mentioned in the same articles with that protein, and is represented by a vector containing a weight for each concept. These weights are based on the co-occurrence statistics, and measure the degree of association between the protein and each concept. Following Jelier et al. [97], we use the symmetric uncertainty coefficient U(X_i, Y_j), where X_i is in this case the protein of interest and Y_j is any other concept in the vocabulary, as the weights used for creating the concept profiles:

$$U(X_i, Y_j) = 2\,\frac{H(X_i) + H(Y_j) - H(X_i, Y_j)}{H(X_i) + H(Y_j)},$$

where H(X) is the entropy for X and H(X, Y) is the joint entropy for X and Y, calculated based on document frequency counts.

We used a corpus of nearly one million abstracts, obtained by searching PubMed with 17,402 names and synonyms extracted from UniProtKB for 4,707 proteins in the dataset, after removing nonsensical names such as “uncharacterized protein”. To identify the concepts mentioned in the texts we used Gimli [98], a machine-learning tool for gene and protein name recognition, together with dictionary matching to recognize other concepts from ten different semantic types including chemical entities, anatomical terms, diseases, pathways and GO terms. The dictionaries used contain around 1.3 million distinct names for around 400 thousand concepts. Based on the concept annotation of this corpus, we were able to calculate concept profiles for 22,720 protein pairs from the training dataset and 4,698,390 protein pairs for the classification dataset.

Primary protein sequence information
Several studies have been carried out where detection of protein-protein interaction is derived from information directly extracted from the amino-acid sequences [44-56]. The results indicate that the sequence information alone is sufficient to detect PPIs with reasonable accuracy [87] but may be improved if combined with other strategies.

Taking into account the primary protein sequence information, the following features have been considered in this work: occurrence of the 20 amino-acids in the protein sequence, protein atomic composition, molecular weight and atomic weight, forming a vector of 27 features. The interacting protein pair (X, Y) is represented by concatenating the corresponding feature vectors F_x and F_y, represented by (F_x, F_y).

We were able to obtain the sequence profile for 35,379 protein pairs from the training dataset and 6,703,945 protein pairs for the classification dataset.

Orthologous profiles
By definition, clusters of orthologous groups (COGs) are sets of orthologous genes or orthologous groups of paralogs from three or more phylogenetic trees. In essence, this means that two proteins from different lineages belonging to the same COG are orthologous. Orthologs are genes in different species that evolved from a common ancestor by speciation (i.e. divergent evolution). In contrast, paralogs are genes related by duplication within a genome [99].

We were able to obtain the orthologous profile for 9,636 protein pairs from the training dataset and 1,324,230 protein pairs for the classification dataset.

Biological process similarity
Previous studies have explored the use of GO annotation similarity between two proteins as a PPI predictor [59,102-105]. We downloaded biological process information from the GO Consortium [57] in March 2013 and calculated the depth of the GO terms (nodes) in the Directed Acyclic Graph (DAG), and the total number of proteins comprised between the smallest shared biological process (SSBP) for each pair of proteins and the following three branches. Since the depth of the GO terms in the DAG is implied in the total number of proteins, post-test odds analysis was performed solely on this feature to avoid redundancy. Such an approach was based on the general hypothesis that it is progressively more likely for the proteins comprised within a biological process to interact, if the total number of proteins involved in that process is progressively smaller.

We were able to obtain the gene ontology profile for 23,769 protein pairs from the training dataset and 5,130,103 protein pairs for the classification dataset.

Enriched conserved domain pairs
The Database of Protein Domain Interactions [106] (DOMINE) contains binary domain-domain interaction (DDI) data compiled from a collection of 15 databases and DDI prediction methods. Additionally, DOMINE provides a quality measure of the DDI confidence, as well as a binary classification of whether the domains are part of the same GO biological process. Here, we assume that whenever two given proteins possess one or more interacting domains between them, those proteins will interact. We adopted this DDI data collection as individual features in our approach. Since DOMINE pro-
vides DDI information from several sources, we tallied Lee et al. [100] aimed to expand the interactomes of the number of sources that identified a DDI. This strat- various organisms by applying orthology-based methods egy confers higher reliability on DDI pairs with higher in inter-species PPI prediction. They expanded ortholo- scores (closer to 15, the maximum number of DDI gous pairs of 18 eukaryotic organisms and merged them sources). with experimental PPI datasets, allowing the inference of We were able to obtain the domain profile for 5,994 PPIs for various species. protein pairs from the training dataset and 516,609 pro- In this work we used the Search Tool for the Retrieval tein pairs for the classification dataset. of Interacting Genes/Proteins (STRING) [101] database to obtain COGs and their respective combined scores. Data classification and validation The combined score is computed by integrating the like- The proposed approach was developed, tested, optimized lihoods from the different types of evidence, correcting and performed using Orange, an open-source bioinfor- for the probability of randomly observing an interaction matics tool featuring Python scripting and a visual and [101]. This enhances the predictive performance of the programmatic interface. We used the naïve Bayes [107] method, as a combined score is only computed when classifier to predict PPIs in our data. The naïve Bayes clas- more than one of the data sources in STRING supports sifier calculates the conditional probability of each attri- a given association. bute A given the class label C, from the training data. The i Coelho et al. BMC Systems Biology 2014, 8:24 Page 10 of 12 http://www.biomedcentral.com/1752-0509/8/24 Bayes rule is then applied to calculate the probability of C conception of the study, coordinated it, and helped to draft the manuscript. All authors read and approved the final manuscript. 
given the specific instance of A ,…, A , and then assessing 1 n the class with the greatest posterior probability, ensuing classification [108]. Acknowledgements This work has received support from the RD-CONNECT project (EC contract The receiver operating characteristic (ROC) curve, number 305444). Edgar D. Coelho is funded by Fundação para a Ciência e which is the plot of the true positive (TP) rate with the Tecnologia, FCT, under Grant SFRH/BD/86343/2012. false positive (FP) rate, depicting the relative trade-off be- Author details tween both rates [109] was used to evaluate the method’s Department of Electronics, Telecommunications and Informatics (DETI), performance. When comparing classifiers with very simi- Institute of Electronics and Telematics Engineering of Aveiro (IEETA), lar ROC curves, it may be necessary to estimate a single University of Aveiro, Aveiro, Portugal. Department of Informatics Engineering (DEI), University of Coimbra, Coimbra, Portugal. Centre for scalar value to represent the expected performance. One Informatics and Systems of the University at Coimbra (CISUC), University of of the most common methods is calculation of the area Coimbra, Coimbra, Portugal. Department of Informatics Engineering and under the ROC curve (AUC) [110], which we used to Systems, Polytechnic Institute of Coimbra, Engineering Institute of Coimbra (IPC-ISEC), Coimbra, Portugal. Department of Health Sciences, Institute of compare the naïve Bayes classifier. Therefore, we assessed Health Sciences, The Catholic University of Portugal, Viseu, Portugal. Centre the individual contributions of each feature in terms of for Neurosciences and Cell Biology, University of Coimbra, Coimbra, Portugal. classification accuracy (CA), area under curve (AUC), F1- Received: 27 August 2013 Accepted: 17 February 2014 score, precision and recall. Published: 27 February 2014 Interactome analysis We used Cytoscape to visualize and validate the ob- References 1. 
Phizicky EM, Fields S: Protein-protein interactions: methods for detection tained PPI network. The PPIs were classified as “HU- and analysis. Microbiol Rev 1995, 59:94–123. MAN-HUMAN”, if the interacting proteins were only of 2. Dyer MD, Murali TM, Sobral BW: Computational prediction of host- human origin, as “MICRO-MICRO”, if the interacting pathogen protein–protein interactions. Bioinformatics 2007, 23:i159–i166. 3. Littler SJ, Hubbard SJ: Conservation of orientation and sequence in proteins were only of microbial origin, or as “HUMAN- protein domain–domain interactions. J Mol Biol 2005, 345:1265–1279. MICRO”. 4. Valdar WS, Thornton JM: Protein-protein interfaces: analysis of amino acid We imported the network data to Cytoscape defining conservation in homodimers. Proteins 2001, 42:108–124. 5. Aloy P, Ceulemans H, Stark A, Russell RB: The relationship between the two proteins in the same interacting protein pair as sequence and interaction divergence in proteins. J Mol Biol 2003, Source Interaction (protein one) and Target Interaction 332:989–998. (protein two). The chosen Interaction Type was the 6. Teichmann SA: The constraints protein-protein interactions place on sequence divergence. J Mol Biol 2002, 324:399–407. above-mentioned organism-organism classification. A 7. Panchenko AR, Wolf YI, Panchenko LA, Madej T: Evolutionary plasticity of file containing node attributes was also imported, con- protein families: coupling between sequence and structure variation. taining microorganism and biological process informa- Proteins 2005, 61:535–544. 8. Rual J-F, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, tion extracted from the UniProt database pertaining to Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem each individual protein in the network. 
M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Boxem M, Milstein S, Rosenberg J, Goldberg DS, Zhang LV, Wong SL, Franklin G, Li S, Albala JS, Lim J, et al: Towards a proteome-scale map of Availability the human protein-protein interaction network. Nature 2005, 437:1173–1178. All data required to analyze the results and re-run this 9. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover experiment are available for download at http://bioinfor- D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg matics.ua.pt/software/oralint. This includes the unique JM: A comprehensive analysis of protein-protein interactions in list of UniProt AC for the proteins in the oral cavity, the Saccharomyces cerevisiae. Nature 2000, 403:623–627. 10. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive gold standard of interactions, the dataset used for train- two-hybrid analysis to explore the yeast protein interactome. ing and validation, the predictions obtained, and the Proceedings of the National Academy of Sciences 2001, 98:4569–4574. Cytoscape project file with the network. 11. Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, Séraphin B: A generic protein purification method for protein complex characterization and Competing interests proteome exploration. Nat Biotechnol 1999, 17:1030–1032. The authors declare that they have no competing interests. 12. Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 1998, 95:14863–14868. Authors’ contributions 13. MacBeath G, Schreiber SL: Printing proteins as microarrays for high- EDC participated in the design of the study, constructed the positive and throughput function determination. Science 2000, 289:1760–1763. negative datasets, performed the analysis of hub proteins, characterized and 14. 
Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen analysed the human-microbial interactome, and drafted the manuscript. JPA R, Bidlingmaier S, Houfek T, Mitchell T, Miller P, Dean RA, Gerstein M, Snyder conceived the study, participated in its design, performed feature construction M: Global analysis of protein activities using proteome chips. and selection, parameterized the classifier, and helped to draft the manuscript. Science 2001, 293:2101–2105. SM performed the text-mining analysis and helped to draft the manuscript. CP carried out primary protein sequence analysis and helped to draft the 15. Jones RB, Gordus A, Krall JA, MacBeath G: A quantitative protein manuscript. NR and MJC analysed the role of the microbiome in periodontitis interaction network for the ErbB receptors using protein microarrays. and helped to draft the manuscript. MB and JLO participated in the design and Nature 2006, 439:168–174. Coelho et al. BMC Systems Biology 2014, 8:24 Page 11 of 12 http://www.biomedcentral.com/1752-0509/8/24 16. Ye P, Peyser BD, Pan X, Boeke JD, Spencer FA, Bader JS: Gene function 44. Bock JR, Gough DA: Predicting protein–protein interactions from primary prediction from congruent synthetic lethal interactions in yeast. structure. Bioinformatics 2001, 17:455–460. Mol Syst Biol 2005, 1(2005):0026. 45. Bock JR, Gough DA: Whole-proteome interaction mining. Bioinformatics 17. Smith GP: Filamentous fusion phage: novel expression vectors that 2003, 19:125–134. display cloned antigens on the virion surface. Science 1985, 228:1315–1317. 46. Martin S, Roe D, Faulon J-L: Predicting protein–protein interactions using 18. Tong AHY, Evangelista M, Parsons AB, Xu H, Bader GD, Pagé N, Robinson M, signature products. Bioinformatics 2005, 21:218–226. Raghibizadeh S, Hogue CWV, Bussey H, Andrews B, Tyers M, Boone C: 47. 
Ben-Hur A, Noble WS: Kernel methods for predicting protein–protein Systematic genetic analysis with ordered arrays of yeast deletion interactions. Bioinformatics 2005, 21:38–46. mutants. Science 2001, 294:2364–2368. 48. Pitre S, Dehne F, Chan A, Cheetham J, Duong A, Emili A, Gebbia M, 19. Yan Y, Marriott G: Analysis of protein interactions using fluorescence Greenblatt J, Jessulat M, Krogan N, Luo X, Golshani A: PIPE: a protein- technologies. Curr Opin Chem Biol 2003, 7:635–640. protein interaction prediction engine based on the re-occurring short 20. Cooper MA: Label-free screening of bio-molecular interactions. polypeptide sequences between known interacting protein pairs. Anal Bioanal Chem 2003, 377:834–842. BMC Bioinformatics 2006, 7:365. 21. Yang Y, Wang H, Erie DA: Quantitative characterization of biomolecular 49. Nanni L, Lumini A: An ensemble of K-local hyperplanes for predicting assemblies and interactions using atomic force microscopy. protein–protein interactions. Bioinformatics 2006, 22:1207–1210. Methods 2003, 29:175–187. 50. Nanni L: Hyperplanes for predicting protein–protein interactions. 22. Baumeister W, Grimm R, Walz J: Electron tomography of molecules and Neurocomputing 2005, 69:257–263. cells. Trends Cell Biol 1999, 9:81–85. 51. Shen J, Zhang J, Luo X, Zhu W, Yu K, Chen K, Li Y, Jiang H: Predicting 23. Xia JF, Wang SL, Lei YK: Computational methods for the prediction of protein–protein interactions based only on sequences information. protein-protein interactions. Protein Pept Lett 2010, 17:1069–1078. Proc Natl Acad Sci 2007, 104:4337–4341. 24. Jaeger S, Gaudan S, Leser U, Rebholz-Schuhmann D: Integrating protein- 52. Guo Y, Yu L, Wen Z, Li M: Using support vector machine combined with protein interactions and text mining for protein function prediction. auto covariance to predict protein–protein interactions from protein BMC Bioinformatics 2008, 9(Suppl 8):S2. sequences. Nucleic Acids Res 2008, 36:3025–3030. 25. 
Tamames J, Casari G, Ouzounis C, Valencia A: Conserved clusters of 53. Xia JF, Han K, Huang DS: Sequence-based prediction of protein-protein functionally related genes in two bacterial genomes. J Mol Evol 1997, interactions by means of rotation forest and autocorrelation descriptor. 44:66–73. Protein Pept Lett 2010, 17:137–145. 26. Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: 54. Rajasekaran S, Merlin JC, Kundeti V, Mi T, Oommen A, Vyas J, Alaniz I, Chung a fingerprint of proteins that physically interact. Trends Biochem Sci 1998, K, Chowdhury F, Deverasatty S, Irvey TM, Lacambacal D, Lara D, 23:324–328. Panchangam S, Rathnayake V, Watts P, Schiller MR: A computational tool 27. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene for identifying minimotifs in protein-protein interactions and improving clusters to infer functional coupling. Proc Natl Acad Sci U S A 1999, the accuracy of minimotif predictions. Proteins 2011, 79:153–164. 96:2896–2901. 55. Knisley D, Knisley J: Predicting protein–protein interactions using graph 28. Blumenthal T: Gene clusters and polycistronic transcription in eukaryotes. invariants and a neural network. Comput Biol Chem 2011, 35:108–113. Bioessays 1998, 20:480–487. 56. Zhang Y, Zhang D, Mi G, Ma D, Li G, Guo Y, Li M, Zhu M: Using ensemble 29. Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction methods to deal with imbalanced data in predicting protein-protein maps for complete genomes based on gene fusion events. Nature 1999, interactions. Comput Biol Chem 2012, 36:36–41. 402:86–90. 57. Harris MA, Clark J, Ireland A, Lomax J, Ashburner M, Foulger R, Eilbeck K, 30. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Lewis S, Marshall B, Mungall C, Richter J, Rubin GM, Blake JA, Bult C, Dolan Detecting protein function and protein-protein interactions from M, Drabkin H, Eppig JT, Hill DP, Ni L, Ringwald M, Balakrishnan R, Cherry JM, genome sequences. 
Science 1999, 285:751–753. Christie KR, Costanzo MC, Dwight SS, Engel S, Fisk DG, Hirschman JE, Hong 31. Ouzounis C, Kyrpides N: The emergence of major cellular processes in EL, Nash RS, et al: The gene ontology (GO) database and informatics evolution. FEBS Lett 1996, 390:119–123. resource. Nucleic Acids Res 2004, 32:D258–D261. 32. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: 58. Jain S, Bader GD: An improved method for scoring protein-protein Assigning protein functions by comparative genome analysis: protein interactions using semantic similarity within the gene ontology. phylogenetic profiles. Proc Natl Acad Sci U S A 1999, 96:4285–4288. BMC Bioinformatics 2010, 11:562. 33. Barker D, Pagel M: Predicting functional gene links from phylogenetic- 59. Maetschke SR, Simonsen M, Davis MJ, Ragan MA: Gene Ontology-driven statistical analyses of whole genomes. PLoS Comput Biol 2005, 1:e3. inference of protein–protein interactions using inducers. Bioinformatics 2012, 28:69–75. 34. Najafabadi HS, Salavati R: Sequence-based prediction of protein-protein interactions by means of codon usage. Genome Biol 2008, 9:R87. 60. Park B, Cui G, Lee H, Huang D-S, Han K: PPISearchEngine: gene ontology- 35. Aloy P, Russell RB: Interrogating protein interaction networks through based search for protein–protein interactions. Comput Methods Biomech structural biology. Proc Natl Acad Sci 2002, 99:5896–5901. Biomed Engin 2012, 16:1–8. 36. Lu L, Lu H, Skolnick J: MULTIPROSPECTOR: an algorithm for the prediction 61. Wu X, Zhu L, Guo J, Zhang DY, Lin K: Prediction of yeast protein-protein of protein-protein interactions by multimeric threading. Proteins 2002, interaction network: insights from the Gene Ontology and annotations. 49:350–364. Nucleic Acids Res 2006, 34:2137–2150. 37. Sprinzak E, Margalit H: Correlated sequence-signatures as markers of 62. Davis FP, Barkan DT, Eswar N, McKerrow JH, Sali A: Host pathogen protein protein-protein interaction. 
J Mol Biol 2001, 311:681–692. interactions predicted by comparative modeling. Protein Sci 2007, 16:2585–2596. 38. Deng M, Mehta S, Sun F, Chen T: Inferring domain-domain interactions 63. Tastan O, Qi Y, Carbonell JG, Klein-Seetharaman J: Prediction of interactions from protein-protein interactions. Genome Res 2002, 12:1540–1548. between HIV-1 and human proteins by information integration. Pac Symp Biocomput 2009:516–527. 39. Chen L, Wu LY, Wang Y, Zhang XS: Inferring protein interactions from experimental data by association probabilistic method. Proteins 2006, 64. Jeong H, Mason SP, Barabasi AL, Oltvai ZN: Lethality and centrality in 62:833–837. protein networks. Nature 2001, 411:41–42. 40. Morrison JL, Breitling R, Higham DJ, Gilbert DR: A lock-and-key model for 65. Wuchty S, Almaas E: Peeling the yeast protein network. Proteomics 2005, protein-protein interactions. Bioinformatics 2006, 22:2012–2019. 5:444–449. 41. Huang C, Morcos F, Kanaan SP, Wuchty S, Chen DZ, Izaguirre JA: Predicting 66. Arrais JP, Rosa N, Melo J, Coelho ED, Amaral D, Correia MJ, Barros M, Oliveira protein-protein interactions from protein domains using a set cover JL: OralCard: a bioinformatic tool for the study of oral proteome. approach. IEEE/ACM Trans Comput Biol Bioinform 2007, 4:78–87. Arch Oral Biol 2013, 58(7):762–772. 42. Chen X-W, Liu M: Prediction of protein–protein interactions using 67. Rosa N, Correia MJ, Arrais JP, Lopes P, Melo J, Oliveira JL, Barros M: From random decision forest framework. Bioinformatics 2005, 21:4394–4400. the salivary proteome to the OralOme: comprehensive molecular oral biology. Arch Oral Biol 2012, 57(7):853–864. 43. Wang R-S, Wang Y, Wu L-Y, Zhang X-S, Chen L: Analysis on multi-domain cooperation for predicting protein-protein interactions. BMC Bioinformatics 68. Vecchiola C, Pandey S, Buyya R: High-performance cloud computing: 2007, 8:391. aviewofscientificapplications. 2009:4–16. Proceedings of the 10th Coelho et al. 
BMC Systems Biology 2014, 8:24 Page 12 of 12 http://www.biomedcentral.com/1752-0509/8/24 International Symposium on Pervasive Systems, Algorithms and 90. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Networks I-SPAN 2009, IEEE Computer Society. database of interacting proteins: 2004 update. Nucleic Acids Res 2004, 69. Yamane K, Nambu T, Yamanaka T, Mashimo C, Sugimori C, Leung K-P, 32:D449–D451. Fukushima H: Complete genome sequence of rothia mucilaginosa DY-18: 91. Keshava Prasad TS, Goel R, Kandasamy K, Keerthikumar S, Kumar S, a clinical isolate with dense meshwork-like structures from a persistent Mathivanan S, Telikicherla D, Raju R, Shafreen B, Venugopal A, Balakrishnan apical periodontitis lesion. Sequencing 2010, 2010:1–6. L, Marimuthu A, Banerjee S, Somanathan DS, Sebastian A, Rani S, Ray S, 70. Batty I: Actinomyces odontolyticus, a new species of actinomycete Harrys Kishore CJ, Kanth S, Ahmed M, Kashyap MK, Mohmood R, regularly isolated from deep carious dentine. J Pathol Bacteriol 1958, Ramachandra YL, Krishna V, Rahiman BA, Mohan S, Ranganathan P, 75:455–459. Ramabadran S, Chaerkady R, Pandey A: Human protein reference 71. McKay LI, Cidlowski JA: Molecular control of immune/inflammatory database–2009 update. Nucleic Acids Res 2009, 37:D767–772. responses: interactions between nuclear factor-κB and steroid 92. Kerrien S, Aranda B, Breuza L, Bridge A, Broackes-Carter F, Chen C, Duesbury receptor-signaling pathways. Endocrine Rev 1999, 20:435–459. M, Dumousseau M, Feuermann M, Hinz U, Jandrasits C, Jimenez RC, Khadake J, Mahadevan U, Masson P, Pedruzzi I, Pfeiffenberger E, Porras P, Raghunath 72. McDevitt H, Munson S, Ettinger R, Wu A: Multiple roles for tumor necrosis A, Roechert B, Orchard S, Hermjakob H: The IntAct molecular interaction factor-alpha and lymphotoxin alpha/beta in immunity and autoimmunity. database in 2012. Nucleic Acids Res 2012, 40:D841–D846. Arthritis Res 2002, 4:S141–S152. 93. 
Chatr-aryamontri A, Ceol A, Montecchi Palazzi L, Nardelli G, Schneider MV, 73. Barnard JA, Beauchamp RD, Russell WE, Dubois RN, Coffey RJ: Epidermal Castagnoli L, Cesareni G: MINT, the molecular interaction database: 2012 growth factor-related peptides and their relevance to gastrointestinal update. Nucleic Acids Res 2012, 40:D857–861. pathophysiology. Gastroenterology 1995, 108:564–580. 94. Consortium TU: Reorganizing the protein space at the Universal protein 74. Galan JE, Pace J, Hayman MJ: Involvement of the epidermal growth factor resource (UniProt). Nucleic Acids Res 2012, 40:D71–D75. receptor in the invasion of cultured mammalian cells by Salmonella 95. Ben-Hur A, Noble WS: Choosing negative examples for the prediction of typhimurium. Nature 1992, 357:588–589. protein-protein interactions. BMC Bioinformatics 2006, 7(Suppl 1):S2. 75. Zhu W, Phan QT, Boontheung P, Solis NV, Loo JA, Filler SG: EGFR and HER2 96. van Haagen HHHBM, Hoen PAC't, Botelho Bovo A, de Morrée A, van Mulligen receptor kinase signaling mediate epithelial cell invasion by Candida EM, Chichester C, Kors JA, den Dunnen JT, van Ommen G-JB, van der Maarel albicans during oropharyngeal infection. Proc Natl Acad Sci U S A 2012, SM, Medina KernV,MonsB,Schuemie MJ: Novel protein-protein interactions 109:14194–14199. inferred from literature context. PLoS One 2009, 4:e7894. 76. Strong JE, Tang D, Lee PW: Evidence that the epidermal growth factor 97. Jelier R, Schuemie MJ, Roes PJ, van Mulligen EM, Kors JA: Literature-based receptor on host cells confers reovirus infection efficiency. Virology 1993, concept profiles for gene annotation: the issue of weighting. Int J Med 197:405–411. Inform 2008, 77:354–362. 77. Eppstein DA, Vivienne Marsh Y, Schreiber AB, Newman SR, Todaro GJ, 98. Campos D, Matos S, Oliveira J: Gimli: open source and high-performance Nestor JJ Jr: Epidermal growth factor receptor occupancy inhibits biomedical name recognition. BMC Bioinformatics 2013, 14:54. vaccinia virus infection. 
Nature 1985, 318:663–665. 99. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein 78. Buret A, Gall DG, Olson ME, Hardin JA: The role of the epidermal growth families. Science 1997, 278:631–637. factor receptor in microbial infections of the gastrointestinal tract. 100. Lee S-A, C-h C, Tsai C-H, Lai J-M, Wang F-S, Kao C-Y, Huang C-YF: Ortholog- Microbes Infect 1999, 1:1139–1144. based protein-protein interaction prediction and its application to 79. Llena-Puy MC, Montanana-Llorens C, Forner-Navarro L: Fibronectin levels in inter-species interactions. BMC Bioinformatics 2008, 9(Suppl 12):S11. stimulated whole-saliva and their relationship with cariogenic oral 101. Szklarczyk D, Franceschini A, Kuhn M, Simonovic M, Roth A, Minguez P, bacteria. Int Dent J 2000, 50:57–59. Doerks T, Stark M, Muller J, Bork P, Jensen LJ, von Mering C: The STRING 80. Henderson B, Nair S, Pallas J, Williams MA: Fibronectin: a multidomain host database in 2011: functional interaction networks of proteins, globally adhesin targeted by bacterial fibronectin-binding proteins. integrated and scored. Nucleic Acids Res 2011, 39:D561–568. FEMS Microbiol Rev 2011, 35:147–200. 102. Lin N, Wu B, Jansen R, Gerstein M, Zhao H: Information assessment on 81. Min K-W, Hwang J-W, Lee J-S, Park Y, T-a T, Yoon J-B: TIP120A associates predicting protein-protein interactions. BMC Bioinformatics 2004, 5:154. with cullins and modulates ubiquitin ligase activity. J. Biol. Chem 2003, 103. Miller JP, Lo RS, Ben-Hur A, Desmarais C, Stagljar I, Noble WS, Fields S: 278:15905–15910. Large-scale identification of yeast integral membrane protein 82. Sarikas A, Hartmann T, Pan ZQ: The cullin protein family. Genome Biol 2011, interactions. Proc Natl Acad Sci U S A 2005, 102:12123–12128. 12:220. 104. Patil A, Nakamura H: Filtering high-throughput protein-protein interaction 83. Zheng J, Yang X, Harrell JM, Ryzhikov S, Shim E-H, Lykke-Andersen K, Wei N, data using a combination of genomic features. 
BMC Bioinformatics 2005, Sun H, Kobayashi R, Zhang H: CAND1 binds to unneddylated CUL1 and 6:100. regulates the formation of SCF ubiquitin E3 ligase complex. 105. Qi Y, Bar-Joseph Z, Klein-Seetharaman J: Evaluation of different biological Mol Cell 2002, 10:1519–1526. data and computational classification methods for use in protein 84. Munro P, Flatau G, Lemichez E: Bacteria and the ubiquitin pathway. interaction prediction. Proteins 2006, 63:490–500. Curr Opin Microbiol 2007, 10:39–46. 106. Yellaboina S, Tasneem A, Zaykin DV, Raghavachari B, Jothi R: DOMINE: a 85. Curtis H, Dirk G, Rob K, Sahar A, Badger JH, Chinwalla AT, Creasy HH, Earl comprehensive collection of known and predicted domain-domain AM, FitzGerald MG, Fulton RS, Giglio MG, Kymberlie H-P, Lobos EA, Ramana interactions. Nucleic Acids Res 2011, 39:D730–D735. M, Vincent M, Martin JC, Makedonka M, Muzny DM, Sodergren EJ, Versalovic 107. Duda R, Hart P: Pattern Classification and Scene Analysis. New York: John J, Wollam AM, Worley KC, Wortman JR, Young SK, Qiandong Z, Aagaard KM, Wiley & Sons Inc; 1973. Abolude OO, Emma A-V, Alm EJ, Lucia A, et al: Structure, function and 108. Friedman N, Geiger D, Goldszmidt M: Bayesian Network Classifiers. diversity of the healthy human microbiome. Nature 2012, 486:207–214. Mach Learn 1997, 29:131–163. 86. Avila-Campos MJ, Velasquez-Melendez G: Prevalence of putative periodon- 109. Swets JA: Measuring the accuracy of diagnostic systems. Science 1988, topathogens from periodontal patients and healthy subjects in Sao 240:1285–1293. Paulo, SP, Brazil. Rev Inst Med Trop Sao Paulo 2002, 44:1–5. 110. Hanley JA, McNeil BJ: The meaning and use of the area under a receiver 87. Antikainen J, Kuparinen V, Lahteenmaki K, Korhonen TK: Enolases from operating characteristic (ROC) curve. Radiology 1982, 143:29–36. Gram-positive bacterial pathogens and commensal lactobacilli share functional similarity in virulence-associated traits. 
FEMS Immunol Med doi:10.1186/1752-0509-8-24 Microbiol 2007, 51:526–534. Cite this article as: Coelho et al.: Computational prediction of the 88. Levy ED, Pereira-Leal JB: Evolution and dynamics of protein interactions human-microbial oral interactome. BMC Systems Biology 2014 8:24. and networks. Curr Opin Struct Biol 2008, 18:349–357. 89. Stark C, Breitkreutz B-J, Reguly T, Boucher L, Breitkreutz A, Tyers M: BioGRID: a general repository for interaction datasets. Nucleic Acids Res 2006, 34:D535–539. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png BMC Systems Biology Springer Journals http://www.deepdyve.com/lp/springer-journals/computational-prediction-of-the-human-microbial-oral-interactome-QvpN58GWJs
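
The argument used in the Methods to justify random pairing for the negative data set, P(e) = NK / (N(N−1)) = K / (N−1), can be checked numerically. The following is a minimal sketch, not the authors' code; the function name `random_pair_error` is hypothetical and introduced here only for illustration. The values N = 4,707 and K between 6 and 16 are those stated in the text.

```python
def random_pair_error(n_proteins: int, avg_degree: float) -> float:
    """Probability that a randomly picked protein pair is a true interaction.

    Evaluates P(e) = N*K / (N*(N-1)) = K / (N-1), where N is the number of
    proteins and K is the average degree of the PPI network.
    """
    if n_proteins < 2:
        raise ValueError("at least two proteins are needed to form a pair")
    return avg_degree / (n_proteins - 1)


if __name__ == "__main__":
    N = 4707  # proteins in the oral proteome, as stated in the text
    for k in (6, 16):  # typical range of the average degree K given in the text
        print(f"K={k}: P(e) = {random_pair_error(N, k):.4f}")
```

For the stated range of K the estimate stays well below 1%, which is the sense in which the text treats P(e) as approximately zero.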
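
The symmetric uncertainty coefficient used for the literature concept profiles, U(X, Y) = 2 [H(X) + H(Y) − H(X, Y)] / [H(X) + H(Y)], can be sketched from document frequency counts as follows. This is an illustrative sketch under a stated assumption: X and Y are treated as binary per-document indicators (the protein, respectively the concept, is or is not mentioned in an abstract). The text only says that the entropies are "calculated based on document frequency counts", so this binary encoding, and the names `entropy` and `symmetric_uncertainty`, are our assumptions rather than the published implementation.

```python
import math
from typing import Sequence


def entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def symmetric_uncertainty(n_docs: int, df_x: int, df_y: int, df_xy: int) -> float:
    """U(X, Y) = 2 * (H(X) + H(Y) - H(X, Y)) / (H(X) + H(Y)).

    n_docs: corpus size; df_x, df_y: documents mentioning X, Y;
    df_xy: documents mentioning both (assumed binary occurrence model).
    """
    # 2x2 contingency table of (X present/absent) x (Y present/absent)
    joint = [df_xy, df_x - df_xy, df_y - df_xy, n_docs - df_x - df_y + df_xy]
    if min(joint) < 0:
        raise ValueError("inconsistent document frequency counts")
    h_x = entropy([df_x, n_docs - df_x])
    h_y = entropy([df_y, n_docs - df_y])
    h_xy = entropy(joint)
    denom = h_x + h_y
    if denom == 0:
        return 0.0  # neither variable varies, so no association is measurable
    return 2.0 * (h_x + h_y - h_xy) / denom
```

Under this model the coefficient is 0 when protein and concept occur independently and 1 when they always co-occur, which matches its use as a weight measuring the degree of association.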
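
The AUC used in the Methods to summarize classifier performance equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half [110]. The sketch below computes it directly from that definition. It is a plain-Python illustration, not the Orange routine used in the study; the function name `roc_auc` and the toy labels and scores in the usage note are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for interacting (positive) pairs, 0 for non-interacting pairs.
    scores: classifier output, e.g. the posterior probability of interaction.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute the AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))
```

As a usage example, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` compares four positive-negative pairs, three of which are ranked correctly, giving 0.75. The quadratic pairwise loop is chosen for clarity; a rank-based computation would be preferable for data sets of the size reported in Table 3.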

References (121)

Kyoengwoo Min, J. Hwang, Jong-Sik Lee, Yoon Park, Taka‐aki Tamura, Jong-Bok Yoon (2003)
TIP120A Associates with Cullins and Modulates Ubiquitin Ligase Activity*
The Journal of Biological Chemistry, 278
Debra Knisley, J. Knisley (2011)
Predicting protein-protein interactions using graph invariants and a neural network
Computational biology and chemistry, 35 2
O Tastan, Y Qi, JG Carbonell, J Klein-Seetharaman (2009)
Pac Symp Biocomput
The Consortium (2011)
Reorganizing the protein space at the Universal Protein Resource (UniProt)
Nucleic Acids Research, 40
W. Baumeister, R. Grimm, J. Walz (1999)
Electron tomography of molecules and cells.
Trends in cell biology, 9 2
Luonan Chen, Ling-Yun Wu, Yong Wang, Xiang-Sun Zhang (2006)
Inferring protein interactions from experimental data by association probabilistic method
Proteins: Structure, 62
Yuling Yan, G. Marriott (2003)
Analysis of protein interactions using fluorescence technologies.
Current opinion in chemical biology, 7 5
E. Marcotte, M. Pellegrini, H. Ng, Danny Rice, T. Yeates, D. Eisenberg (1999)
Detecting protein function and protein-protein interactions from genome sequences.
Science, 285 5428
Hawoong Jeong, S. Mason, A. Barabási, Z. Oltvai (2001)
Lethality and centrality in protein networks
Nature, 411
Sylvain Pitre, F. Dehne, A. Chan, James Cheetham, Alex Duong, A. Emili, M. Gebbia, J. Greenblatt, M. Jessulat, N. Krogan, Xuemei Luo, A. Golshani (2006)
PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs
BMC Bioinformatics, 7
D. Eppstein, Y. Marsh, A. Schreiber, S. Newman, G. Todaro, J. Nestor (1985)
Epidermal growth factor receptor occupancy inhibits vaccinia virus infection
Nature, 318
Richard Jones, A. Gordus, Jordan Krall, G. MacBeath (2006)
A quantitative protein interaction network for the ErbB receptors using protein microarrays
Nature, 439
C. Huttenhower, D. Gevers, R. Knight, Sahar Abubucker, J. Badger, A. Chinwalla, H. Creasy, A. Earl, Michael Fitzgerald, R. Fulton, M. Giglio, Kymberlie Hallsworth-Pepin, E. Lobos, R. Madupu, V. Magrini, John Martin, M. Mitreva, D. Muzny, E. Sodergren, J. Versalovic, A. Wollam, K. Worley, J. Wortman, Sarah Young, Qiandong Zeng, K. Aagaard, Olukemi Abolude, E. Allen-Vercoe, E. Alm, Lucia Alvarado, G. Andersen, S. Anderson, Elizabeth Appelbaum, H. Arachchi, G. Armitage, Cesar Arze, T. Ayvaz, Carl Baker, L. Begg, Tsegahiwot Belachew, Veena Bhonagiri, Monika Bihan, M. Blaser, Toby Bloom, Vivien Bonazzi, J. Brooks, G. Buck, C. Buhay, D. Busam, Joseph Campbell, S. Canon, B. Cantarel, P. Chain, I. Chen, Lei Chen, Shaila Chhibba, Ken Chu, Dawn Ciulla, J. Clemente, S. Clifton, S. Conlan, J. Crabtree, M. Cutting, Noam Davidovics, Catherine Davis, T. DeSantis, C. Deal, Kimberley Delehaunty, F. Dewhirst, E. Deych, Yan Ding, D. Dooling, Shannon Dugan, W. Dunne, A. Durkin, Robert Edgar, R. Erlich, Candace Farmer, R. Farrell, Karoline Faust, M. Feldgarden, Victor Felix, Sheila Fisher, A. Fodor, L. Forney, Les Foster, V. Francesco, Jonathan Friedman, Dennis Friedrich, C. Fronick, L. Fulton, Hongyu Gao, Nathalia Garcia, G. Giannoukos, C. Giblin, Maria Giovanni, J. Goldberg, Johannes Goll, Antonio Gonzalez, Allison Griggs, Sharvari Gujja, S. Haake, B. Haas, Holli Hamilton, Emily Harris, Theresa Hepburn, Brandi Herter, D. Hoffmann, M. Holder, Clinton Howarth, Katherine Huang, Susan Huse, J. Izard, J. Jansson, Huaiyang Jiang, Catherine Jordan, Vandita Joshi, J. Katancik, W. Keitel, S. Kelley, C. Kells, N. King, D. Knights, H. Kong, O. Koren, S. Koren, Karthik Kota, C. Kovar, N. Kyrpides, P. Rosa, Sandy Lee, K. Lemon, N. Lennon, Cecil Lewis, L. Lewis, R. Ley, Kelvin Li, K. Liolios, Bo Liu, Yue Liu, C. Lo, C. Lozupone, R. Lunsford, T. Madden, A. Mahurkar, P. Mannon, E. Mardis, V. Markowitz, K. Mavromatis, J. McCorrison, Daniel McDonald, J. Mcewen, A. McGuire, P. Mcinnes, Teena Mehta, K. 
Mihindukulasuriya, J. Miller, P. Minx, I. Newsham, C. Nusbaum, M. O'Laughlin, Joshua Orvis, I. Pagani, Krishna Palaniappan, Shital Patel, Matthew Pearson, Jane Peterson, M. Podar, C. Pohl, K. Pollard, Mihai Pop, M. Priest, L. Proctor, X. Qin, J. Raes, J. Ravel, J. Reid, Mina Rho, R. Rhodes, Kevin Riehle, M. Rivera, B. Rodriguez-Mueller, Y. Rogers, M. Ross, C. Russ, Ravi Sanka, P. Sankar, J. Sathirapongsasuti, J. Schloss, P. Schloss, T. Schmidt, M. Scholz, L. Schriml, Alyxandria Schubert, N. Segata, J. Segre, W. Shannon, R. Sharp, T. Sharpton, N. Shenoy, N. Sheth, Gina Simone, Indresh Singh, C. Smillie, J. Sobel, Daniel Sommer, P. Spicer, G. Sutton, S. Sykes, D. Tabbaa, M. Thiagarajan, Chad Tomlinson, M. Torralba, T. Treangen, R. Truty, T. Vishnivetskaya, Jason Walker, Lu Wang, Zhengyuan Wang, D. Ward, W. Warren, M. Watson, Christopher Wellington, K. Wetterstrand, J. White, Katarzyna Wilczek-Boney, Yuanqing Wu, K. Wylie, T. Wylie, C. Yandava, Liang Ye, Yuzhen Ye, Shibu Yooseph, Bonnie Youmans, Lan Zhang, Yanjiao Zhou, Yiming Zhu, L. Zoloth, Jeremy Zucker, B. Birren, R. Gibbs, S. Highlander, B. Methé, K. Nelson, J. Petrosino, G. Weinstock, R. Wilson, O. White (2012)
Structure, Function and Diversity of the Healthy Human Microbiome
Nature, 486
Gene Ontology Consortium (2003)
The Gene Ontology (GO) database and informatics resource
Chris Stark, B. Breitkreutz, T. Reguly, Lorrie Boucher, A. Breitkreutz, M. Tyers (2005)
BioGRID: a general repository for interaction datasets
Nucleic Acids Research, 34
Ashwini Patil, Haruki Nakamura (2005)
Filtering high-throughput protein-protein interaction data using a combination of genomic features
BMC Bioinformatics, 6
Ping Ye, Brian Peyser, Xuewen Pan, J. Boeke, Forrest Spencer, J. Bader (2005)
Gene function prediction from congruent synthetic lethal interactions in yeast
Molecular Systems Biology, 1
Shawn Martin, D. Roe, J. Faulon (2005)
Predicting protein-protein interactions using signature products
Bioinformatics, 21 2
A. Ben-Hur, William Noble (2005)
Kernel methods for predicting protein-protein interactions
Bioinformatics, 21 Suppl 1
Nan Lin, Baolin Wu, R. Jansen, M. Gerstein, Hongyu Zhao (2004)
Information assessment on predicting protein-protein interactions
BMC Bioinformatics, 5
John Miller, Russell Lo, A. Ben-Hur, Cynthia Desmarais, I. Stagljar, William Noble, S. Fields (2005)
Large-scale identification of yeast integral membrane protein interactions.
Proceedings of the National Academy of Sciences of the United States of America, 102 34
I. Xenarios, Esteban Fernandez, L. Salwínski, X. Duan, Michael Thompson, E. Marcotte, D. Eisenberg (2001)
DIP: The Database of Interacting Proteins: 2001 update
Nucleic acids research, 29 1
S. Teichmann (2002)
The constraints protein-protein interactions place on sequence divergence.
Journal of molecular biology, 324 3
A. Panchenko, Y. Wolf, L. Panchenko, T. Madej (2005)
Evolutionary plasticity of protein families: Coupling between sequence and structure variation
Proteins: Structure, 61
Bruno Aranda, P. Achuthan, Y. Alam-Faruque, Irina Armean, A. Bridge, C. Derow, M. Feuermann, A. Ghanbarian, Samuel Kerrien, J. Khadake, J. Kerssemakers, C. Leroy, M. Menden, Magali Michaut, L. Montecchi-Palazzi, S. Neuhauser, S. Orchard, Victoria Perreau, B. Roechert, K. Eijk, H. Hermjakob (2009)
The IntAct molecular interaction database in 2010
Nucleic Acids Research, 38
G. Smith (1985)
Filamentous fusion phage: novel expression vectors that display cloned antigens on the virion surface.
Science, 228 4705
Weidong Zhu, Q. Phan, P. Boontheung, N. Solis, J. Loo, S. Filler (2012)
EGFR and HER2 receptor kinase signaling mediate epithelial cell invasion by Candida albicans during oropharyngeal infection
Proceedings of the National Academy of Sciences, 109
H. Najafabadi, R. Salavati (2008)
Sequence-based prediction of protein-protein interactions by means of codon usage
Genome Biology, 9
H. Zhu, M. Bilgin, R. Bangham, D. Hall, A. Casamayor, Paul Bertone, N. Lan, R. Jansen, S. Bidlingmaier, T. Houfek, T. Mitchell, P. Miller, R. Dean, M. Gerstein, M. Snyder (2001)
Global Analysis of Protein Activities Using Proteome Chips
Science, 293
M. Pellegrini, E. Marcotte, Michael Thompson, D. Eisenberg, T. Yeates (1999)
Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.
Proceedings of the National Academy of Sciences of the United States of America, 96 8
Yanzhi Guo, Lezheng Yu, Z. Wen, Meng-long Li (2008)
Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences
Nucleic Acids Research, 36
Herman Haagen, P. Hoen, Alessandro Bovo, Antoine Morrée, E. Mulligen, C. Chichester, J. Kors, J. Dunnen, G. Ommen, S. Maarel, V. Kern, B. Mons, M. Schuemie (2009)
Novel Protein-Protein Interactions Inferred from Literature Context
PLoS ONE, 4
A. Valencia, F. Pazos (2002)
Computational methods for the prediction of protein interactions.
Current opinion in structural biology, 12 3
W. Valdar, J. Thornton (2001)
Protein–protein interfaces: Analysis of amino acid conservation in homodimers
Proteins: Structure, 42
M. Eisen, P. Spellman, P. Brown, D. Botstein (1998)
Cluster analysis and display of genome-wide expression patterns.
Proceedings of the National Academy of Sciences of the United States of America, 95 25
E. Levy, J. Pereira-Leal (2008)
Evolution and dynamics of protein interactions and networks.
Current opinion in structural biology, 18 3
L. Licata, Leonardo Briganti, Daniele Peluso, L. Perfetto, M. Iannuccelli, Eugenia Galeota, F. Sacco, Anita Palma, A. Nardozza, Elena Santonico, L. Castagnoli, G. Cesareni (2011)
MINT, the molecular interaction database: 2012 update
Nucleic Acids Research, 40
Jorge Galán, J. Pace, Michael Hayman (1992)
Involvement of the epidermal growth factor receptor in the invasion of cultured mammalian cells by Salmonella typhimurium
Nature, 357
F. Davis, D. Barkan, N. Eswar, J. McKerrow, A. Sali (2007)
Host–pathogen protein interactions predicted by comparative modeling
Protein Science, 16
Rob Jelier, M. Schuemie, Peter-Jan Roes, E. Mulligen, J. Kors (2008)
Literature-based concept profiles for gene annotation: The issue of weighting
International journal of medical informatics, 77 5
S. Wuchty, E. Almaas (2005)
Peeling the yeast protein network
PROTEOMICS, 5
J. Antikainen, Veera Kuparinen, K. Lähteenmäki, Timo Korhonen (2007)
Enolases from Gram-positive bacterial pathogens and commensal lactobacilli share functional similarity in virulence-associated traits.
FEMS immunology and medical microbiology, 51 3
L. Nanni (2005)
Hyperplanes for predicting protein-protein interactions
Neurocomputing, 69
G. Rigaut, A. Shevchenko, B. Rutz, M. Wilm, M. Mann, B. Séraphin (1999)
A generic protein purification method for protein complex characterization and proteome exploration
Nature Biotechnology, 17
B. Henderson, S. Nair, Jaqueline Pallas, Mark Williams (2011)
Fibronectin: a multidomain host adhesin targeted by bacterial fibronectin-binding proteins.
FEMS microbiology reviews, 35 1
N. Friedman, D. Geiger, M. Goldszmidt (1997)
Bayesian Network Classifiers
Machine Learning, 29
James Strong, Damu Tang, P. Lee (1993)
Evidence that the epidermal growth factor receptor on host cells confers reovirus infection efficiency.
Virology, 197 1
M. Mehta, Shirley Liu, J. Silberg (2012)
A transposase strategy for creating libraries of circularly permuted proteins
Nucleic Acids Research, 40
Oznur Tastan, Yanjun Qi, J. Carbonell, J. Klein-Seetharaman (2008)
Prediction of Interactions Between HIV-1 and Human Proteins by Information Integration
Pacific Symposium on Biocomputing
Chengbang Huang, F. Morcos, Simon Kanaan, S. Wuchty, Danny Chen, J. Izaguirre (2007)
Predicting Protein-Protein Interactions from Protein Domains Using a Set Cover Approach
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4
R. Overbeek, M. Fonstein, M. D'Souza, G. Pusch, N. Maltsev (1999)
The use of gene clusters to infer functional coupling.
Proceedings of the National Academy of Sciences of the United States of America, 96 6
R. Duda, P. Hart (1974)
Pattern classification and scene analysis
Joel Bock, D. Gough (2001)
Predicting protein-protein interactions from primary structure
Bioinformatics, 17 5
L. Salwínski, Christopher A. Miller, Adam Smith, Frank Pettit, J. Bowie, D. Eisenberg (2004)
The Database of Interacting Proteins: 2004 update
Nucleic acids research, 32 Database issue
M. Llena-Puy, C. Montañana-Llorens, L. Forner-Navarro (2000)
Fibronectin levels in stimulated whole-saliva and their relationship with cariogenic oral bacteria.
International dental journal, 50 1
J. Hanley, B. McNeil (1982)
The meaning and use of the area under a receiver operating characteristic (ROC) curve.
Radiology, 143 1
R. Tatusov, E. Koonin, D. Lipman (1997)
A genomic perspective on protein families.
Science, 278 5338
B. Park, Guangyu Cui, Hyunjin Lee, De-shuang Huang, Kyungsook Han (2013)
PPISearchEngine: gene ontology-based search for protein–protein interactions
Computer Methods in Biomechanics and Biomedical Engineering, 16
C. Guerra, Marco Mina (2011)
Computational Methods for the Prediction of Protein-Protein Interactions
P. Aloy, H. Ceulemans, A. Stark, R. Russell (2003)
The relationship between sequence and interaction divergence in proteins.
Journal of molecular biology, 332 5
A. Ben-Hur, William Noble (2006)
Choosing negative examples for the prediction of protein-protein interactions
BMC Bioinformatics, 7
K. Yamane, T. Nambu, T. Yamanaka, Chiho Mashimo, Chieko Sugimori, K. Leung, H. Fukushima (2010)
Complete Genome Sequence of Rothia mucilaginosa DY-18: A Clinical Isolate with Dense Meshwork-Like Structures from a Persistent Apical Periodontitis Lesion
Sequencing, 2010
S. Maetschke, Martin Simonsen, Melissa Davis, M. Ragan (2012)
Gene Ontology-driven inference of protein-protein interactions using inducers
Bioinformatics, 28 1
L. Mckay, J. Cidlowski (1999)
Molecular control of immune/inflammatory responses: interactions between nuclear factor-kappa B and steroid receptor-signaling pathways.
Endocrine reviews, 20 4
J. Tamames, G. Casari, C. Ouzounis, A. Valencia (1997)
Conserved Clusters of Functionally Related Genes in Two Bacterial Genomes
Journal of Molecular Evolution, 44
Goparani Mishra, Mavanur Suresh, K. Kumaran, N. Kannabiran, Shubha Suresh, P. Bala, K. Shivakumar, N. Anuradha, Raghunath Reddy, T. Raghavan, Shalini Menon, G. Hanumanthu, Malvika Gupta, Sapna Upendran, Shweta Gupta, Madhu Mahesh, Bincy Jacob, Pinky Mathew, P. Chatterjee, K.H.S. Arun, Salil Sharma, K. Chandrika, Nandan Deshpande, Kshitish Palvankar, R. Raghavnath, R. Krishnakanth, Hiren Karathia, B. Rekha, Rashmi Nayak, G. Vishnupriya, H. Kumar, M. Nagini, G. Kumar, Rojan Jose, P. Deepthi, S. Mohan, T. Gandhi, H. Harsha, Krishna Deshpande, Malabika Sarker, T. Prasad, A. Pandey (2005)
Human protein reference database—2006 update
Nucleic Acids Research, 34
David Campos, Sérgio Matos, J. Oliveira (2013)
Gimli: open source and high-performance biomedical name recognition
BMC Bioinformatics, 14
Matthew Dyer, T. Murali, B. Sobral (2007)
Computational prediction of host-pathogen protein-protein interactions
Bioinformatics, 23 13
A. Tong, M. Evangelista, A. Parsons, H. Xu, Gary Bader, N. Pagé, M. Robinson, S. Raghibizadeh, C. Hogue, H. Bussey, B. Andrews, M. Tyers, C. Boone (2001)
Systematic Genetic Analysis with Ordered Arrays of Yeast Deletion Mutants
Science, 294
Jun-Feng Xia, Kyungsook Han, De-shuang Huang (2010)
Sequence-based prediction of protein-protein interactions by means of rotation forest and autocorrelation descriptor.
Protein and peptide letters, 17 1
Nuno Rosa, M. Correia, Joel Arrais, Pedro Lopes, J. Melo, J. Oliveira, M. Barros (2012)
From the salivary proteome to the OralOme: comprehensive molecular oral biology.
Archives of oral biology, 57 7
P. Munro, G. Flatau, E. Lemichez (2007)
Bacteria and the ubiquitin pathway.
Current opinion in microbiology, 10 1
G. MacBeath, S. Schreiber (2000)
Printing proteins as microarrays for high-throughput function determination.
Science, 289 5485
Yanjun Qi, Z. Bar-Joseph, J. Klein-Seetharaman (2006)
Evaluation of different biological data and computational classification methods for use in protein interaction prediction
Proteins: Structure, 63
Hugh McDevitt, Sibyl Munson, Rachel Ettinger, Ava Wu (2002)
Multiple roles for tumor necrosis factor-α and lymphotoxin α/β in immunity and autoimmunity
Arthritis Research, 4
T. Blumenthal (1998)
Gene clusters and polycistronic transcription in eukaryotes
BioEssays, 20
Joel Arrais, Nuno Rosa, J. Melo, E. Coelho, Diana Amaral, M. Correia, M. Barros, J. Oliveira (2013)
OralCard: a bioinformatic tool for the study of oral proteome.
Archives of oral biology, 58 7
Fernando Siso-Nadal, Julien Ollivier, Peter Swain (2007)
Facile: a command-line network compiler for systems biology
BMC Systems Biology, 1
Yong Yang, Hong Wang, D. Erie (2003)
Quantitative characterization of biomolecular assemblies and interactions using atomic force microscopy.
Methods, 29 2
Einat Sprinzak, H. Margalit (2001)
Correlated sequence-signatures as markers of protein-protein interaction.
Journal of molecular biology, 311 4
Jean-François Rual, K. Venkatesan, Tong Hao, T. Hirozane-Kishikawa, Amélie Dricot, Ning Li, G. Berriz, Francis Gibbons, Matija Dreze, Nono Ayivi-Guedehoussou, Niels Klitgord, Christophe Simon, M. Boxem, S. Milstein, Jennifer Rosenberg, D. Goldberg, Lan Zhang, Sharyl Wong, G. Franklin, Siming Li, J. Albala, Janghoo Lim, Carlene Fraughton, E. Llamosas, S. Cevik, C. Bex, Philippe Lamesch, R. Sikorski, J. Vandenhaute, H. Zoghbi, A. Smolyar, Stephanie Bosak, Reynaldo Sequerra, L. Doucette-Stamm, M. Cusick, D. Hill, F. Roth, M. Vidal (2005)
Towards a proteome-scale map of the human protein–protein interaction network
Nature, 437
C. Vecchiola, Suraj Pandey, R. Buyya (2009)
High-Performance Cloud Computing: A View of Scientific Applications
2009 10th International Symposium on Pervasive Systems, Algorithms, and Networks
Jianyu Zheng, Xiaoming Yang, Jennifer Harrell, S. Ryzhikov, Eun-Hee Shim, Karin Lykke-Andersen, N. Wei, Hong Sun, R. Kobayashi, Hui Zhang (2002)
CAND1 binds to unneddylated CUL1 and regulates the formation of SCF ubiquitin E3 ligase complex.
Molecular cell, 10 6
M. Ávila-Campos, G. Velásquez-Melendez (2002)
Prevalence of putative periodontopathogens from periodontal patients and healthy subjects in Sao Paulo, SP, Brazil.
Revista do Instituto de Medicina Tropical de Sao Paulo, 44 1
Yongqing Zhang, Danling Zhang, Gang Mi, Daichuan Ma, Gongbing Li, Yanzhi Guo, Meng-long Li, Min Zhu (2012)
Using ensemble methods to deal with imbalanced data in predicting protein-protein interactions
Computational biology and chemistry, 36
J. Barnard, R. Beauchamp, W. Russell, R. Dubois, R. Dubois, R. Coffey, R. Coffey (1995)
Epidermal growth factor-related peptides and their relevance to gastrointestinal pathophysiology.
Gastroenterology, 108 2
L. Nanni, A. Lumini (2006)
An ensemble of K-local hyperplanes for predicting protein–protein interactions
Bioinformatics, 22
J. Swets (1988)
Measuring the accuracy of diagnostic systems.
Science, 240 4857
Damian Szklarczyk, Andrea Franceschini, Michael Kuhn, M. Simonovic, Alexander Roth, Pablo Minguez, T. Doerks, M. Stark, J. Muller, P. Bork, L. Jensen, C. Mering (2010)
The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored
Nucleic Acids Research, 39
M. Cooper (2003)
Label-free screening of bio-molecular interactions
Analytical and Bioanalytical Chemistry, 377
Julie Morrison, R. Breitling, D. Higham, D. Gilbert (2006)
A lock-and-key model for protein-protein interactions
Bioinformatics, 22 16
E. Phizicky, S. Fields (1995)
Protein-protein interactions: methods for detection and analysis
Microbiological Reviews, 59
Stephen Littler, S. Hubbard (2005)
Conservation of orientation and sequence in protein domain--domain interactions.
Journal of molecular biology, 345 5
Sailu Yellaboina, A. Tasneem, D. Zaykin, B. Raghavachari, Raja Jothi (2010)
DOMINE: a comprehensive collection of known and predicted domain-domain interactions
Nucleic Acids Research, 39
C. Ouzounis, N. Kyrpides (1996)
The emergence of major cellular processes in evolution
FEBS Letters, 390
Takashi Ito, Tomoko Chiba, Ritsuko Ozawa, Mikio Yoshida, Masahira Hattori, Y. Sakaki (2001)
A comprehensive two-hybrid analysis to explore the yeast protein interactome
Proceedings of the National Academy of Sciences of the United States of America, 98
S. Rajasekaran, J. Merlin, V. Kundeti, Tian Mi, Aaron Oommen, Jay Vyas, Izua Alaniz, Keith Chung, Farah Chowdhury, Sandeep Deverasatty, Tenisha Irvey, David Lacambacal, D. Lara, Subhasree Panchangam, V. Rathnayake, Paula Watts, M. Schiller (2011)
A computational tool for identifying minimotifs in protein–protein interactions and improving the accuracy of minimotif predictions
Proteins: Structure, 79
Isabel Batty (1958)
Actinomyces odontolyticus, a new species of actinomycete regularly isolated from deep carious dentine.
The Journal of pathology and bacteriology, 75 2
Long Lu, Hui Lu, J. Skolnick (2002)
MULTIPROSPECTOR: An algorithm for the prediction of protein–protein interactions by multimeric threading
Proteins: Structure, 49
V. Mauch, M. Kunze, Marius Hillenbrand (2013)
High performance cloud computing
Future Gener. Comput. Syst., 29
D. Wisser, F. Wisser, S. Raschke, N. Klein, M. Leistner, J. Grothe, E. Brunner, S. Kaskel, T. Yamashita, P. Knipe, N. Busschaert, S. Thompson, A. Hamilton (2015)
Protein – Protein Interactions
A. Sarikas, T. Hartmann, Z. Pan (2011)
The cullin protein family
Genome Biology, 12
Joel Bock, D. Gough (2003)
Whole-proteome interaction mining
Bioinformatics, 19 1
Sheng-An Lee, Chen-hsiung Chan, Chi-Hung Tsai, J. Lai, Feng-Sheng Wang, Cheng-Yan Kao, Chi-Ying Huang (2008)
Ortholog-based protein-protein interaction prediction and its application to inter-species interactions
BMC Bioinformatics, 9
Juwen Shen, Jian Zhang, Xiaomin Luo, Weiliang Zhu, Kunqian Yu, Kaixian Chen, Yixue Li, Hualiang Jiang (2007)
Predicting protein–protein interactions based only on sequences information
Proceedings of the National Academy of Sciences, 104
S. Jaeger, S. Gaudan, U. Leser, D. Rebholz-Schuhmann (2008)
Integrating protein-protein interactions and text mining for protein function prediction
BMC Bioinformatics, 9
Shobhit Jain, Gary Bader (2010)
An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology
BMC Bioinformatics, 11
Xiaomei Wu, Lei Zhu, Jie Guo, Da‐Yong Zhang, Kui Lin (2006)
Prediction of yeast protein–protein interaction network: insights from the Gene Ontology and annotations
Nucleic Acids Research, 34
P. Uetz, L. Giot, G. Cagney, T. Mansfield, R. Judson, James Knight, D. Lockshon, Vaibhav Narayan, Maithreyan Srinivasan, P. Pochart, Alia Qureshi-Emili, Ying Li, B. Godwin, D. Conover, T. Kalbfleisch, G. Vijayadamodar, Meijia Yang, M. Johnston, S. Fields, J. Rothberg (2000)
A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae
Nature, 403
X-W Chen, M Liu (2005)
Prediction of protein–protein interactions using random decision forest framework
Bioinformatics, 21
T. Dandekar, B. Snel, M. Huynen, P. Bork (1998)
Conservation of gene order: a fingerprint of proteins that physically interact.
Trends in biochemical sciences, 23 9
A. Buret, D. Gall, M. Olson, J. Hardin (1999)
The role of the epidermal growth factor receptor in microbial infections of the gastrointestinal tract.
Microbes and infection, 1 13
Anton Enright, I. Iliopoulos, N. Kyrpides, C. Ouzounis (1999)
Protein interaction maps for complete genomes based on gene fusion events
Nature, 402
P. Aloy, R. Russell (2002)
Interrogating protein interaction networks through structural biology
Proceedings of the National Academy of Sciences of the United States of America, 99
Minghua Deng, Shipra Mehta, Fengzhu Sun, Ting Chen (2002)
Inferring domain-domain interactions from protein-protein interactions
Genome research, 12 10
D. Barker, M. Pagel (2005)
Predicting Functional Gene Links from Phylogenetic-Statistical Analyses of Whole Genomes
PLoS Computational Biology, 1
Rui-Sheng Wang, Yong Wang, Ling-Yun Wu, Xiang-Sun Zhang, Luonan Chen (2007)
Analysis on multi-domain cooperation for predicting protein-protein interactions
BMC Bioinformatics, 8

Publisher: Springer Journals
Copyright: Copyright © 2014 by Coelho et al.; licensee BioMed Central Ltd.
Subject: Life Sciences; Bioinformatics; Systems Biology; Simulation and Modeling; Computational Biology/Bioinformatics; Physiological, Cellular and Medical Topics; Algorithms
eISSN: 1752-0509
DOI: 10.1186/1752-0509-8-24
pmid: 24576332

Abstract

Background: The oral cavity is a complex ecosystem where human chemical compounds coexist with a particular microbiota. However, shifts in the normal composition of this microbiota may result in the onset of oral ailments, such as periodontitis and dental caries. In addition, it is known that the microbial colonization of the oral cavity is mediated by protein-protein interactions (PPIs) between the host and microorganisms. Nevertheless, PPIs of this kind remain largely undisclosed. To elucidate these interactions, we have created a computational prediction method that allows us to obtain a first model of the Human-Microbial oral interactome.

Results: We collected high-quality experimental PPIs from five major human databases. The obtained PPIs were used to create our positive dataset and, indirectly, our negative dataset. The positive and negative datasets were merged and used for training and validation of a naïve Bayes classifier. For the final prediction model, we used an ensemble methodology combining five distinct PPI prediction techniques, namely: literature mining, primary protein sequences, orthologous profiles, biological process similarity, and domain interactions. Performance evaluation of our method revealed an area under the ROC curve (AUC) value greater than 0.926, supporting our primary hypothesis, as no single set of features reached an AUC greater than 0.877. After subjecting our dataset to the prediction model, the classified result was filtered for very high confidence PPIs (probability ≥ 1 − 10⁻⁷), leading to a set of 46,579 PPIs to be further explored.

Conclusions: We believe this dataset holds not only important pathways involved in the onset of infectious oral diseases, but also potential drug targets and biomarkers. The dataset used for training and validation, the predictions obtained and the final network are available at http://bioinformatics.ua.pt/software/oralint.
Keywords: Protein-protein interactions, Oral interactome, Bayesian classification Background reveal high structural and physical-chemical affinity with The majority of gene products that crowd a living cell an associated degree of conservation. This is further evi- interact, at least transiently, with other protein molecules. denced by the fact that close protein homologs frequently Virtually all cellular events, such as signal transduction, interact in the same way [3-7]. With this in mind, we can intracellular transport, DNA replication, transcription, expect understanding of the human interactome to pro- translation, splicing, secretion, cell cycle control and inter- vide insight into physiopathological mechanisms [8]. mediary metabolism, are mediated by protein-protein in- Numerous experimental techniques have been explored teractions (PPIs) [1]. The same applies to host-pathogen to attain the human interactome: two-hybrid screening systems, where PPIs are essential in the establishment of [9,10], affinity purification mass spectrometry [11], DNA infection [2]. The binding domains of interacting proteins microarrays [12], protein microarrays [13-15], synthetic le- thality[16], phagedisplay[17], X-ray crystallography and nuclear magnetic resonance spectroscopy [18], fluorescence * Correspondence: [email protected] resonance energy transfer [19], surface plasmon resonance Department of Informatics Engineering (DEI), University of Coimbra, Coimbra, Portugal [20], atomic force microscopy [21], and electron micros- Centre for Informatics and Systems of the University at Coimbra (CISUC), copy [22]. These methods have major drawbacks that ren- University of Coimbra, Coimbra, Portugal der them non-applicable in large-scale PPI prediction, Full list of author information is available at the end of the article © 2014 Coelho et al.; licensee BioMed Central Ltd. 
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. Coelho et al. BMC Systems Biology 2014, 8:24 Page 2 of 12 http://www.biomedcentral.com/1752-0509/8/24 namely the amount of time, associated cost and minimal handling, and the minimal risk linked to its collection protein interaction network coverage per run. Additionally, for both medical staff and the patient, the reason for high-throughput approaches are also often associated with studying the oral cavity becomes clear [67]. low-specificity and large numbers of both false negatives As a result of this work, analysis of the resulting PPI and false positives [23]. Moreover, these techniques were network revealed some interesting features. Some of the developed to detect intra-species PPIs, which renders them PPIs involving the Rothia mucilaginosa microorganism sub-optimal in inter-species PPI identification. Still, experi- are very specific and relevant. Moreover, our method not mental methods remain the only viable methodology to val- only predicted new PPIs between periodontal pathogens idate PPIs. and the host, but also PPIs between different periodontal As an alternative to experimental methods, a wide range pathogens, suggesting a synergistic course of action. of computational approaches for the prediction of intra- species PPIs have been proposed. Computational methods Results can be categorized according to the types of information We conducted a series of pre-test analyses to assess the they analyze. One common approach consists of using performance of our model. Then, we proceeded to test our text mining to extract known PPIs from the biomedical lit- approach on high-quality experimental protein-protein erature [24]. 
Additionally, there are methods based on interaction (PPI) data collected from five databases. The genomic data (gene neighborhood [25-28], gene fusion selected databases exclusively contain manually curated [29,30], phylogenetic profiles [31-33], codon usage similar- PPI data. ity [34]), protein structure (homology-based method [35], threading-based method [36]), domain information (single Computational model for predicting the human-microbial domain pairs [37-41], multi-domain pairs [42,43]), protein interactome sequence [44-56], and Gene Ontology (GO) [57] annota- Figure 1 summarizes the procedure used to achieve the tion semantic similarity ([58-61]). In contrast, computa- model of the human-microbial interactome. The starting tional efforts to predict inter-species PPIs have been very point of this work is a set of 4,707 proteins identified by limited. Dyer et al. [2] combined domain information with proteomic studies as being present in the oral cavity and a maximum likelihood estimator algorithm [37], while available on the OralCard database [66,67]. Davis et al. [62] adapted an approach following the Since there is no well-established gold standard for threading-based method [36]. To provide a better predic- PPIs, we collected data from five databases containing tion, Tastan et al. [63] applied a method combining mul- high-quality experimentally determined interactions as tiple data sources, and used a random forest classifier to described further on in Methods. Extracted PPIs from predict interactions between HIV-1 virus and human pro- the five databases were merged, creating our gold stand- teins. Despite these advances, the interactomes of several ard of positive interactions. The gold standard of nega- species are still far from complete. 
Nonetheless, the results tive interactions was obtained by randomly pairing the of some of these studies provide great working knowledge protein list on the premise that all protein pairs pro- of the characteristics of protein and gene interaction net- duced must differ from those on the positive dataset. A works. For instance, the topological characteristics of pro- total of 18,371 positive and a similar number of negative tein interaction networks (PINs) have been proven to pairs were obtained. reflect the functionality of the interacting genes. This was Simultaneously, for each possible pair of proteins, we demonstrated in yeast, where essential genes were more constructed five clusters of features based on: (1) litera- likely to be well connected and globally centered in the ture; (2) primary protein sequence information; (3) ortho- PIN [64,65]. logous profiles; (4) biological process similarity, and; (5) Here we present a computational model to predict enriched conserved domain pairs. This was performed by inter-species PPIs within the human oral cavity, an en- accessing public databases, extracting, and then processing vironment particularly prone to bacterial colonization. the collected data. This is mostly due to the fact that human, microbial and The gold standard dataset was used to train a Naïve environmental factors interact in a dynamic equilibrium Bayes classifier and to perform further validations on the within the human oral cavity [66]. Determination of the final model. The classifier was then applied to the set of salivary interactome will clarify the role of saliva in oral all possible pairs of protein interactions. Finally, by ag- biology and enable the identification of disease bio- gregating all individual pairs of predicted interactions, markers. The presence of blood exudate proteins and the final network was obtained. 
exfoliated epithelial cells in saliva suggest it may be an alternative to blood as a diagnostic fluid in many in- Evaluating the reconstruction of the human interactome stances. Additionally, if we consider the systemic nature In this section, we evaluate the performance of the pro- of saliva, the ease and low cost associated with its posed method when applied to the set of human proteins Coelho et al. BMC Systems Biology 2014, 8:24 Page 3 of 12 http://www.biomedcentral.com/1752-0509/8/24 Figure 1 Workflow applied on the construction of the Human-microbial oral interactome”. It also contains footnote information: “a) the proteins identified on the oral proteome are obtained from the Oralcard database; b) the gold standard used for training and validation is obtained by combining the five most relevant curated protein interaction databases; c) for each protein interacting pair five clusters of features are constructed; d) the previously trained classifier is applied to each pair of interaction; and e) finally the interactome network is obtained by combining the individual pairs of proteins. from the gold standard. We performed a 5-fold cross- The best performance is achieved through the ensem- validation to assess the combined and individual contribu- ble of the five clusters, returning an area under the re- tions of the clusters of features. Table 1 shows the results ceiver operating characteristic (ROC) curve (AUC) of for the performance of each individual cluster while Table 2 0.926, a precision of 0.848 and a recall of 0.854. This re- presents the contribution of each cluster to the final classi- sult is above the performance of any individual feature fier by iteratively removing each cluster. 
and can only be achieved with the participation of all, meaning that all features are required and have a complementary contribution.

Table 1 Analysis of the prediction performance of individual features

Feature        AUC    CA     F1-score  Precision  Recall
+ Literature   0.781  0.722  0.723     0.721      0.726
+ Sequence     0.877  0.784  0.790     0.768      0.813
+ GO           0.817  0.742  0.748     0.735      0.760
+ COGs         0.663  0.652  0.537     0.806      0.402
+ DDIs         0.620  0.617  0.424     0.861      0.281
Final Model    0.926  0.850  0.851     0.848      0.854

For each line the metrics are obtained by considering only that cluster of features on the classifier. AUC, area under the receiver operating characteristic (ROC) curve; CA, classification accuracy.

Table 2 Analysis of the contribution to the overall performance of individual cluster of features

Feature        AUC    CA     F1-score  Precision  Recall
- Literature   0.919  0.841  0.841     0.841      0.841
- Sequence     0.891  0.794  0.774     0.855      0.708
- GO           0.916  0.838  0.839     0.835      0.842
- COGs         0.923  0.846  0.847     0.842      0.852
- DDIs         0.911  0.831  0.834     0.819      0.850
Final Model    0.926  0.850  0.851     0.848      0.854

For each line the metrics are obtained by removing that cluster of features from the classifier. AUC, area under the receiver operating characteristic (ROC) curve; CA, classification accuracy.

Sequence is both the feature with the best overall individual performance (AUC = 0.877) and the one that causes the most negative impact when removed from the classifier, making the AUC drop to 0.891. It also has a high recall of 0.813, partially because all protein sequences are recognized and therefore the feature has full coverage.

In contrast, the clusters of orthologous groups (COGs, with AUC = 0.663) and domain-domain interactions (DDIs, with AUC = 0.620) have the lowest individual AUCs, mainly due to the low coverage of their features. Despite that, they benefit from a considerably high precision that contributes positively to the final classifier. This is especially true for the COGs which, when removed, cause the major drop in precision.

The Literature and the Gene Ontology (GO) features, while not outstanding in any particular metric, perform consistently on almost all metrics. They nevertheless make a relevant contribution to the final classifier: removing Literature drops the AUC to 0.919, and removing GO drops it to 0.916.

Global characterization of the human-microbial interactome

The classifier returned a set of 1.9 million possible interactions with a probability higher than 0.5. This corresponds to an average degree of 404 interactions per protein, which is much above the range of 3 to 30 documented in previous studies [68]. Additionally, there are reports of yeast two-hybrid screenings, the most commonly used high-throughput experimental method, reaching false-positive rates of 70%. With this in mind, and in order to minimize the presence of false positives in our predicted interactome, we filtered our prediction results to consider only very high confidence PPIs (probability ≥ $1 - 10^{-7}$). We neglected the recall for the sake of precision. As can be observed in Figure 2, the cutoff of $1 - 10^{-7}$ is the lowest probability value where an increment does not imply a decrease in the number of interactions. This cutoff resulted in a total of 46,579 PPIs, with 37,407 being between human proteins, 6,394 between human and microbial proteins, and 2,778 between microbial proteins. The average number of protein interactions per protein after the cutoff was 8.

Figure 2 Plot with the relation of the number of interactions (y-axis) by classifier probability (x-axis).

Figure 3 is a visual representation of the interactions between the various organisms found in the oral cavity and the human host. Intra-species interactions are not shown. The thickness of the ribbons between each organism is correlated with the number of PPIs between both organisms, meaning that the organisms sharing the highest number of PPIs with the human are Rothia mucilaginosa, Leptotrichia buccalis, and Actinomyces odontolyticus (strain independent).

Figure 3 Representation of the human-microbial inter-species protein interactions. Each section represents an organism. The ribbons connecting any two sections symbolize the PPIs between two organisms. The thickness of each ribbon correlates with the number of PPIs between both organisms.

With the exception of Homo sapiens with 3,030 proteins, the most represented organisms in the human oral cavity are Rothia mucilaginosa (strain DY-18) (Stomatococcus mucilaginosus), Actinomyces odontolyticus (strain ATCC 17982), and Streptococcus salivarius (strain SK126), with 68, 54, and 26 proteins, respectively. These organisms are opportunistic pathogens known to be associated with periodontitis [69] and caries [70].

The most frequent biological processes are related to host-microbial interactions: GO:0044281 (small molecule metabolic process, involved in 173 PPIs), GO:0019048 (viral interaction with host, involved in 161 PPIs), and GO:0045087 (innate immune response, involved in 145 PPIs).

We also identified the top three human hub proteins present in our data: epidermal growth factor receptor (EGFR) (UniProt AC P00533, involved in 3,247 PPIs), fibronectin (UniProt AC P02751, involved in 3,143 PPIs), and cullin-associated NEDD8-dissociated protein 1 (CAND1) (UniProt AC Q86VP6, involved in 2,911 PPIs). In terms of non-human original hub-proteins, the most common are a
serine/threonine protein kinase from Leptotrichia buccalis (UniProt AC C7NEK0, involved in 258 PPIs), a kinase domain protein from Parvimonas micra ATCC 33270 (UniProt AC A8SM03, involved in 194 PPIs), and Ras-related protein SEC4 from Saccharomyces cerevisiae (UniProt AC P07560, involved in 163 PPIs).

Discussion

Functional analysis of the human-microbial interactome

Unsurprisingly, the most frequent GO biological processes in our final PPI dataset are associated with host-pathogen interactions. The preeminence of innate immune response and viral interaction with host as the most frequent biological processes is self-explanatory. However, the association between small molecule metabolism and host-microbial interactions is not so direct.

When faced with an infection, the body will respond by initiating two major cellular signaling pathways with opposing functions: the nuclear factor (NF)-κB and glucocorticoid-mediated signal transduction cascades. While the NF-κB pathway promotes the immune response and inflammation, the glucocorticoid-mediated signal transduction cascade suppresses it. In order to explain the association between small molecule metabolism and host-pathogen interactions we must focus on the NF-κB cascade, as it is known to mediate the transcriptional activation of several cytokines (cell-signaling molecules) involved in immunity [71]. Tumor necrosis factor (TNF)-α and TNF-β, two of these cytokines, play key roles in immune regulation and inflammation [72]. However, these cytokines are mainly responsible for the metabolic instabilities that occur during the infection, as they increase the metabolism of triglycerides inducing hyperlipidemia (escalation of blood lipid levels), stimulate lipolysis (degradation of lipids), accelerate glycogen breakdown and glucose consumption and uptake, and increase the serum levels of hormones that regulate glucose metabolism. These metabolic changes possibly explain the great number of “small molecule metabolic process” biological processes.

Analysis of hub proteins

The top three hub proteins identified share a common trait: they are exploited by pathogens in an attempt to gain entry to the host and survive inside it.

EGFR is a transmembrane protein mainly produced in the salivary glands and the kidneys [73]. Its association with microbial invasion has already been reported for Salmonella typhimurium [74], Candida albicans [75], Reovirus [76], and Vaccinia virus [77]. Apparently, all these pathogens initiate cellular invasion, at least to some extent, by binding to EGFR. This suggests the possibility that several other pathogens are using EGFR to start host colonization, as supported by Buret et al. [78].

Similarly to EGFR, fibronectin appears to also play the role of a “microbial-anchor”. This glycoprotein is found bound to the β integrins in the cell surface, and is generally seen as a key protein for bacterial adhesion within the oral cavity [79,80].

The CAND1 protein, formerly TIP120A, was found to interact with most of the proteins in the Cullin family [81]. The Cullin protein family plays a key role in the ubiquitination of cellular proteins, i.e. performing a post-translational modification in order to label the target protein with ubiquitin molecules. This labeling frequently results in the commitment of the ubiquitin-linked protein to proteasomal degradation [82]. Consequently, CAND1 was suggested to function as a global regulator of cullin-containing ubiquitin ligases [81,83]. Because CAND1 is one of the top hub proteins, we investigated the relationship between the ubiquitination pathway and pathogen colonization of the host cells. As expected, we found that certain bacteria corrupt the ubiquitination machinery as a means of regulating their virulence factors, or to trigger internalization of bacteria into host cells [84]. Such a mechanism improves the survival and replication chances of bacteria inside the host.

Study of the microbiome role in periodontitis

When the data analysis is focused on a particular disease such as periodontal disease, three main features can be observed: i) Rothia mucilaginosa, a microorganism present in the normal human oral microbiome but considered an opportunistic pathogen [85], is the species with the most interactions, with some of them revealing important and specific interactions; ii) new interactions are predicted between periodonto-pathogens and the host; and iii) interactions between periodonto-pathogens are also predicted, most likely explaining a synergistic course of action, as has been previously proposed [86].

Regarding the first observation, the sub-network pertaining to Rothia mucilaginosa shares the characteristics previously described for the hub proteins, with 37/638 interactions with the EGFR protein, 40/638 interactions with fibronectin and 34/638 interactions with CAND1. Furthermore, this sub-network presents two predicted interactions which have not been described before: R. mucilaginosa proteins D2NSF5 and C6R5R8 are predicted to interact with human immunoglobulin chains (P01719 and P01781). These interactions could be related to the immune response specific for this species, which is why they are worth investigating.

If we consider the bacteria most associated with periodontal disease, our model predicts few interactions between A. actinomycetemcomitans, P. gingivalis, and the host proteins. As mentioned before, this is due to the fact that these organisms are not well represented in the original protein data set. However, besides the interactions predicted between these bacteria and the human hub proteins described above, in the case of Porphyromonas gingivalis it is possible to identify at least two potentially interesting new types of interactions between bacterial ribosomal proteins and a major histocompatibility complex protein (P30461). Furthermore, we also identified a possible interaction between the bacterial enolase (Q7MTV8) and a host aquaporin which could interfere with the homeostasis mechanisms of the host. Additionally, when we consider the interactions of P. gingivalis with other bacteria, we find that the same enolase might interact with outer membrane proteins of Haemophilus influenzae and Pasteurella multocida. The role of bacterial enolase as a multitask protein involved not only in carbohydrate metabolism but also in virulence has been recognized recently [87].

This suggests that previously unknown and important PPIs for oral colonization and biofilm formation may be present in this dataset. Finally, it is interesting that there are possible interactions between P. gingivalis proteases and those of other periodonto-pathogens such as Kingella oralis and Treponema denticola. This may even shed some light on the synergistic aspects of oral biofilm in periodontal disease [86].

Conclusion

The continuous yield of large-scale data, mainly from microarrays and yeast two-hybrid studies, has made the study of PPIs very appealing. The main issue associated with PPI study is the high prevalence of false positives and negatives in experimental PPI data. Because experimental data are the only “reliable” source of PPIs, inaccurate experimental PPI data will contaminate training datasets and therefore compromise the performance of computational PPI prediction methods. For this reason, we believe that an improvement in the quality of experimental PPI data will greatly impact the performance of new computational PPI prediction approaches.
While this is not the case at present, we must consider how to avoid the effects of false positives and false negatives in the final PPI prediction model.

We proposed a probabilistic Bayesian-based method to integrate several data sources, to obtain more robust and reliable PPI predictions. By applying naïve Bayes, we automatically up-weigh the most informative features and down-weigh the less informative ones, allowing for automatic error-correction.

Our individual feature analysis results show a great relevance of the selected features. When applied on a naïve Bayes classifier, the individual features synergize, boosting the AUC up to 0.926. This suggests that the reliability of prediction improves with the increase of significant features, meaning that the ensemble final model actually reduces the disadvantages of the individual methods.

Cytoscape was successfully used to validate the network when tested with real pathway examples, discovering new potentially interesting interactions in oral biology, both between the host and the periodontal pathogens and between different periodontal pathogens.

We believe our work may be applied in several scientific areas, and even in other PPI related studies. An example is biomedical PPI screening, to assess if interactions of particular interest might occur and what the related interaction probability is. Another example is pharmacologic research, as a well-established PPI network can provide insights on potential drug targets, but also new uses for existing in-market drugs. Finally, and based on the fact that protein interaction networks are dynamic [88], our work can support researchers in identifying evolutionary patterns.

Methods

Oral proteome

As a starting point for our study we used 4,707 proteins, 3,500 from human and 1,207 from microbial sources, available on the OralCard database [66,67]. These proteins were identified via proteomic analysis of the saliva, frequently by using 2D electrophoresis/mass spectrometry or 2D liquid chromatography/mass spectrometry. By the end of 2012 the salivary proteome was determined to contain 3,500 proteins from human origin and 1,207 from microbial sources.

Predictive dataset construction

The use of positive (interacting pairs of proteins) and negative (non-interacting pairs of proteins) examples is required for training and assessing the performance of the classifier. All data used in the construction of the positive data set (PDS) and the negative data set (NDS) was downloaded in March 2013.

Positive dataset

We collected experimental oral protein-protein interaction (PPI) data from five databases: 14,139 PPIs from BioGRID [89], 254 PPIs from DIP [90], 3,555 PPIs from HPRD [91], 4,135 PPIs from IntAct [92], and 1,481 PPIs from MINT [93], totaling 23,564 protein interactions (Figure 4).

Figure 4 Venn diagram representing the intersections between the five high-quality experimentally determined protein-protein interaction databases.

All the interacting protein pairs were identified by their UniProtKB [94] Accession IDs for normalization purposes. In some instances it was necessary to convert the databases' own identifiers to UniProtKB Accession IDs. The BioGRID database represents interacting protein pairs using its own identifiers and Entrez Gene IDs. To match them to UniProtKB Accession IDs we extracted the Gene IDs from the protein pairs and downloaded the list of respective gene products in the UniProtKB Accession ID format. UniProtKB allows direct mapping from the MINT and DIP databases to another identifier. A list of PPI pairs from both databases was uploaded to the UniProtKB mapping feature, resulting in two different sets of UniProtKB Accession ID pairs. HPRD uses its own identification system coupled with NCBI Reference Sequence Accession IDs (RefSeq) to classify PPI pairs. All the RefSeq Protein IDs were converted to UniProtKB Accession IDs and paired accordingly. IntAct PPI pairs are identified with UniProtKB Accession IDs and were directly extracted.

PPI pairs from the five databases were merged and repeated entries were removed. From a total of 23,564 PPIs, 5,193 duplicated entries were removed, resulting in a PDS of 18,371 protein pairs.

Negative dataset

The selection of negative examples for the negative dataset was based on two methods described in the literature [95]. These methods consist of randomly selecting protein pairs that are not present in a veto list containing all PPIs from the positive data set. The use of this strategy was considered acceptable because the probability of committing an error while picking a random pair is low:

$$P(e) = \frac{NK}{N(N-1)} = \frac{K}{N-1}, \qquad (K \ll N) \Rightarrow P(e) \cong 0,$$

where N is the number of proteins and K is the average degree for the final PPI network. In this study N is 4,707, and for PPIs the typical value of K is between 6 and 16.

With this strategy we generated an NDS of a size similar to that of the PDS (18,348 “negative” protein pairs), and combined it with the PDS to obtain a training data set with 36,719 PPIs.
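The random selection of negative pairs against a veto list, and the error estimate P(e) = K/(N − 1) given under "Negative dataset", can be sketched as follows. This is a minimal illustration and not the authors' code: the function names `sample_negative_pairs` and `error_probability`, the fixed seed, and the toy identifiers in the usage note are assumptions introduced here.

```python
import random


def sample_negative_pairs(proteins, positive_pairs, n_pairs, seed=0):
    """Randomly pair proteins, rejecting any pair present in the veto list.

    Pairs are unordered, so (a, b) and (b, a) count as the same pair.
    """
    proteins = sorted(set(proteins))  # unique, deterministic order
    protein_set = set(proteins)
    veto = {frozenset(pair) for pair in positive_pairs}

    # Guard against an endless loop when too many pairs are requested.
    blocked = sum(1 for v in veto if len(v) == 2 and v <= protein_set)
    available = len(proteins) * (len(proteins) - 1) // 2 - blocked
    if n_pairs > available:
        raise ValueError("not enough non-vetoed pairs to sample from")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_pairs:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in veto:
            negatives.add(pair)
    return negatives


def error_probability(n_proteins, avg_degree):
    """P(e) = N*K / (N*(N-1)) = K / (N-1): chance a random pair truly interacts."""
    return avg_degree / (n_proteins - 1)
```

With N = 4,707 and K between 6 and 16, `error_probability` gives values between roughly 0.0013 and 0.0034, which is the sense in which P(e) is close to zero. Calling `sample_negative_pairs(["P1", "P2", "P3", "P4", "P5"], [("P1", "P2")], 4)` returns four unordered pairs, none of which is the vetoed pair.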
Feature construction

In this section we describe the procedure for construction of the five clusters of features. The final results are summarized in Table 3.

Table 3 Relative coverage of protein-protein interactions present in the training and test data by individual feature clusters

              Training data                 Classification
Feature       #Interactions  % of total     #Interactions  % of total
Literature    22,720         61.9%          4,698,390      69.9%
Sequence      35,379         96.4%          6,703,945      99.8%
GO            23,769         64.8%          5,130,103      76.4%
COGs          9,636          26.3%          1,324,230      19.7%
DDIs          5,994          16.3%          516,609        7.7%
Total         36,698         100.0%         6,716,792      100.0%

GO, gene ontology; COGs, clusters of orthologous groups; DDIs, domain-domain interactions.

Literature

The literature-based protein-protein interaction scores were calculated by the method described in van Haagen et al. [96]. This method is based on comparing the semantic contexts in which two proteins are mentioned in the published literature. The rationale for the method is that two proteins occurring in similar contexts will have a higher similarity score and are therefore more likely to interact. The semantic context for a given protein is defined by the concepts, from a pre-defined vocabulary, that are frequently mentioned in the same articles with that protein, and is represented by a vector containing a weight for each concept. These weights are based on the co-occurrence statistics, and measure the degree of association between the protein and each concept. Following Jelier et al. [97], we use the symmetric uncertainty coefficient $U(X_i, Y_j)$, where $X_i$ is in this case the protein of interest and $Y_j$ is any other concept in the vocabulary, as the weights used for creating the concept profiles:

$$U(X_i, Y_j) = 2\,\frac{H(X_i) + H(Y_j) - H(X_i, Y_j)}{H(X_i) + H(Y_j)},$$

where $H(X)$ is the entropy for $X$ and $H(X, Y)$ is the joint entropy for $X$ and $Y$, calculated based on document frequency counts.

We used a corpus of nearly one million abstracts, obtained by searching PubMed with 17,402 names and synonyms extracted from UniProtKB for 4,707 proteins in the dataset, after removing nonsensical names such as “uncharacterized protein”. To identify the concepts mentioned in the texts we used Gimli [98], a machine-learning tool for gene and protein name recognition, together with dictionary matching to recognize other concepts from ten different semantic types including chemical entities, anatomical terms, diseases, pathways and GO terms. The dictionaries used contain around 1.3 million distinct names for around 400 thousand concepts. Based on the concept annotation of this corpus, we were able to calculate concept profiles for 22,720 protein pairs from the training dataset and 4,698,390 protein pairs for the classification dataset.

Primary protein sequence information

Several studies have been carried out where detection of protein-protein interaction is derived from information directly extracted from the amino-acid sequences [44-56]. The results indicate that the sequence information alone is sufficient to detect PPIs with reasonable accuracy [87] but may be improved if combined with other strategies.

Taking into account the primary protein sequence information, the following features have been considered in this work: occurrence of the 20 amino-acids in the protein sequence, protein atomic composition, molecular weight and atomic weight, forming a vector of 27 features. The interacting protein pair (X, Y) is represented by concatenating the corresponding feature vectors $F_x$ and $F_y$, giving $(F_x, F_y)$.

We were able to obtain the sequence profile for 35,379 protein pairs from the training dataset and 6,703,945 protein pairs for the classification dataset.

Orthologous profiles

By definition, clusters of orthologous groups (COGs) are sets of orthologous genes or orthologous groups of paralogs from three or more phylogenetic lineages. In essence, this means that two proteins from different lineages belonging to the same COG are orthologous. Orthologs are genes in different species that evolved from a common ancestor by speciation (i.e. divergent evolution). In contrast, paralogs are genes related by duplication within a genome [99].

Lee et al. [100] aimed to expand the interactomes of various organisms by applying orthology-based methods in inter-species PPI prediction. They expanded orthologous pairs of 18 eukaryotic organisms and merged them with experimental PPI datasets, allowing the inference of PPIs for various species.

In this work we used the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) [101] database to obtain COGs and their respective combined scores. The combined score is computed by integrating the likelihoods from the different types of evidence, correcting for the probability of randomly observing an interaction [101]. This enhances the predictive performance of the method, as a combined score is only computed when more than one of the data sources in STRING supports a given association.

We were able to obtain the orthologous profile for 9,636 protein pairs from the training dataset and 1,324,230 protein pairs for the classification dataset.

Biological process similarity

Previous studies have explored the use of GO annotation similarity between two proteins as a PPI predictor [59,102-105]. We downloaded biological process information from the GO Consortium [57] in March 2013 and calculated the depth of the GO terms (nodes) in the Directed Acyclic Graph (DAG), and the total number of proteins comprised between the smallest shared biological process (SSBP) for each pair of proteins and the following three branches. Since the depth of the GO terms in the DAG is implied in the total number of proteins, post-test odds analysis was performed solely on this feature to avoid redundancy. Such an approach was based on the general hypothesis that the fewer proteins a biological process involves, the more likely it is that the proteins within that process interact.

We were able to obtain the gene ontology profile for 23,769 protein pairs from the training dataset and 5,130,103 protein pairs for the classification dataset.

Enriched conserved domain pairs

The Database of Protein Domain Interactions [106] (DOMINE) contains binary domain-domain interaction (DDI) data compiled from a collection of 15 databases and DDI prediction methods. Additionally, DOMINE provides a quality measure of the DDI confidence, as well as a binary classification of whether the domains are part of the same GO biological process. Here, we assume that whenever two given proteins possess one or more interacting domains between them, those proteins will interact. We adopted this DDI data collection as individual features in our approach. Since DOMINE provides DDI information from several sources, we tallied the number of sources that identified a DDI. This strategy confers higher reliability on DDI pairs with higher scores (closer to 15, the maximum number of DDI sources).

We were able to obtain the domain profile for 5,994 protein pairs from the training dataset and 516,609 protein pairs for the classification dataset.

Data classification and validation

The proposed approach was developed, tested, optimized and run using Orange, an open-source bioinformatics tool featuring Python scripting and a visual and programmatic interface. We used the naïve Bayes [107] classifier to predict PPIs in our data. The naïve Bayes classifier calculates the conditional probability of each attribute $A_i$ given the class label $C$, from the training data. The Bayes rule is then applied to calculate the posterior probability of $C$.
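The naïve Bayes computation described under "Data classification and validation" can be sketched as follows: estimate P(A_i | C) from training counts, then combine the estimates with the class prior through the Bayes rule. This is a minimal sketch under stated assumptions and not the Orange implementation used in the study: attributes are taken to be categorical, Laplace smoothing (`alpha`) is added here to avoid zero probabilities, and the names `train_naive_bayes` and `posterior` are hypothetical.

```python
from collections import Counter


def train_naive_bayes(rows, labels, alpha=1.0):
    """Count class frequencies and per-class attribute value frequencies."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # value_counts[c][i][v] = number of class-c rows whose attribute i equals v
    value_counts = {c: [Counter() for _ in range(n_features)] for c in class_counts}
    values = [set() for _ in range(n_features)]  # distinct values per attribute
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
            values[i].add(v)
    return {
        "class_counts": class_counts,
        "value_counts": value_counts,
        "values": values,
        "alpha": alpha,
    }


def posterior(model, row):
    """Return P(C | A_1, ..., A_n) for every class C via the Bayes rule."""
    total = sum(model["class_counts"].values())
    alpha = model["alpha"]
    scores = {}
    for c, n_c in model["class_counts"].items():
        p = n_c / total  # prior P(C)
        for i, v in enumerate(row):
            k = len(model["values"][i])
            # smoothed estimate of P(A_i = v | C)
            p *= (model["value_counts"][c][i][v] + alpha) / (n_c + alpha * k)
        scores[c] = p
    z = sum(scores.values())  # normalise so the posteriors sum to 1
    return {c: s / z for c, s in scores.items()}
```

A pair would then be assigned to the class with the largest posterior, for example `max(post, key=post.get)` where `post = posterior(model, row)`. Multiplying the per-attribute terms relies on the conditional-independence assumption that gives the classifier its name.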
given the specific instance of A ,…, A , and then assessing 1 n the class with the greatest posterior probability, ensuing classification [108]. Acknowledgements This work has received support from the RD-CONNECT project (EC contract The receiver operating characteristic (ROC) curve, number 305444). Edgar D. Coelho is funded by Fundação para a Ciência e which is the plot of the true positive (TP) rate with the Tecnologia, FCT, under Grant SFRH/BD/86343/2012. false positive (FP) rate, depicting the relative trade-off be- Author details tween both rates [109] was used to evaluate the method’s Department of Electronics, Telecommunications and Informatics (DETI), performance. When comparing classifiers with very simi- Institute of Electronics and Telematics Engineering of Aveiro (IEETA), lar ROC curves, it may be necessary to estimate a single University of Aveiro, Aveiro, Portugal. Department of Informatics Engineering (DEI), University of Coimbra, Coimbra, Portugal. Centre for scalar value to represent the expected performance. One Informatics and Systems of the University at Coimbra (CISUC), University of of the most common methods is calculation of the area Coimbra, Coimbra, Portugal. Department of Informatics Engineering and under the ROC curve (AUC) [110], which we used to Systems, Polytechnic Institute of Coimbra, Engineering Institute of Coimbra (IPC-ISEC), Coimbra, Portugal. Department of Health Sciences, Institute of compare the naïve Bayes classifier. Therefore, we assessed Health Sciences, The Catholic University of Portugal, Viseu, Portugal. Centre the individual contributions of each feature in terms of for Neurosciences and Cell Biology, University of Coimbra, Coimbra, Portugal. classification accuracy (CA), area under curve (AUC), F1- Received: 27 August 2013 Accepted: 17 February 2014 score, precision and recall. Published: 27 February 2014 Interactome analysis We used Cytoscape to visualize and validate the ob- References 1. 
Phizicky EM, Fields S: Protein-protein interactions: methods for detection tained PPI network. The PPIs were classified as “HU- and analysis. Microbiol Rev 1995, 59:94–123. MAN-HUMAN”, if the interacting proteins were only of 2. Dyer MD, Murali TM, Sobral BW: Computational prediction of host- human origin, as “MICRO-MICRO”, if the interacting pathogen protein–protein interactions. Bioinformatics 2007, 23:i159–i166. 3. Littler SJ, Hubbard SJ: Conservation of orientation and sequence in proteins were only of microbial origin, or as “HUMAN- protein domain–domain interactions. J Mol Biol 2005, 345:1265–1279. MICRO”. 4. Valdar WS, Thornton JM: Protein-protein interfaces: analysis of amino acid We imported the network data to Cytoscape defining conservation in homodimers. Proteins 2001, 42:108–124. 5. Aloy P, Ceulemans H, Stark A, Russell RB: The relationship between the two proteins in the same interacting protein pair as sequence and interaction divergence in proteins. J Mol Biol 2003, Source Interaction (protein one) and Target Interaction 332:989–998. (protein two). The chosen Interaction Type was the 6. Teichmann SA: The constraints protein-protein interactions place on sequence divergence. J Mol Biol 2002, 324:399–407. above-mentioned organism-organism classification. A 7. Panchenko AR, Wolf YI, Panchenko LA, Madej T: Evolutionary plasticity of file containing node attributes was also imported, con- protein families: coupling between sequence and structure variation. taining microorganism and biological process informa- Proteins 2005, 61:535–544. 8. Rual J-F, Venkatesan K, Hao T, Hirozane-Kishikawa T, Dricot A, Li N, Berriz GF, tion extracted from the UniProt database pertaining to Gibbons FD, Dreze M, Ayivi-Guedehoussou N, Klitgord N, Simon C, Boxem each individual protein in the network. 
Availability

All data required to analyze the results and re-run this experiment are available for download at http://bioinformatics.ua.pt/software/oralint. This includes the unique list of UniProt AC for the proteins in the oral cavity, the gold standard of interactions, the dataset used for training and validation, the predictions obtained, and the Cytoscape project file with the network.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

EDC participated in the design of the study, constructed the positive and negative datasets, performed the analysis of hub proteins, characterized and analysed the human-microbial interactome, and drafted the manuscript. JPA conceived the study, participated in its design, performed feature construction and selection, parameterized the classifier, and helped to draft the manuscript. SM performed the text-mining analysis and helped to draft the manuscript. CP carried out primary protein sequence analysis and helped to draft the manuscript. NR and MJC analysed the role of the microbiome in periodontitis and helped to draft the manuscript. MB and JLO participated in the design and

doi:10.1186/1752-0509-8-24
Cite this article as: Coelho et al.: Computational prediction of the human-microbial oral interactome. BMC Systems Biology 2014 8:24.

Journal

BMC Systems Biology – Springer Journals

Published: Feb 27, 2014
