A lock-and-key model for protein–protein interactions

BIOINFORMATICS ORIGINAL PAPER
Vol. 22 no. 16 2006, pages 2012–2019
doi:10.1093/bioinformatics/btl338

Systems biology

Julie L. Morrison^1,2,*, Rainer Breitling^3, Desmond J. Higham^2 and David R. Gilbert^1

^1 Bioinformatics Research Centre, Department of Computing Science, University of Glasgow, G12 8QQ, UK, ^2 Department of Mathematics, University of Strathclyde, G1 1XH, UK and ^3 Groningen Bioinformatics Centre, University of Groningen, Kerklaan 30, 9751 NN Haren, Netherlands

Received on February 20, 2006; revised on April 26, 2006; accepted on June 15, 2006
Advance Access publication June 20, 2006
Associate Editor: Charlie Hodgman

*To whom correspondence should be addressed.

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

ABSTRACT

Motivation: Protein–protein interaction networks are one of the major post-genomic data sources available to molecular biologists. They provide a comprehensive view of the global interaction structure of an organism's proteome, as well as detailed information on specific interactions. Here we suggest a physical model of protein interactions that can be used to extract additional information at an intermediate level: It enables us to identify proteins which share biological interaction motifs, and also to identify potentially missing or spurious interactions.

Results: Our new graph model explains observed interactions between proteins by an underlying interaction of complementary binding domains (lock-and-key model). This leads to a novel graph-theoretical algorithm to identify bipartite subgraphs within protein–protein interaction networks where the underlying data are taken from yeast two-hybrid experimental results. By testing on synthetic data, we demonstrate that under certain modelling assumptions, the algorithm will return correct domain information about each protein in the network. Tests on data from various model organisms show that the local and global patterns predicted by the model are indeed found in experimental data. Using functional and protein structure annotations, we show that bipartite subnetworks can be identified that correspond to biologically relevant interaction motifs. Some of these are novel and we discuss an example involving SH3 domains from the Saccharomyces cerevisiae interactome.

Availability: The algorithm (in Matlab format) is available (see http://www.maths.strath.ac.uk/aas96106/lock_key.html)

Contact: [email protected]

Supplementary information: Supplementary data are available at http://www.maths.strath.ac.uk/aas96106/lock_key.html.

1 INTRODUCTION

The vast growth in availability of high-throughput protein–protein interaction datasets is widely documented (Bork et al., 2004) and has been accompanied by discussion emphasising the high error rates within such datasets. This combination necessitates the development of robust analytical techniques to gain knowledge about the resultant protein–protein interaction networks (Edwards et al., 2002). Graph-theoretic tools have proved successful, although they have largely focussed on global rather than local properties of the networks (Salwinski and Eisenberg, 2003). By modelling interactions based on local properties of the proteins we gain reasoning behind the occurrence of interactions, which may help in identifying false-positive and false-negative interactions. Also, understanding the interactions at a local level allows us in turn to make inferences about the global network topology.

The essence of our approach to modelling and thereby gaining further insight into the local and global structure of protein–protein interaction networks is the idea of lock-and-key domains. Physical interactions between protein domains are responsible for the interactions between proteins. Thus, modelling interaction networks in terms of the domains that each protein contains is biologically justified. The lock-and-key structure defines interactions to be observed, with some probability, between proteins which contain complementary domains (lock and key). This results in a network composed of near complete bipartite subgraphs. The algorithm designed in this lock-and-key framework is intended for application on networks derived from experiments where interactions are observed in a pairwise fashion, such as yeast two-hybrid data (Y2H). For the purpose of this paper we use the term 'domain' in the broadest possible sense. Common lock domains (as well as key domains) can be equivalent interaction surfaces, without being evolutionary homologues and even without a strict requirement for similarity and exact definition at the structural level.

This modelling approach has greater biological grounding than previous attempts that model protein–protein interaction networks with off-the-shelf classes of random graph. In particular, it had been widely believed that the degree distribution of protein–protein interaction networks followed a power-law, indicating a scale-free structure (Jeong et al., 2000). There is mounting evidence to suggest that this is not the case (Pržulj et al., 2004; Khanin and Wit, 2006), so simply fitting a scale-free model to the data is not a valid approach.

The use of protein domains to validate protein–protein interactions is growing. For example, a statistical method developed by Riley et al. (2005) can be used to verify known domain–domain interactions, identify highly specific domain–domain interactions and find domain–domain interactions involving domains of unknown function. The novelty of our approach is that domain information is identified from interaction data alone.

Assuming the lock-and-key interaction structure, we define a mathematical model and a subsequent algorithm that allows us to extract domain information about each protein in the network. This approach is verified on synthetic data generated using the lock-and-key definition. We also demonstrate that the approach is robust to the introduction of false positive and false negative interactions. We then identify a number of interaction structures indicating a lock-and-key pattern in real interactomes across a wide range of species and provide biological interpretations for some of these structures.

2 METHODS

2.1 Data

The mathematical model on which we base our analysis describes pairwise interactions of proteins, rather than agglomerates or large complexes. This corresponds most closely to the experimental situation prevailing in Y2H experiments. Y2H interactions were obtained from BIND, the Biomolecular Interaction Network Database Version 3.8 (June 20, 2005) (Alfarano et al., 2005). In an attempt to cover as broad a range of species as possible, networks were constructed for all species for which >500 interactions had been reported. These were Helicobacter pylori, Arabidopsis thaliana, Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, Mus musculus and Homo sapiens. In the networks each node represents a protein and each protein–protein interaction is represented by an edge. For yeast (S.cerevisiae) we also examined networks corresponding to the classical Y2H studies by Ito et al. (2001) and Uetz et al. (2000). Further details of these networks are given in Supplementary Material.

The presence of noise in high-throughput protein–protein interaction datasets is widely known [it has been suggested that between 30–50% of high-throughput interactions are biologically relevant (Bader et al., 2004)] and, thus, we understand that the datasets are far from complete. Despite the presence of false positives and false negatives, our aim is to produce a robust algorithm which will identify any bipartite structures if they exist in the available data.

As a negative control, we also analysed the yeast dataset described in von Mering et al. (2002), which combines interactions observed for yeast using a number of experimental techniques, including mass spectrometry, that do not identify pairwise binding. As data of this type do not conform to our model, we do not expect to find strong bipartite patterns. Protein domain and function annotations were extracted from annotation files obtained from Affymetrix.

2.2 Interaction model and analytical algorithm

We propose a model for protein–protein interaction networks that reflects the manner in which proteins bind to each other in experiments such as Y2H assays. The model is based on a lock-and-key principle, where proteins interact only if one protein contains the 'lock' aspect of some interaction surface, and the other protein contains the matching 'key'. We also assume that an interaction will be observed between such a pair of proteins with some probability $0 \le \theta \le 1$. An immediate consequence of these assumptions is the prediction that there will exist nearly complete bipartite subgraphs within the protein–protein interaction networks, i.e. two groups of proteins with little or no intra-group connections but strong inter-group connections. This idea of 'complementary domains' was introduced in Thomas et al. (2003). In that work, domains were assigned at random in order to develop a random graph model that matched the degree distribution of experimental datasets. In our work, we impose further assumptions and develop a technique for identifying domains. Thus, unlike in Thomas et al. (2003), our aim is to extract information from the network. We note that there can be any number of lock-and-key pairs within a protein–protein interaction network and thus the network may consist of many bipartite subgraphs. In our analysis, we focus on identifying proteins associated with one specific lock-and-key pair at a time. We point out that the choice of which element to call a lock and which a key is arbitrary.

We introduce the following notation. Let $A \in \mathbb{R}^{N \times N}$ denote the adjacency matrix of the graph, where $a_{ij} = a_{ji} = 1$ if proteins $i$ and $j$ interact and $a_{ij} = a_{ji} = 0$ otherwise. Focussing on a particular lock/key combination, we define an indicator vector $u \in \mathbb{R}^{N}$ such that

\[
u_i = \begin{cases} a, & \text{if protein } i \text{ has lock} \\ b, & \text{if protein } i \text{ has key} \\ 0, & \text{otherwise.} \end{cases}
\]

Here $a$ and $b$ are real numbers that will be specified later.

In order to get a clean mathematical analysis, we make the following simplifying assumptions (justification for these is given in the next paragraph).

(1) For this lock-and-key combination, any protein which contains the lock/key
    (a) will not also contain the key/lock, and
    (b) will only interact with a protein containing the complementary key/lock (it will not contain any other locks or keys).
(2) For each protein having this lock or key, owing to experimental constraints only a fixed proportion, $\theta$, of its connections with the matching key/lock will be recorded.

The first assumption ensures that the bipartite subgraph under consideration is isolated from the rest of the network. Note that we are not placing restrictions on the interactions of proteins in the remainder of the network. The second assumption is a type of mean-field approximation: individual proteins in the subgraph connect with the average frequency of the ensemble. Although these are clearly idealizations, we demonstrate below that the main features from our analysis are robust to the presence of multidomain proteins and varying connectivity frequency.

If we let locksum and keysum denote the total number of proteins that contain the one particular lock or key under investigation, our assumptions imply that the $i$-th component of the matrix-vector product $Au$ is given by

\[
(Au)_i := \sum_{j=1}^{N} a_{ij} u_j = \begin{cases} b \cdot \theta \cdot \mathrm{keysum}, & \text{if protein } i \text{ has lock} \\ a \cdot \theta \cdot \mathrm{locksum}, & \text{if protein } i \text{ has key} \\ 0, & \text{otherwise.} \end{cases}
\]

If $b\,\theta\,\mathrm{keysum} = \lambda a$ and $a\,\theta\,\mathrm{locksum} = \lambda b$, for some value $\lambda$, then we have $(Au)_i = (\lambda u)_i$. In this case, $u$ is an eigenvector of $A$ with eigenvalue $\lambda$. These constraints give $a^2/b^2 = \mathrm{keysum}/\mathrm{locksum}$. Ignoring trivial re-scalings, this results in two distinct solutions, $\lambda = \pm\theta\sqrt{\mathrm{keysum} \cdot \mathrm{locksum}}$. Thus, we predict that the matrix $A$ will have a pair of eigenvalues $\pm\theta\sqrt{\mathrm{keysum} \cdot \mathrm{locksum}}$ with corresponding eigenvectors whose non-zero components take only two possible values: one value $\pm\sqrt{\mathrm{keysum}}$ and the other value $\pm\sqrt{\mathrm{locksum}}$.

In other words, if we let $\mathrm{ind}^{[\mathrm{lock}]}$ and $\mathrm{ind}^{[\mathrm{key}]}$ be indicator vectors for the lock and key, so that

\[
\mathrm{ind}^{[\mathrm{lock}]}_i = \begin{cases} 1, & \text{if protein } i \text{ has lock} \\ 0, & \text{otherwise} \end{cases}
\]

and similarly for $\mathrm{ind}^{[\mathrm{key}]}$, then the two eigenvectors have the form

\[
u^{[a]} = a\,\mathrm{ind}^{[\mathrm{lock}]} + b\,\mathrm{ind}^{[\mathrm{key}]}, \qquad u^{[b]} = -a\,\mathrm{ind}^{[\mathrm{lock}]} + b\,\mathrm{ind}^{[\mathrm{key}]}.
\]

Hence $u^{[a]} + u^{[b]} = 2b\,\mathrm{ind}^{[\mathrm{key}]}$ and $u^{[a]} - u^{[b]} = 2a\,\mathrm{ind}^{[\mathrm{lock}]}$, so the sum and difference of the eigenvectors reveal which proteins have the lock and which have the key. We will refer to these vectors as the Sum and Difference vectors, and they form the basis of our algorithm to determine domain information.

Since the model involves a number of simplifying assumptions, we expect equalities to become approximations for real data. Fortunately, symmetric matrices have well-conditioned eigenvalues and eigenvectors (Golub and Van Loan, 1996), and hence the predictions from the model are likely to carry through when the idealised adjacency matrix undergoes perturbations. Supporting tests are carried out below.

3 RESULTS AND DISCUSSION

Using synthetic data generated under the lock-and-key principle, we first show that the eigenvalues and eigenvectors continue to hold useful information when the simplifying assumptions are relaxed. This leads to the development of an algorithm, which we then test on experimental datasets.

3.1 Synthetic data

For our first test case we consider the network shown in Figure 1. This network has three interaction types, with a total of six lock and key domains. The protein labelled 6 contains two interaction domains, violating one of our simplifying assumptions. Motivated by our analysis, we first calculate the eigenvalues of the corresponding adjacency matrix and look for pairs of the form $\pm\lambda$. We find that there exist two eigenvalues of ±2. By taking the sum and the difference of the eigenvectors corresponding to these eigenvalues, we are able to identify the proteins in the network with the lock and the key of one particular interaction type. From Figure 2a, we can see that only two non-zero values exist in the sum and difference vectors. These correspond directly to the proteins labelled 18, 19, 20 and 21, which have the lock and key of the third interaction type. Despite the fact that protein number 6 contains two sites belonging to different domains, we still have two remaining eigenvalue pairs of ±3.76 and ±5.18. For the second pair, the Sum and Difference vectors provide domain classification for both remaining domains (Fig. 2b). The non-zero values in the Sum vector correspond to the proteins that contain the key of the first and second interaction types, whereas the non-zero values in the Difference vector correspond to the proteins that have the lock of the first and second interaction types, and also to the single protein that contains two domains. (Of course, all protein and domain numbering is purely arbitrary and is only for reference purposes.)

Fig. 1. Synthetic Network with $\theta = 1$.

Fig. 2. Eigenvectors of the Synthetic Network. (a) Sum and Difference vectors for $\lambda = \pm 2$. (b) Sum and Difference vectors for $\lambda = \pm 5.18$.

We now test our algorithm's ability to recover the correct domain information when the network above is altered so that only 80% of the possible links are observed ($\theta = 0.8$). We find that we can still classify the eigenvalues into three pairs of ±1.62, ±3.17 and ±4.34. We first examine the Sum and Difference vectors corresponding to the ±1.62 pair (Fig. 3a). Although these vectors do not show equal non-zero components, extracting any non-zero components from both vectors leads to the two groups containing the key and lock aspects of the third interaction type. Determining domain information about the remaining two interaction types is less straightforward, but can be done with either of the two remaining eigenvector pairs. Using the Sum and Difference vectors corresponding to the ±4.34 eigenvalue pair, from Figure 3b we see that proteins may be assigned to the lock/key domain if their associated component is greater than some threshold. If 0.4 is chosen as the threshold, we find all proteins that contain the lock/key aspect of the first interaction surface, excluding the single protein that contains two interaction domains. All remaining non-zero components identify the second interaction type.

Fig. 3. Eigenvectors of the Synthetic Network with $\theta = 0.8$. (a) Sum and Difference vectors for $\lambda = \pm 1.62$. (b) Sum and Difference vectors for $\lambda = \pm 4.34$.

Based on the idea of assigning proteins to domains if their corresponding component in either vector is above a threshold value, the following pseudo code describes our algorithm.

    Calculate eigenvalues/vectors of adjacency matrix
    Group eigenvalues into pairs of the form ≈ ±λ
    For each eigenvalue pair (with eigenvectors u_a and u_b)
        Construct Sum = u_a + u_b and Diff = u_a − u_b
        Sort Sum and Diff by decreasing magnitude
        Identify a threshold for each vector
        Assign components of Sum above threshold to lock
        Assign components of Diff above threshold to key
    end

As a measure of how well the algorithm performs, a bipartite graph with 15 locks and 20 keys was embedded within a random network with a total of 50 nodes. For both vectors, we measured the area under the receiver operating characteristic curve (AUC) (see Bamber, 1975; Gribskov and Robinson, 1996) when both vectors were ordered by decreasing order of magnitude. We predict that the proteins containing the lock/key should be ordered at the top of the Sum/Difference vectors. This analysis was conducted for values of $0.1 \le \theta \le 1$ and averaged over 200 runs for each $\theta$. Decreasing $\theta$ is equivalent to increasing the false-negative rate of recording interactions in the network. We also varied the false-positive rate, defined as the percentage of interactions wrongly predicted. These were introduced randomly across the entire network. Figure 4 shows the AUC against $\theta$ for three false-positive rates.

Fig. 4. ROC analysis of algorithm. (a) False positive = 0%. (b) False positive = 20%. (c) False positive = 40%.

We see from Figure 4 that in all cases where $\theta > 0.7$, the ordered vectors correctly predict the domain structure (AUC = 1). For values of $\theta > 0.4$, the sorted vectors still produce highly accurate information (AUC > 0.9). At a high false-positive rate and lower values of $\theta$, the domain prediction should be treated with caution although performance is still much better than random and we could still expect to obtain useful information from the ordered vectors.

To further evaluate our algorithm, we used a list of domain–domain interactions observed in PDB structures obtained from the 3DID database (http://3did.embl.de/). The list of observed domain–domain interactions is accompanied by experimentally observed protein–protein interactions which support the known domain–domain interaction. Initially a network was constructed using these protein–protein interactions; however, the algorithm was unable to identify any bipartite structures within the data. This is not surprising, as the data are based on sparse observations scattered across a wide range of organisms and thus do not provide a sufficiently accurate sample of any complete protein–protein interaction network. Also, the data are biased towards intra-protein interactions, which our approach is not designed to detect. As an alternative evaluation based on known domain information, we combined the domain–domain interaction data from 3DID with Pfam domain assignments (Bateman et al., 2004) for the yeast proteome. This was used to construct an interaction network of yeast proteins where interactions were included between any two proteins that contained a Pfam domain pair known to interact.

To measure the ability of the algorithm to check for a large number of bipartite subgraphs, an automated approach to finding lock-and-keys in the network was developed (further details can be found in the Supplementary Material). For each pair of Sum and Difference vectors found for this network, the 'quality' of the bipartite subgraph found was measured. To do this, all domains present in the lock and key proteins were found, and from all possible domain pairs, those which were used to construct the network were determined. For each domain pair, the total number of proteins containing each domain is known, and thus the proportion of those found in the bipartite subgraph can be found. The proportions for each domain pair were summed, and divided by the total number of domain pairs; this was the measure of quality of the subgraph. This measure was calculated for the exact network, and also perturbed networks where false-positive and false-negative interactions were randomly introduced. Results are given in Table 1. For comparison, we tested a random allocation of proteins to lock and key groups of equal size to those identified by the algorithm. The measure in this case produces a value of 0.01. We conclude that the algorithm is able to identify bipartite structures in data where they exist.

Table 1. Measure of quality of bipartite subgraphs found in network constructed from domain–domain information [FP-rate (down) versus FN rate (across)]

    FP \ FN   0%    10%   20%   30%   40%   50%   60%   70%   80%   90%
    0%        0.66  0.56  0.57  0.54  0.53  0.50  0.49  0.47  0.48  0.46
    10%       0.50  0.47  0.45  0.44  0.41  0.40  0.37  0.39  0.36  0.34
    20%       0.41  0.38  0.38  0.34  0.34  0.31  0.30  0.29  0.27  0.26
    30%       0.35  0.33  0.31  0.28  0.28  0.26  0.27  0.25  0.22  0.22
    40%       0.31  0.29  0.27  0.25  0.25  0.23  0.23  0.22  0.20  0.20
    50%       0.28  0.26  0.27  0.24  0.22  0.23  0.20  0.20  0.19  0.19
    60%       0.25  0.25  0.23  0.22  0.21  0.20  0.19  0.18  0.17  0.17
    70%       0.23  0.21  0.21  0.21  0.20  0.19  0.18  0.17  0.16  0.17
    80%       0.23  0.21  0.20  0.20  0.18  0.18  0.18  0.17  0.16  0.16
    90%       0.20  0.18  0.19  0.18  0.18  0.16  0.16  0.16  0.16  0.15

Having tested the robustness of the sorted vectors to predict domain information, we have confidence to apply our approach to experimental datasets where it is understood that false-negative and false-positive rates are high. The domain assignment predicted by the algorithm in each case can be verified by checking if a near bipartite structure exists between the assigned lock and key domain proteins.

3.2 Biological experimental data

From testing on synthetic networks, it is apparent that a heuristic approach is required to identify domain information and, thus, bipartite subgraphs in experimental datasets. In this case we can only hope to identify approximate eigenvalue pairs. This is mainly owing to the well-known noisiness of the datasets, which include a large number of false-positives and false-negatives, but also to the presence of multi-domain/multi-interaction proteins.

For all networks, except for the negative control (von Mering dataset), we are able to identify three approximate pairs of eigenvalues. For each pair in every dataset we attempt to delimit a bipartite subgraph using the method explained above. The threshold value for inclusion in the subgraph varies in each case, and is chosen by inspection of the Sum and Difference vectors. Figure 5 illustrates subgraphs that were identified. Here, for ease of visualization, we are showing the adjacency matrix, with a dot denoting a non-zero entry. Where more than one subgraph is shown for a particular species, these came from different eigenvector pairs.

Fig. 5. Bipartite subgraphs in interactomes of different species: a bipartite structure is indicated by a two-by-two checkerboard pattern with the non-zero blocks away from the diagonal. (a) Uetz Network, 1st Pair. (b) Uetz Network, 2nd Pair. (c) A.thaliana, 2nd Pair. (d) H.sapiens, 1st Pair. (e) S.cerevisiae, 2nd Pair. (f) S.cerevisiae, 3rd Pair. (g) H.pylori, 1st Pair. (h) D.melanogaster, 1st Pair. (i) D.melanogaster, 2nd Pair. (j) M.musculus, 1st Pair. (k) M.musculus, 3rd Pair.

We note that these structures may be used to infer protein–protein interactions by proposing that the lock and key pairs which have not been experimentally observed to interact may in fact do so.

3.3 Biological interpretation of bipartite subgraphs

To validate the biological relevance of the observed bipartite structures we chose to focus on the yeast interaction network reported by Uetz et al. (2000), which has been widely analysed and comes from an organism with exceptionally well-understood biology. The subgraphs from Figure 5a and b are shown in Figure 6 with corresponding protein names.

Fig. 6. Bipartite subgraphs in the Uetz network. Dotted lines indicate intra-group connections. (a) Extracted using first eigenvalue pair. (b) Extracted using second eigenvalue pair.

We first focus on the smaller bipartite subgraph obtained using the second eigenvector pair (Fig. 6b). Members of this subgraph are discussed in the original paper (Uetz et al., 2000) as part of a larger LSM pathway. The entire group of LSM pathway proteins has 18 members, of which we have identified 8. Additional members are found if we look at the largest components in the Sum and Difference vectors: We find 17 of these proteins within the top 22 of the Difference vector, and the one protein which is not found there is ranked third in the Sum vector. This gives further evidence that these vectors represent biologically relevant information. For another validation of our results, we use the iterative Group Analysis method (iGA) (Breitling et al., 2004). In comparison with our technique, which uses an artificial threshold to identify bipartite subgraphs, the iGA method takes a ranked list of the entire dataset as input, along with annotations for each entity in the network, and identifies any enriched subgraphs that exist within the highly ranked proteins. We produce the ranked list by ordering the proteins in the network based on the ordering of the Sum and Difference vectors used to identify the bipartite subgraphs. The results for the second eigenvalue pair are given in Table 2. We can see that ranking the proteins on both vectors produces similar results and confirms that these proteins are involved in the LSM pathway since proteins annotated with the Pfam database term LSM are highly enriched in both lists. The results also identify the Sm domain as being highly enriched among the proteins in the bipartite set. This is again owing to the LSM proteins which are characterized by this domain. It is, however, unlikely that the Sm domain is the interaction domain in this case, since we find that the Sm domain is present in both the 'key' and 'lock' group, and both vectors produce similar rankings of proteins. This suggests that the bipartite structure identified may in fact be part of a fully connected cluster, and the connections which have been experimentally observed indicate a bipartite structure by chance. It is also important to note that the iGA analysis gives strong indications with respect to the biological function of this particular bipartite structure. It seems to be involved in spliceosomal rRNA processing, again in accordance with previous biological knowledge (Pillai et al., 2003).

Table 2. iGA results for the second eigenvector pair ordering

    Vector      Database   Class                                      p-value
    Sum         InterPro   Sm_like_riboprot                           6.56 · 10
                Pfam       LSM                                        6.56 · 10
                ProDom     SnRNP                                      6.56 · 10
                InterPro   snRNP_Sm                                   6.56 · 10
                SMART      Sm                                         6.41 · 10
                GO         Nuclear mRNA splicing, via spliceosome     6.11 · 10
                GO         Small nuclear ribonucleoprotein complex    2.18 · 10
                InterPro   snRNP                                      1.15 · 10
                GO         Pre-mRNA splicing factor activity          5.13 · 10
    Difference  InterPro   snRNP_Sm                                   8.17 · 10
                ProDom     SnRNP                                      8.17 · 10
                Pfam       LSM                                        8.17 · 10
                InterPro   Sm_like_riboprot                           8.17 · 10
                SMART      Sm                                         7.18 · 10
                GO         rRNA processing                            4.65 · 10
                GO         Nuclear mRNA splicing, via spliceosome     3.66 · 10
                GO         Small nuclear ribonucleoprotein complex    1.15 · 10
                GO         Pre-mRNA splicing factor activity          1.16 · 10

Having validated our approach on a known subgraph, which was already discussed in the original publication, we now investigate the bipartite structure identified from the first eigenvector pair (Fig. 6a). To our knowledge this biologically very interesting group has so far escaped attention. As above, we use the iGA method to identify the enriched protein domains and functions present within this subgroup. The results are given in Table 3. Results are only included where the enriched subclass includes members from the bipartite subgraph.

Table 3. iGA results for the first eigenvector pair ordering

    Vector      Database       Class                          p-value
    Sum         GO             Integral to Golgi membrane     3.99 · 10
    Difference  ProDom         SH3                            8.14 · 10
                PRINTS         SH3DOMAIN                      3.73 · 10
                PROSITE        SH3                            1.21 · 10
                InterPro       SH3                            1.21 · 10
                Pfam           SH3                            1.21 · 10
                SMART          SH3                            1.05 · 10
                GO             Actin cortical patch           4.56 · 10
                Pfam/InterPro  DUF500                         2.88 · 10
                GO             Actin filament organization    3.35 · 10

In this case, the iGA method clearly shows that proteins with the SH3 domain are strikingly enriched within the 'key' group which is derived from the difference vector. We also obtain a first indication of the biological relevance of the interaction pattern: The GO terms 'actin cortical patch', 'actin filament organization', 'transmembrane' and 'integral to Golgi membrane' are overly abundant among the proteins of interest. These results are further strengthened when we examine the Gene Ontology annotations for the lock and key groups directly, rather than on the entire eigenvectors. The resulting p-values for these are listed in Table 4. Again, many proteins of the lock group are annotated with terms involving actin and Golgi, with even stronger support when these terms are combined ('all actin/Golgi combined').

Table 4. p-values for GO annotations found in Key and Lock Group

    Group       GO term                            p-value
    Key Group   Actin filament organization        2.9916 · 10
    Lock Group  Actin cytoskeleton organization    4.3125 · 10
                All actin combined                 1.6337 · 10
                Actin cortical patch assembly      1.8927 · 10
                Integral to Golgi membrane         2.4789 · 10
                All Golgi combined                 4.6637 · 10

The biological relevance of this interaction pattern is obvious, but was entirely unknown when the interaction dataset was first reported. The SH3 domain is one of the best characterized protein binding motifs (Mayer, 2001). It is present in all our 'key' proteins (Fig. 7) and is very likely to be the physical representative of the interaction motif. Where more than one SH3 domain is present within a protein, we are unable to determine which domain is interacting. On the other hand, the proteins of the 'lock' group are part of the actin cortical patch assembly mechanism of vesicle endocytosis (Drees et al., 2001). They were also identified as part of a larger group by a clustering method in Arnau et al. (2005), but missing the highly relevant interaction with SH3 domain proteins. The involvement of SH3 proteins in linking cytoskeletal dynamics and the trafficking of vesicles, particularly Golgi membranes, has only very recently been discovered in biological experiments (Friesen et al., 2005; Kessels and Qualmann, 2004). By linking vesicular membranes with actin polymerization, SH3 domain proteins contribute the crucial mechanistic connection between membrane trafficking and the cytoskeleton. The bipartite subgraph that we have identified extends on the previously reported interactions and may motivate important cell biological follow-up experiments.

Fig. 7. SH3 domains in key group proteins.

For all other bipartite subgraphs identified by our algorithm, protein names and annotations are given where available in the Supplementary Material. To our knowledge these novel bipartite structures and most of the corresponding interactions have not previously been reported. These additional subgraphs include a number of other biologically very interesting gene groups, such as the ion-transporter module identified in the Mus musculus interactome, which further highlights the validity of our approach.

Although the number of bipartite subgraphs identified is reasonably small, our method is wholly reliant on the underlying data, which are understood to be extremely noisy. At present we tend to detect only the striking examples of lock-and-key interactions. As data become more reliable and complete, we expect our approach to identify the lock-and-key interactions with greater coverage.

REFERENCES

Alfarano,C. et al. (2005) The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res., 33, D418–D424.
Arnau,V. et al. (2005) Iterative cluster analysis of protein interaction data. Bioinformatics, 21, 364–378.
Bader,J.S. et al. (2004) Gaining confidence in high-throughput protein interactions. Nat. Biotechnol., 22, 78–85.
Bamber,D. (1975) The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psychol., 12, 387–415.
Bateman,A. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.
Bork,P. et al. (2004) Protein interaction networks from yeast to human. Curr. Opin. Struct. Biol., 14, 292–299.
Breitling,R. et al. (2004) Iterative Group Analysis (iGA): a simple tool to enhance sensitivity and facilitate interpretation of microarray experiments. BMC Bioinformatics, 5, 34.
Drees,B.L. et al. (2001) A protein interaction map for cell polarity development. J. Cell Biol., 154, 549–571.
Edwards,A.M. et al. (2002) Bridging structural biology and genomics: assessing protein–protein interaction datasets. Trends Genet., 18, 529–536.
Friesen,H. et al. (2005) Interaction of Saccharomyces cerevisiae cortical actin patch protein Rvs167p with proteins involved in ER to Golgi vesicle trafficking. Genetics, 170, 555–568.
Golub,G.H. and Van Loan,C.F. (1996) Matrix Computations. The Johns Hopkins University Press.
Gribskov,M. and Robinson,N.L. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., 20, 25–33.
Ito,T. et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574.
Jeong,H. et al. (2000) The large-scale organization of metabolic networks. Nature, 407, 651–654.
Kessels,M.M. and Qualmann,B. (2004) The syndapin protein family: linking membrane trafficking with the cytoskeleton. J. Cell Sci., 117, 3077–3086.
Khanin,R. and Wit,E. (2006) How scale-free are gene-networks? J. Comput. Biol., 13, 810–818.
Mayer,B.J. (2001) SH3 domains: complexity in moderation. J. Cell Sci., 114,

4 CONCLUSION

From the initial lock-and-key model of protein–protein interaction networks, we have devised an algorithm that identifies proteins containing the lock and key aspects of a particular interaction surface. This is achieved through a search for bipartite subgraphs in protein–protein interaction networks derived from Y2H experiments using a spectral approach. Unlike traditional clustering techniques, we identify groups that are not internally highly similar, but have a large number of interactions with another group. We have demonstrated that under certain modelling assumptions our approach is guaranteed to identify the correct domain information about proteins in a network. As experimental interaction networks are only approximated by our model, we adopt a heuristic approach to identifying bipartite subgraphs. The main ingredients of the algorithm are Sum and Difference vectors, formed from the corresponding eigenvectors of eigenvalue pairs of (approximately) the form $\pm\lambda$.
We demonstrated that this approach reveals bipartite subgraphs 1253–1261. across a large variety of protein interaction networks from diverse Pillai,R.S. et al. (2003) Unique Sm core structure of U7 snRNPs: species. For one of these subgraphs, from S.cerevisiae, we showed assembly by a specialized SMN complex and the role of a new component, Lsm11, in histone RNA processing. Genes Dev., 17, how our method discovers a novel and biologically exciting inter- 2321–2333. acting group, including identification of the physiological function Prz zulj,N. et al. (2004) Modeling interactome: scale-free or geometric? Bioinformatics, and the physical interaction motif, the SH3 domain. Used in this 20, 3808–3515. way, our approach has the potential to add considerable value to the Riley,R. et al. (2005) Inferring protein domain interactions from databases of inter- acting proteins. Genome Biol., 6, R89. experimentally observed interaction networks. Salwinski,L. and Eisenberg,D. (2003) Computational methods of analysis of protein– protein interactions. Curr. Opin. Struct. Biol., 13, 377–382. ACKNOWLEDGEMENTS Thomas,A. et al. (2003) On the structure of protein–protein interaction networks. Biochem. Soc. Trans., 31, 1491–1496. J.L.M. was supported by a Synergy scholarship (www.strath.gla.ac. Uetz,P. et al. (2000) A comprehensive analysis of protein–protein interactions in uk/synergy). R.B. was supported by a Caledonian Research Saccharomyces cerevisiae. Nature, 403, 623–627. Foundation Personal Fellowship. D.J.H. was supported by EPSRC von Mering,C. et al. (2002) Comparative assessment of large-scale datasets of grant GR/S62383/01. protein–protein interactions. Nature, 417, 399–403. Conflict of Interest: none declared. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press
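The Sum and Difference construction summarized in the Conclusion can be illustrated with a minimal Python sketch. This is not the authors' Matlab implementation. The function name `lock_key_groups`, the `threshold` parameter and the nine-protein toy network are hypothetical, introduced only for illustration. The sketch also simplifies the method by taking the most negative and most positive eigenvalues as the ±λ pair, whereas the paper groups eigenvalues into approximate pairs and chooses thresholds by inspection.

```python
import numpy as np


def lock_key_groups(adj, threshold=0.1):
    """Split proteins into two groups using one +/-lambda eigenpair.

    Simplifying assumption: the extreme eigenvalues of the symmetric
    adjacency matrix form the +/-lambda pair of interest.
    """
    vals, vecs = np.linalg.eigh(adj)        # eigenvalues in ascending order
    u_neg, u_pos = vecs[:, 0], vecs[:, -1]  # eigenvectors for -lambda, +lambda
    sum_vec = u_pos + u_neg                 # Sum vector
    diff_vec = u_pos - u_neg                # Difference vector
    # Components above the threshold (in magnitude) define each group.
    group_a = sorted(np.flatnonzero(np.abs(sum_vec) > threshold).tolist())
    group_b = sorted(np.flatnonzero(np.abs(diff_vec) > threshold).tolist())
    return (vals[0], vals[-1]), group_a, group_b


# Hypothetical toy network: 3 "lock" proteins, 4 "key" proteins, and one
# unrelated interacting pair (proteins 7 and 8).
n = 9
locks, keys = [0, 1, 2], [3, 4, 5, 6]
A = np.zeros((n, n))
for i in locks:
    for j in keys:
        A[i, j] = A[j, i] = 1.0   # complete bipartite lock-key block
A[7, 8] = A[8, 7] = 1.0           # interaction outside the block

pair, group_a, group_b = lock_key_groups(A)
print("eigenvalue pair:", pair)
print("groups:", group_a, group_b)
```

Eigenvector signs are arbitrary, so either returned group may correspond to the locks. For this toy block the eigenvalue pair is ±√(3·4), which matches the ±√(keysum · locksum) prediction of the model when every lock-key interaction is observed.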

A lock-and-key model for protein–protein interactions



Publisher
Oxford University Press
Copyright
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/btl338
pmid
16787977

iGA results for the second eigenvector pair ordering Vector Database Class p-value Sum Interpro Sm_like_riboprot 6.56 · 10 Pfam LSM 6.56 · 10 ProDom SnRNP 6.56 · 10 InterPro snRNP_Sm 6.56 · 10 SMART Sm 6.41 · 10 GO Nuclear mRNA splicing, 6.11 · 10 via spliceosome GO Small nuclear ribonucleoprotein 2.18 · 10 complex InterPro snRNP 1.15 · 10 GO Pre-mRNA splicing factor activity 5.13 · 10 Difference Interpro snRNP_Sm 8.17 · 10 ProDom SnRNP 8.17 · 10 Pfam LSM 8.17 · 10 Interpro Sm_like_riboprot 8.17 · 10 SMART Sm 7.18 · 10 GO rRNA processing 4.65 · 10 GO Nuclear mRNA splicing, via 3.66 · 10 spliceosome GO Small nuclear ribonucleoprotein 1.15 · 10 complex GO Pre-mRNA splicing factor activity 1.16 · 10 Table 3. iGA results for the first eigenvector pair ordering Vector Database Class p-value Add GO Integral to Golgi membrane 3.99 · 10 Difference ProDom SH3 8.14 · 10 PRINTS SH3DOMAIN 3.73 · 10 PROSITE SH3 1.21 · 10 Interpro SH3 1.21 · 10 Pfam SH3 1.21 · 10 SMART SH3 1.05 · 10 GO Actin cortical patch 4.56 · 10 Pfam/Interpro DUF500 2.88 · 10 GO Actin filament organization 3.35 · 10 Fig. 6. Bipartite subgraphs in the Uetz network. Dotted lines indicate intra- group connections. (a) Extracted using first eigenvalue pair (b) Extracted using second eigenvalue pair. Table 4. p-values for GO annotations found in Key and Lock Group and the cytoskeleton. The bipartite subgraph that we have identified extends on the previously reported interactions and may motivate GO term p-value important cell biological follow-up experiments. For all other bipartite subgraphs identified by our Key Group Actin filament organization 2.9916 · 10 algorithm, protein names and annotations are given where 5 Lock Group Actin cytoskeleton organization 4.3125 · 10 available in the Supplementary Material. To our knowledge these All actin combined 1.6337 · 10 novel bipartite structures and most of the corresponding interactions Actin cortical patch assembly 1.8927 · 10 have not previously been reported. 
These additional subgraphs Integral to Golgi membrane 2.4789 · 10 All Golgi combined 4.6637 · 10 include a number of other biologically very interesting gene groups, such as the ion-transporter module identified in the Mus musculus interactome, which further highlights the validity of our approach. to detect only the striking examples of lock-and-key interactions. Although the number of bipartite subgraphs identified is reason- As data becomes more reliable and complete, we expect our ably small, our method is wholly reliant on the underlying data approach to identify the lock-and-key interactions with greater which is understood to be extremely noisy. At present we tend coverage. 2018 Lock-and-key model REFERENCES Alfarano,C. et al. (2005) The Biomolecular Interaction Network Database and related tools 2005 update. Nucleic Acids Res., 1, D418–D424. Arnau,V. et al. (2001) Iterative cluster analysis of protein interaction data. Bioinformatics, 21, 364–378. Bader,J.S. et al. (2004) Gaining confidence in high-throughput protein interactions. Nat. Biotechnol., 22, 78–85. Bamber,D. (1975) The area above the ordinal dominance graph and the area below the receiver operating characteristic graph. J. Math. Psychol., 12, 387–415. Bateman,A. et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141. Breitling,R. et al. (2004) Iterative Group Analysis (iGA): a simple tool to enhance sensitivity and facilitate interpretation of microarray experiments. BMC Bioinformatics, 5, 34. Bork,P. et al. (2004) Protein interaction networks from yeast to human. Curr. Opin. Fig. 7. SH3 domains in key group proteins. Struct. Biol., 14, 292–299. Drees,B.L. et al. (2001) A protein interaction map for cell polarity development. J. Cell boil., 154, 549–571. 4 CONCLUSION Edwards,A.M. et al. (2002) Bridging structural biology and genomics: assessing protein–protein interaction datasets. Trends Genet., 18, 529–536. 
From the initial lock-and-key model of protein–protein interaction Friesen,H. et al. (2005) Interaction of Saccharomyces cerevisiae cortical actin patch networks, we have devised an algorithm that identifies proteins protein Rvs167p with proteins involved in ER to Golgi vesicle trafficking. containing the lock and key aspects of a particular interaction sur- Genetics, 170, 555–568. face. This is achieved through a search for bipartite subgraphs in Golub,G.H. and Van Loan,C.F. (1996) Matrix Computations. The John Hopkins University Press. protein–protein interaction networks derived from Y2H experi- Gribskov,M. and Robinson,N.L. (1996) Use of receiver operating ments using a spectral approach. Unlike traditional clustering tech- characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem., niques, we identify groups that are not internally highly similar, but 20, 25–33. have a large number of interactions with another group. We have Ito,T. et al. (2001) A comprehensive two-hybrid analysis to explore the demonstrated that under certain modelling assumptions our yeast protein interaction interactome. Proc. Natl Acad. Sci. USA, 98, 4569–4574. approach is guaranteed to identify the correct domain information Jeong,H. et al. (2000) The large-scale organization of metabolic networks. Nature, 507, about proteins in a network. As experimental interaction networks 651–654. are only approximated by our model, we adopt a heuristic approach Kessels,M.M. and Qualmann,B. (2004) The syndapin protein family: linking mem- to identifying bipartite subgraphs. The main ingredients of the algo- brane trafficking with the cytoskeleton. J. Cell Sci., 17, 3077–3086. Khanin,R. and Wit,E. (2006) How scale-free are gene-networks? J. Comput. Biol., 13, rithm are Sum and Difference vectors, formed from the correspond- 810–818. ing eigenvectors of eigenvalue pairs of (approximately) the form Mayer,B.J. (2001) SH3 domains: complexity in moderation. J. Cell Sci., 114, ±l. 
We demonstrated that this approach reveals bipartite subgraphs 1253–1261. across a large variety of protein interaction networks from diverse Pillai,R.S. et al. (2003) Unique Sm core structure of U7 snRNPs: species. For one of these subgraphs, from S.cerevisiae, we showed assembly by a specialized SMN complex and the role of a new component, Lsm11, in histone RNA processing. Genes Dev., 17, how our method discovers a novel and biologically exciting inter- 2321–2333. acting group, including identification of the physiological function Prz zulj,N. et al. (2004) Modeling interactome: scale-free or geometric? Bioinformatics, and the physical interaction motif, the SH3 domain. Used in this 20, 3808–3515. way, our approach has the potential to add considerable value to the Riley,R. et al. (2005) Inferring protein domain interactions from databases of inter- acting proteins. Genome Biol., 6, R89. experimentally observed interaction networks. Salwinski,L. and Eisenberg,D. (2003) Computational methods of analysis of protein– protein interactions. Curr. Opin. Struct. Biol., 13, 377–382. ACKNOWLEDGEMENTS Thomas,A. et al. (2003) On the structure of protein–protein interaction networks. Biochem. Soc. Trans., 31, 1491–1496. J.L.M. was supported by a Synergy scholarship (www.strath.gla.ac. Uetz,P. et al. (2000) A comprehensive analysis of protein–protein interactions in uk/synergy). R.B. was supported by a Caledonian Research Saccharomyces cerevisiae. Nature, 403, 623–627. Foundation Personal Fellowship. D.J.H. was supported by EPSRC von Mering,C. et al. (2002) Comparative assessment of large-scale datasets of grant GR/S62383/01. protein–protein interactions. Nature, 417, 399–403. Conflict of Interest: none declared.
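
Illustrative sketch. The Conclusion describes Sum and Difference vectors formed from the eigenvectors of an eigenvalue pair of the form ±λ. The following minimal Python sketch shows that idea on a toy, exactly bipartite network. It is not the authors' implementation: the function names, the shifted power iteration used as the eigen-solver, the restriction to the dominant ±λ pair, and the fixed tolerance (standing in for the threshold that the text says is chosen by inspection) are all assumptions made for illustration.

```python
import math
import random


def dominant_eigenvector(matrix, shift, sign, iterations=500, seed=0):
    """Power iteration on (shift * I + sign * matrix); returns a unit vector.

    With shift >= spectral radius, sign=+1 converges to the eigenvector of the
    largest eigenvalue of `matrix`, sign=-1 to that of the smallest (most
    negative) eigenvalue.
    """
    n = len(matrix)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    vec = [rng.uniform(0.5, 1.5) for _ in range(n)]
    for _ in range(iterations):
        nxt = [
            shift * vec[i] + sign * sum(matrix[i][j] * vec[j] for j in range(n))
            for i in range(n)
        ]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def sum_and_difference(adjacency):
    """Sum and Difference of the eigenvectors for the dominant +/-lambda pair."""
    shift = max(sum(row) for row in adjacency)  # max degree bounds |lambda|
    v_plus = dominant_eigenvector(adjacency, shift, +1)
    v_minus = dominant_eigenvector(adjacency, shift, -1)
    total = [p + m for p, m in zip(v_plus, v_minus)]
    diff = [p - m for p, m in zip(v_plus, v_minus)]
    return total, diff


def split_groups(adjacency, tol=1e-6):
    """Nodes with non-negligible Sum entries, and those with non-negligible
    Difference entries (hypothetical fixed tolerance, see lead-in)."""
    total, diff = sum_and_difference(adjacency)
    group_a = {i for i, x in enumerate(total) if abs(x) > tol}
    group_b = {i for i, x in enumerate(diff) if abs(x) > tol}
    return group_a, group_b


# Toy network: nodes 0-1 interact only with nodes 2-4 (complete bipartite).
toy = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
]
print(split_groups(toy))
```

On this toy network the two returned sets are {0, 1} and {2, 3, 4}; which of them comes from the Sum vector and which from the Difference vector depends on the arbitrary sign of the second eigenvector. Real interaction data would give only approximate ±λ pairs, so the entries would no longer be exactly zero on one side and a tolerance of this kind would not be appropriate.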

Journal

Bioinformatics, Oxford University Press

Published: Jun 20, 2006
