DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions

Mu Gao; Jeffrey Skolnick

doi:10.1093/nar/gkn332

DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions

Gao, Mu; Skolnick, Jeffrey 2008-07-31 00:00:00 3978–3992 Nucleic Acids Research, 2008, Vol. 36, No. 12 Published online 31 May 2008 doi:10.1093/nar/gkn332 DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions Mu Gao and Jeffrey Skolnick* Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, 250 14th Street NW, Atlanta, GA 30318, USA Received April 4, 2008; Revised May 5, 2008; Accepted May 8, 2008 ABSTRACT targets already deposited in the PDB (http://targetdb. pdb.org/). Since many targets are representatives of pre- The structures of DNA–protein complexes have illu- viously uncharacterized protein families, the function of a minated the diversity of DNA–protein binding large number of these proteins is unknown. Identifying mechanisms shown by different protein families. their function is an important challenge. In recent years, This lack of generality could pose a great challenge many computational methods have been developed to for predicting DNA–protein interactions. To address assist in functional annotation (2–4). Compared to experi- this issue, we have developed a knowledge-based mental studies, computational methods have the advan- method, DNA-binding Domain Hunter (DBD-Hunter), tage of high eﬃciency and low cost. Most are based on the idea of functional inference through homology. for identifying DNA-binding proteins and associated While sequence comparison methods (5,6) are very power- binding sites. The method combines structural com- ful (7–9), they may oﬀer limited help with the task of parison and the evaluation of a statistical potential, assigning functions for structural genomics targets which we derive to describe interactions between because many have low-sequence similarity to previously DNA base pairs and protein residues. We demon- characterized proteins. Structure-based methods may strate that DBD-Hunter is an accurate method for provide additional clues to a protein’s function because predicting DNA-binding function of proteins, and that structure is better conserved than sequence (10). However, DNA-binding protein residues can be reliably inferred since a common fold may be shared by proteins with very from the corresponding templates if identified. In diﬀerent functions, it remains a challenge to infer protein benchmark tests on 4000 proteins, our method function on the basis of structure alone (11). achieved an accuracy of 98% and a precision of 84%, An area where protein structure could potentially be which significantly outperforms three previous useful is in the identiﬁcation of DNA-binding proteins. methods. We further validate the method on DNA- Such proteins play an essential role in a cell and are binding protein structures determined in DNA-free involved in transcription, replication, packaging, repair and rearrangement. It has been estimated that 2–3% of (apo) state. We show that the accuracy of our method prokaryotic proteins and 6–7% of eukaryotic proteins is only slightly affected on apo-structures compared bind DNA (12). To understand the basic rules, many to the performance on holo-structures cocrystallized eﬀorts have investigated the patterns of DNA–protein with DNA. Finally, we apply the method to 1700 interactions observed in the structures of DNA–protein structural genomics targets and predict that 37 tar- complexes (13–15). In contrast to the strict base pairing gets with previously unknown function are likely to rule observed in double-stranded DNA (dsDNA), it is be DNA-binding proteins. DBD-Hunter is freely avail- now clear that there is no simple code for protein–DNA able at http://cssb.biology.gatech.edu/skolnick/ recognition. Instead, numerous mechanisms are exhibited webservice/DBD-Hunter/. by diﬀerent protein families (12). Given the structure of a protein whose function is unknown, one wishes to answer the following questions: INTRODUCTION (i) Does this protein have a DNA-binding function? With the progress of structural genomics projects, an (ii) If so, where are its DNA-binding sites? (iii) What increasing number of protein structures have become speciﬁc DNA sequences, if any, does the protein recog- nize? To address the ﬁrst problem, several knowledge- available (1). As of early February 2008, a total of 160 792 based approaches have been developed (16–20). Shanahan proteins have been registered as targets by structural genomics centers worldwide, with the structures of 5396 et al. (18) used structural comparison to detect three types *To whom correspondence should be addressed. Tel: +1 404 407 8975; Fax: +1 404 385 7478; Email: [email protected] 2008 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2008, Vol. 36, No. 12 3979 of well-deﬁned DNA-binding structural motifs, helix-turn- all X-ray structures of protein–DNA complexes with helix (HTH), helix-hairpin-helix (HhH) and helix-loop- resolution better than 3.0 A. The resulting 1045 complex helix (HLH), which appear in 30% of 54 structural structures were further partitioned into chains and families based on an earlier classiﬁcation of DNA–protein analyzed. We only consider protein–DNA complexes complex structures (12). Sophisticated machine-learning that satisfy the following three conditions: (i) the protein methods have also been attempted (16,17,19,20). These has a minimum of 40 amino acids; (ii) the DNA molecule utilize techniques such as neural networks (16,19), logistic is dsDNA with at least six base pairs and (iii) the protein regression (20) and support vector machines (17). Features has at least ﬁve protein residues within 4.5 A of the DNA used by these methods include the composition of amino molecule. The analysis led to 1676 DNA-binding protein acids, the charge/dipole moment of the protein molecule chains. An all-against-all global sequence alignment was and the presence of positively charged surface patches. On performed on these protein chains using the ALIGN0 average, these studies reported sensitivities ranging from program (27) from the FASTA2 package. Pairwise 70% to 90% and speciﬁcities ranging from 65% to 95%. sequence identity is deﬁned as the ratio of the number of To address the second question, similar machine-learning identical residues over the length of the shorter sequence. methods have been developed to predict individual DNA- We used the number of DNA–protein contacts, structure binding residues on proteins (21–23), with an average resolution and the available literature to select only accuracy ranging from 65% to 80%. To address the third one representative among protein chains with larger problem, statistical models have been introduced to than 35% sequence identity, leading to a nonredundant characterize the speciﬁcity of DNA sequences for a given set such that any two protein chains have <35% sequence DNA-binding protein (13,24–26), with a few successful identity. SCOP annotations (28) and visual inspection examples reported. were used to identify the DNA-binding domain (DBD) Since DNA-binding proteins likely comprise only a small for each protein chain. The resulted 179 DNA-binding fraction of structural genomics targets, for practical protein domains and associated DNA chains are listed in applications, it is necessary to develop a method with Supplementary Table 1. high precision. Otherwise, the number of false positives NB3797. A control set of 3797 non-DNA-binding protein could easily outnumber the true positives, rending such an chains (NB3797) was created from a nonredundant set approach is impractical for automatic function assignment. All machine-learning methods mentioned above, however, of 7037 protein structures, which comprise the reported relatively low speciﬁcities for non-DNA-binding PROSPECTOR threading template library built from proteins (16,17,19,20). In addition, these rates were the May 2006 PDB release by clustering all PDB protein obtained on small sets of less than 250 structures. It is not structures using a 35% global sequence identity cutoﬀ and clear whether similar speciﬁcity would be obtained on a choosing one representative from each cluster (29). From much larger data set of thousands of structures, a scenario these 7037 chains, we select the 5636 proteins with SCOP relevant to proteomic-scale applications. To address these annotations. We further discarded any chain if its PDB issues, we describe a new method, DNA-binding Domain record contains a DNA molecule or its SCOP annotation Hunter (DBD-Hunter), for the prediction of DNA-binding contains the keyword ‘hypothetical’. Protein chains were proteins and associated DNA-binding sites. The method manually inspected to determine whether its SCOP super- uses both structural comparisons and a DNA–protein family, family and domain annotations contain the key- statistical potential to assess whether or not a given protein word ‘DNA’. Those whose function is associated with binds DNA. We demonstrate that DBD-Hunter achieves DNA-binding were removed. All ribosomal proteins were an extremely high speciﬁcity of 99.5% and a precision of also excluded. For each of the resulting 3911 protein 84% in benchmark tests, with a sensitivity of 47% obtained chains, the keyword ‘DNA’ was searched in the title, on unbound and a sensitivity of 58% on DNA bound abstract and keyword sections of its primary citation. protein structures. By way of illustration as to its appli- Positive hits were inspected by reading the literature to cability, we apply our method to 1697 structural genomics exclude DNA-binding proteins. The ﬁnal 3797 protein targets and predict that 37 previously unknown targets are chains compose NB3797. DNA-binding proteins. APO104/HOLO104. A total of 104 pairs of DNA- binding protein structures determined both in the absence METHODS and presence of DNA were constructed from the PDB May 2007 release. The PDB was queried to retrieve two Availability sets of proteins: the holo structure set consists of 759 All datasets listed below, the statistical potential para- proteins cocrystallized with a dsDNA molecule; the apo meters and a web-server implementation of DBD-Hunter set comprises 35 899 crystal/NMR structures determined are available at http://cssb.biology.gatech.edu/skolnick/ without any DNA. An all-against-all sequence alignment ﬁles/. between the two sets was performed following the same procedure described above. The alignment procedure Data sets resulted in 679 holo–apo pairs, which have sequence DB179. A set of 179 DNA–protein complex structures identity >35% between each pair. The 679 holo chains (DB179) were selected by the following procedure: were further culled by excluding redundant sequences The July 2007 release of the PDB was queried to retrieve with an identity cutoﬀ of 35%. One representative was 3980 Nucleic Acids Research, 2008, Vol. 36, No. 12 selected among proteins with pairwise sequence identity protein residue type and a reduced DNA atom is deﬁned as: >35%, using literature and structure resolution informa- obs tion for guidance. The ﬁnal results are 104 holo-structures e ¼ ln 1 exp and their corresponding apo-structures, denoted as HOLO104 and APO104, respectively. Most of these obs where N is the number of observed contacts for the ad holo–apo pairs have high-sequence identity, 100% for 62 exp pair and N is the number of expected contacts assuming exp pairs and >95% for 91 pairs. The remaining 13 pairs, no preferential interaction. The reference state, N , which have sequence identities ranging from 45% to 95% is deﬁned by the product of the total number of observed are composed of apo–holo homologs from the same obs contacts N and the mole fractions of a and d, namely total SCOP family. exp obs N ¼ N f f 2 total SG1697. A set of structural genomics targets was selected The statistical potential energy E (also named the from the Jan 2008 PDB release. The PDB was queried interfacial energy) of a DNA–protein complex structure with the classiﬁcation keyword ‘structural genomics’, is deﬁned as the sum of pair interactions for all protein– resulting in 1886 PDB entries. These were split into DNA contacts. protein chains, which were further clustered at a 90% The Z-score of a native complex structure is deﬁned as: sequence identity cutoﬀ with the program CD-HIT (30). From each cluster, we randomly select one representative. E E nat ave Z-score ¼ 3 These 1697 representatives compose the set SG1697. Statistical pair potential where E and s are the average and SD of the statistical ave potential energies of all random structures, and E is the nat To derive a statistical pair potential for describing DNA– statistical potential energy of the native complex. In speci- protein interactions, we consider a contact between a ﬁcity tests, random structures were obtained by replacing protein residue and a functional group of DNA deﬁned as interfacial DNA or protein residues with random nucleo- that when at least one heavy atom from the protein resi- tides or amino acids. We assume that a contact is formed due is within 4.5 A of at least one heavy atom from the in a random structure at the same location as observed corresponding DNA functional group. Four types of in the native structure. Since the imidazole group only functional groups were considered for DNA nucleotides appears in purines, we ignore any contact involving (Figure 1). Pyrimidines C and T have the phosphate (PP), a purine imidazole group if the purine is replaced by a the sugar (SU) and the pyrimidine (PY) groups. In addi- pyrimidine. tion, purines A and G have a fourth group, the imidazole (IM) group. Note that all DNA groups are residue DNA-binding interaction prediction protocol speciﬁc. To diﬀerentiate them, we add their nucleotide names as preﬁxes, e.g. A.PP represents the phosphate Using structural alignment and the statistical potential, group of an adenine. we developed a new method, DBD-Hunter, to predict A knowledge-based statistical pair potential was devel- DNA–protein interactions, given a target structure. First, oped from an analysis in the DNA–protein complex set the target structure is scanned against the template library DB179 (Supplementary Figure 1). The derivation of the DB179 for similar protein structures with the structural potential was based on the assumption that the frequencies alignment program TM-align (9). Only Ca backbone of observed pair interaction states follow a Boltzmann coordinates are used for the structural alignment and distribution (31,32). The pair interaction energy between for root mean squared deviation (RMSD) calculations. A TM-score >0.4 indicates signiﬁcant structural align- ment (9). To reduce the number of false positives, we employed the higher TM-score threshold of 0.55 for template selection (see below). For templates with a TM- score better than the threshold, the statistical potential energy between the target protein and the template DNA is calculated by evaluating contacts within the structurally aligned regions. The contact evaluation follows the similar procedure adopted in structure threading by replacing original template protein residues with corresponding aligned template residues (29). The templates are then ranked according to their interfacial energies. If the lowest interfacial energy is below a (to be determined) energy threshold, the target is predicted to be a DNA-binding Figure 1. Scheme of DNA nucleotide functional groups considered for protein. If no template is found in either the structural protein–DNA contact analysis. The phosphate, sugar, pyrimidine and alignment or satisfying the energy criterion, the target imidazole groups are colored in green, orange, red and blue, is classiﬁed as a non-DNA-binding protein. For proteins respectively. Note that the functional groups are base speciﬁc, e.g. predicted as a DNA-binding protein, we further infer the phosphate group of adenine is diﬀerent from the phosphate group of thymine. DNA-binding protein residues from their templates. Nucleic Acids Research, 2008, Vol. 36, No. 12 3981 Table 1. Optimization of TM-score and energy threshold parameters A DNA-binding protein residue is deﬁned as a residue ˚ for DBD-Hunter on DB179 and NB3797 with at least one heavy atom within 4.5 A of a DNA functional group. TM-score Energy TP FN FP TN Precision Max. threshold threshold MCC Optimization of parameters for DNA-binding protein range prediction 0.74–1.00 1.1 52 8 6 63 0.90 0.78 The DNA-binding protein prediction method requires two 0.62–0.74 4.8 43 23 10 368 0.81 0.69 threshold parameters: the TM-score threshold and the 0.58–0.62 9.5 4 23 2 568 0.67 0.30 0.55–0.58 12.3 4 28 1 968 0.80 0.31 statistical potential energy threshold. Here, we present two 0.52–0.55 2.8 16 34 188 1496 0.08 0.11 strategies for optimizing these two parameters by max- 0.49–0.52 3.0 17 31 321 2029 0.05 0.09 imizing the Matthews correlation coeﬃcient (MCC) (33) 0.46–0.49 14.1 4 34 20 2676 0.17 0.12 of predictions on DB179 and NB3797. In the ﬁrst strategy, 0.43–0.46 2.3 22 16 981 2070 0.02 0.06 we simply search for the best parameter pair that gives the 0.40–0.43 13.7 2 14 36 2189 0.05 0.07 highest MCC. Supplementary Figure 2 shows the contour The optimal energy threshold that gives the highest MCC of pre- representation of MCC in threshold space. The best MCC dictions is listed for each TM-score range. When TM-scores are below of 0.64 is given by a TM-score threshold of 0.62 and an 0.55, the numbers of false positives greatly exceed the number of true energy threshold of 4.8, corresponding to a sensitivity positives at the maximum MCCs. Therefore, we set the minimum TM- score threshold at 0.55. The optimized threshold values adopted in this of 0.49 and a speciﬁcity of 0.997. The high (>0.60) MCC study were represented in bold. region is located within the TM-score threshold range from 0.53 to 0.67 and the energy threshold range from threshold. Otherwise, the target is classiﬁed as non-DNA- 10 to 2.5. As the TM-score threshold increases, the binding. When applying PSI-BLAST (version 2.2.17), optimal energy threshold corresponding to the highest up to four rounds of scanning on the NCBI-NR non- MCC at a given TM-score threshold increases as well. redundant protein sequence library (the July 2007 release) This can be easily understood: since structures with higher were performed to derive a position-speciﬁc sequence similarity are more likely to share a similar function, the proﬁle for each target sequence. An inclusion E-value energy criterion can be softened as the level of structural threshold of 0.001 and the default values for other similarity increases and vice versa. The observation leads to the second optimization arguments were employed. Using this proﬁle, a last PSI- strategy: rather than using a constant energy threshold, BLAST run was performed on the DB179 sequence the optimal energy threshold can be dependent on the library. If a target hits a template with an E-value higher value of the TM-score. Speciﬁcally, we divided the TM- than the speciﬁed threshold, the target is classiﬁed as a score range from 0.4 to 1.0 into nine regions (Table 1). DNA-binding protein; otherwise, it is classiﬁed as a non- Starting from template hits within the top region, we select DNA-binding protein. The default parameters were the optimal energy threshold that gives the highest MCC employed for the SS method. During the benchmark of predictions on DB179 and NB3797. Positive targets tests on DB179, APO104 and HOLO104, homologs with under the optimal energy criteria were removed from both global sequence identity >35% were excluded from the sets, and we re-run predictions on the reduced target sets template library. for the next TM-score region of template hits. The process The prediction outcome can be classiﬁed and counted was repeated for all nine TM-score regions and generated for each method. The numbers of true positives, false an optimal energy threshold for each region. However, for positives, true negatives and false negatives are designated TM-scores below 0.55, the number of false positives as TP, FP, TN and FN, respectively. Performance greatly exceeds the number of true positives at the measures are deﬁned as the following: maximum MCCs, which are invariably low (<0.15). TP Therefore, the minimum TM-score threshold is set at Sensitivity ¼ Recall ¼ 0.55 to reduce false positives. The list of optimized ðTP þ FNÞ parameters is provided in Table 1. Using these parameters, FP FPR ¼ a sensitivity of 0.58 and a speciﬁcity of 0.995 were ðTN þ FPÞ achieved on the training set DB179 and NB3797, with a TN corresponding MCC of 0.69. The optimal parameters were Specificity ¼ ðTN þ FPÞ used in validation tests on APO104/HOLO104 and the application on SG1697. ðTP þ TNÞ Accuracy ¼ ðTP þ FN þ TN þ FPÞ Assessment of DNA-binding protein prediction methods TP Precision ¼ We compared our prediction method with three diﬀerent ðTP þ FPÞ approaches: TM-align (9), PSI-BLAST (5) and the method ðTP TN FP FNÞ proposed by Szilagyi and Skolnick (denoted as the SS MCC ¼ pﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ðTP þ FNÞðTP þ FPÞðTN þ FPÞðTN þ FNÞ method) (20). The protein structures of DB179 were used as the template library for TM-align. When applying TM-align, a target is classiﬁed as a DNA-binding protein if it hits a template with a TM-score higher than a speciﬁed where FPR denotes false positive rate. 3982 Nucleic Acids Research, 2008, Vol. 36, No. 12 RESULTS We ﬁrst develop the statistical potential and examine its speciﬁcity to both protein sequences and DNA sequences. The statistical potential energy and structural similarity, two features used by our DNA–protein prediction method, were analyzed on DB179/NB3797. The performance of our method was assessed by leave-one-out cross-validation and compared to the three other methods described above. Conformational changes occurred during the apo-to-holo transition were subsequently studied for DNA-binding proteins. The performance of our method was tested on both apo- and holo-structure sets APO/HOLO104. Finally, a total of 1679 structural genomics targets were scanned for DNA-binding proteins as a real world test of the methodology. Statistical pair potential The pair potential parameters have been derived for 20 Figure 2. Distribution of the statistical potential energy for 179 DNA– amino acids and 14 nucleotide functional groups using the protein complexes in the self-consistent test. The insert shows the Boltzmann principle (Supplementary Figure 1). A total of potential energy calculated in a Jackknife test versus the energy calculated in the self-consistent test. 12 771 DNA–protein contacts observed in the nonredun- dant DNA–protein complex structure set DB179 have been considered. The positively charged amino acids Arg jackknife test is 94%, which is 3% lower than the self- and Lys are the most preferred contact partners by DNA consistent test. nucleotides. The result is expected due to the negative charge carried by DNA. The polar amino acids Asn, His, Specificity of the statistical potential Tyr, Gln, Thr and Ser are attracted to the DNA backbone phosphate and sugar groups, but are less preferred by base The requirement of favorable energy interaction of the groups. The hydrophobic residue Leu and positively native DNA–protein complex is a necessary, but not suﬀ- charged residue Glu have the most energetically unfavor- icient condition for characterizing speciﬁc recognitions able interactions with DNA nucleotides. In general, DNA between proteins and DNA. We further assess the speci- base groups have more speciﬁc interactions with amino ﬁcity of our potential parameter set by generating random acids than backbone groups. Imidazole groups, for DNA or protein sequences separately for each complex. example, are favored by only two to three amino acids. The DNA/protein speciﬁcity is measured by the Z-score One case of such favorable pairs is the guanine imidazole (Equation 3) of the native potential energy compared with group and Arg, which is expected because hydrogen energies calculated for random DNA/protein sequences. bonding between them has been frequently observed. By All energies were calculated with the individual jackknife comparison, most polar and positively charged amino parameter set corresponding to each target complex. acids are attracted to the phosphate and sugar groups. In the DNA speciﬁcity test, up to one million DNA A basic requirement for any good statistical potential is sequences were randomly generated for the interfacial the capability to characterize favorable energetic interac- DNA base pairs of each complex. Equal probabilities of tions, given a DNA–protein complex structure. To exam- 0.25 were assigned for the four types of nucleotides. For ine whether our potential meets this requirement, as structures with less than 10 DNA base pairs, we conducted shown in Figure 2, we performed both a self-consistent an exhaustive investigation of all possible combinations. test and a jackknife test on DB179. In the self-consistent As shown in Figure 3A, 109 proteins with speciﬁc DNA test, the interfacial energy E for a complex structure is recognition, e.g. transcription factors and restriction evaluated with the parameter set determined from all 179 endonucleases, have an average Z-score of 1.2, which complexes. The result shows that 97% of these complexes is one unit lower than the average Z-score of 0.2 for 70 have a favorable interfacial energy (E < 0). In the jack- proteins recognizing nonspeciﬁc DNA. The result demon- knife test, also termed leave-one-out cross-validation, the strates a modest Z-score diﬀerence between speciﬁc DNA target structure is excluded from the statistical potential recognition and nonspeciﬁc DNA recognition on average. derivation, and the interfacial energy for the target struc- In the protein speciﬁcity test, one million random ture is then calculated with the corresponding parameter protein sequences were generated for DNA-binding set. As shown in Figure 2, energies calculated from both protein residues of each complex. To avoid overrepresen- tests are closely correlated with a correlation coeﬃcient tation of rare amino acids, we assign the frequency of an >0.99. The average potential energy from the jackknife amino acid type observed in DB179 as the probability test is 24.2, which is 2.1 kT units higher than the of the corresponding amino acid in random sequences. average of 26.3 from the self-consistent test. The fraction As shown in Figure 3B, the mean and SD of the Z-score of complexes with favorable interfacial energy in the for native complex structures is 3.2 and 1.1. Only seven, Nucleic Acids Research, 2008, Vol. 36, No. 12 3983 Figure 3. (A) Distribution of native structure Z-scores among randomly generated DNA sequences. (B) Distribution of native structure Z-scores for randomly generated protein sequences. or 3.9%, of the complex structures have a Z-score higher energy value (E < 0). At an energy threshold of 4.8, the than 1.0. The result suggests that the statistical potential fraction of DNA-binding proteins satisfying the energy is reasonably speciﬁc to DNA-binding proteins. We can, criterion drops slightly to 49%, while only 0.003% (12) therefore, utilize the statistical potential as a feature to of non-DNA-binding protein’s top templates satisfy the discriminate DNA-binding proteins from non-DNA- same criteria. Use of the statistical potential dramatically binding proteins. reduces the number of false positive hits. We have not employed the Z-score of the target sequence/structure relative to the randomized target Characteristic features of DNA-binding proteins sequence aligned to the selected template, as we ﬁnd that For the purpose of discriminating DNA-binding from the performance is essentially the same as when the energy non-DNA-binding proteins, we consider two features: cutoﬀ is used. Since about 25% of DNA-binding residues structure similarity and statistical potential energy. In our are missed on average by the structural alignment and the approach, a target structure is scanned against the DNA sequence is that of the template, the average Z-score template library DB179 for similar structures with TM- of 2.1 for the top Z-score ranked template is not align. Using the TM-score as the structural similarity surprisingly larger (less negative) than that for the native metric, we identify templates that have a score higher than protein–DNA complex whose average is 3.2. Given the a given TM-score threshold. The statistical potential rather small range of Z-scores, this is the origin of the energy is then calculated for the target using the structural comparable performance as to whether an energy cutoﬀ alignment to a qualiﬁed template. The two features were or Z-score criterion is used. examined on DB179 and the non-DNA-binding set NB3797. In the test of DB179, the target structure was Assessment of DNA-binding protein prediction methods excluded from the template library for both the template scanning and the statistical potential derivation. By combining structural comparison with a statistical The distributions of the top TM-score-ranked template potential, we developed DBD-Hunter for DNA-binding for DB179 and NB3797 are shown in Figure 4A. About protein prediction (see Methods section for details). 93% of DNA-binding proteins and 70% of non-DNA- To assess the performance of our method, we compared binding proteins hit at least one template with a TM-score it with three other methods: the SS method, PSI-BLAST higher than a signiﬁcant value of 0.50. As one raises the and TM-align. Figure 5 shows the receiver operator value of the TM-score threshold, the fraction of non- characteristic (ROC) curves and precision-recall (PR) DNA-binding protein structures with a qualiﬁed template curves for benchmark tests on DB179 and NB3797. The decreases more rapidly than that of DNA-binding protein ROC and the PR curves of our method were obtained by structures. At a TM-score threshold of 0.62, only 10% of varying the energy threshold while ﬁxing the TM-score non-DNA-binding proteins have at least one template hit, threshold at 0.62. For the other three methods, the vari- whereas 68% of DNA-binding proteins satisfy this able used to obtain the ROC and PR curves are: the criterion. However, since the size of NB3797 is much threshold deﬁned in ref. (20) for SS, the E-value for PSI- larger than that of DB179, the absolute number of BLAST and the TM-score for TM-align. For comparison, positives from NB3797 is over three times the number of the results of DBD-Hunter using TM-score-dependent positives from DB179. optimized energy threshold, denoted as DBD-Hunter , opt To further reduce false positives, we use the statistical are also provided; the corresponding sensitivity of 58%, potential energy to re-rank the templates preselected from speciﬁcity of 99.5%, precision of 84% and MCC of 0.69 the structural alignment procedure. Figure 4B shows the are the best in our benchmark tests (Table 2). distribution of the top energy ranked template for DB179 Clearly, our method outperforms all other three and NB3797 at a TM-score threshold of 0.62. About 1.3% methods within the low FPR regime (<10%), which is (69) of non-DNA-binding proteins and 57% (102) of relevant for practical applications. The maximum MCCs DNA-binding proteins have a template with a favorable of the four methods are listed in Table 2. DBD-Hunter 3984 Nucleic Acids Research, 2008, Vol. 36, No. 12 Figure 4. Discriminatory feature analysis for DNA-binding and non-DNA-binding proteins. (A) Distribution of the highest TM-score-ranked template on DB179/NB3797. The numbers of template hit are given above the histogram bars. (B) Cumulative faction of the top energy-ranked template versus the statistical potential energy. Only templates higher than the TM-score threshold of 0.62 were considered. Figure 5. Performance comparison of methods for DNA-binding protein prediction. All tests were performed on DB179 and NB3797. The result obtained by DBD-Hunter using optimized threshold parameters is indicated by a star symbol. (A) ROC (sensitivity versus FPR) curves. (B)PR (precision versus sensitivity) curves. Table 2. Comparison of the maximum MCC by four DNA-binding The performance of the SS method is the worst among protein prediction methods on DB179/NB3797 these methods. We note that its FPR is much higher on NB3797 than previous reported FPRs on small control Method Max. MCC Sensitivity FPR Precision sets (20). The threshold used to obtain an FPR of 2% on smaller sets yield an FPR of 5.7% on NB3797. One DBD-Hunter 0.69 0.58 0.005 0.84 opt advantage of our method is that it delivers a high precision DBD-Hunter 0.64 0.49 0.003 0.87 TM-align 0.47 0.52 0.028 0.47 at a reasonable sensitivity. As shown in the PR plot PSI-BLAST 0.56 0.44 0.007 0.76 (Figure 5B), the precision of DBD-Hunter stays at a high SS 0.31 0.40 0.044 0.93 level above 88% for a sensitivity up to 50%. By comparison, none of the other three methods can achieve this level of precision at a sensitivity better than 30%. The achieves the highest maximum MCC of 0.69, compared high precision is important for application to targets on with 0.47 for TM-align, 0.56 for PSI-BLAST and 0.31 for a proteomic scale. SS. As shown in the ROC plot (Figure 5A), the sensitivity of our method jumps to 49% at a very low FPR of 0.3%, Prediction of DNA-binding residues on proteins then gradually increases to 60% at a FPR of 1.6% and Since DBD-Hunter identiﬁes a template for each target, it ﬁnally reaches a plateau of 68% at a FPR of 6.6%. The is attempting to infer DNA-binding sites from the 68% sensitivity limit is due to the TM-score threshold template, whose DNA-binding site is known. The most imposed. If one only considers structural similarity, inferior performance was obtained. For example, TM- straightforward way is to assign DNA-binding function to align gives a sensitivity of 50% with a FPR of 2.8%, which target residues aligned with DNA-binding residues of the is more than nine times the FPR of our method at the template. This approach was conducted on 103 proteins same sensitivity. PSI-BLAST is generally less sensitive predicted as DNA-binding proteins by DBD-Hunter using than the structure-based methods. At a permissive FPR the TM-score-dependent optimal thresholds. These pro- of 4%, PSI-BLAST recognizes about half of the targets. teins include 42 enzymes, 48 transcription factors and Nucleic Acids Research, 2008, Vol. 36, No. 12 3985 residues (15 versus 10 for both of the other two cases) and only one false positive. The second example involves the target, the DBD of catabolite gene activator (36), and the template, the DBD of replication terminator protein (Figure 7B) (37). They share a similar structure, the winged helix DBD, which is a common DNA-binding motif. In fact, the target hits 10 templates with a TM-score higher than 0.62. The top energy-ranked template has the lowest TM-score among these 10 templates, but it made the highest number of correct predictions for DNA-binding residues (12 of 14). In the third example, we examine the target, acute myeloid leukemia 1 protein RUNT domain (38), and the template, the DBD of p53 tumor suppressor (Figure 7C) (39). They closely resemble each other with a b-sandwich fold. Although the DNA-binding sites are located in a largely disordered region composed of two loops and two b-strands, the target was successfully predicted through the template. We note that the same template was hit by Figure 6. Performance on the prediction of DNA-binding residues. A 15 non-DNA-binding proteins above the TM-score total of 103 DNA-binding proteins predicted by DBD-Hunter were threshold of 0.62. Fourteen of these negative cases are examined. The lower, middle and upper quartiles of each box are 25th, correctly classiﬁed as non-DNA-binding by the energy 50th and 75th percentile, respectively. Whiskers extend to a distance of up to 1.5 times the interquartile range. Outliers and averages are criterion, because they exhibit repulsive energies. The only represented by circles and squares, respectively. exception, actinoxantin (PDB code 1acx_), belongs to an antitumour antibiotic chrompoprotein family, whose 13 other types of DNA-binding proteins. For each target members recruit chromophores that cleave DNA sub- structure, we make a binary prediction (DNA binding or strates (40). Although it is not clear whether actinoxantin non-DNA binding) on each residue aligned with the top itself binds to DNA, our prediction suggests that energy-ranked template. Performance measures, sensitiv- actinoxantin binds DNA and that this leads to subsequent ity, speciﬁcity, accuracy and precision were calculated for DNA cleavage by the chromophore. each target structure. The box plot of the results is shown Two restriction endonucleases, HinP1I (41) and MspI in Figure 6. On average, a sensitivity of 72%, a speciﬁcity (42), are presented in the fourth example. Both consist of two domains, aligned with an RMSD of 3.3 A, the largest of 93%, an accuracy of 90% and a precision of 71% were among these examples. The two enzymes extensively obtained. For 81% of the target structures, we achieved interact with DNA; there are 47 DNA-binding residues a precision >60%. The results imply that the closely in the target HinP1I and 36 in the template MspI. Our related target-template pairs were correctly identiﬁed in method successfully identiﬁed 22 DNA-binding residues most cases. and produced nine false positives. In the ﬁfth example, we investigate the target, tran- Examples of DNA-binding protein prediction scriptional repressor CopG (43), and the template, the DBD of methionine repressor MetJ (44). A DNA-binding Six examples of successful predictions by our method are motif, the so-called ribbon-helix-helix motif, is selected. illustrated in Figure 7A–F. In these examples, the The interfacial energy of 7.8 is relatively weak, mainly sequence identity between a target and its template due to the small number of DNA-binding residues ranges from 9% to 17%. The lack of sequence similarity involved. All seven DNA-binding residues of the target makes it diﬃcult for the sequence-based PSI-BLAST are correctly predicted. method to identify these templates. In fact, none of them In the last example, the target, the DBD of Epstein-Barr was hit by PSI-BLAST for the corresponding targets. nuclear antigen 1 (45), hits the template, the DBD of In the ﬁrst example (Figure 7A), the target, the bipartite human papillomavirus-18 E2 (46). The two virus proteins DBD of Tc3 transposase Tc3A (34), consists of two sub- share a low-sequence identity of 10%, yet have high- domains that belong to the homeodomain-like super- structural similarity with a TM-score of 0.75. Their family deﬁned in SCOP. The target hits three templates structure is a ferredoxin-like fold, which was found in above a TM-score threshold of 0.62. As expected, they all many non-DNA-binding proteins. In fact, 41 non-DB share a homeodomain-like structure with a classic HTH proteins from NB3797 hit the same template. All, but one, DNA-binding motif. Each template yielded an interfacial of these false hits were eliminated on the basis of the energy strong enough for making a positive prediction, interfacial energy. despite the fact that only one sub-domain of the target was aligned with the template. The best energy-ranked Conformational changes between apo- and holo-forms template, telomeric protein TRF1 DBD (35), has the lowest TM-score of 0.64 among these three templates, but For any structure-based method for DNA-binding protein it generated the most correct predictions of DNA-binding prediction, it is necessary to examine its performance 3986 Nucleic Acids Research, 2008, Vol. 36, No. 12 Figure 7. Examples of DNA-binding protein predictions on DB179. (A–F) Structural alignment of the target structure and the template in cartoon representations. In each panel, the left snapshot shows the overall alignment, together with the cocrystallized DNA molecules. The color codes for protein and DNA representations are red and purple for the target, and green and cyan for the template, respectively. The right snapshot highlights DNA-binding residues of both the target and the template in the same color code as the left snapshot. Non-DNA-binding residues of the target were dimmed in gray. For a clear view of the binding interface, the two snapshots were taken from diﬀerent orientations. In parentheses, each structure was labeled in the format of xxxxX, where xxxx is the four-digit PDB code and X is the chain identiﬁer of the protein. If the PDB record contains no chain identiﬁer, X is replaced with an underscore. Sequence identity (SID), TM-score (TMS), RMSD and the statistical potential energy E are provided at the bottom of each panel. Graphic images were made with the program VMD (62). on structures determined in the absence of DNA. The conformational changes in the structural aligned regions reason is that the conformational changes occurring on identiﬁed by TM-align. As shown in Supplementary DNA binding may aﬀect the accuracy of the method. To Figure 3, the majority (70%) of pairs have a RMSD global ˚ ˚ address this issue, we have collected 104 pairs of apo- and <3A, but a few (14%) have an RMSD >5A. The global holo-form DNA-binding proteins (APO104/HOLO104) latter are mainly due to ﬂexible termini or relative domain and analyzed their conformational changes. Two RMSD movement of multiple-domain proteins (see examples metrics were calculated: RMSD measures the overall below). If we consider structural alignment instead, the global conformational changes by superposing the two forms in corresponding RMSD is <5A for all pairs and is within TM the sequence aligned regions; and RMSD measures the 3A for 89% of pairs. The corresponding coverage of the TM Nucleic Acids Research, 2008, Vol. 36, No. 12 3987 structural alignment is usually high, better than 90% of the correctly as DNA-binding proteins by PSI-BLAST. DBD- shorter chain for 87% of the pairs (Supplementary Figure 3 Hunter, therefore, is about 60% more sensitive than PSI- insert). The results reveal that apo-to-holo conformational BLAST on APO104. A much more permissive E-value of changes are mostly localized with a RMSD <3A for more 0.001 generates 45 hits for PSI-BLAST, which is still 10% than 70% of DNA-binding proteins. less than the correct predictions made by our method. Prediction of DNA binding using apo-structures Examples of DNA-binding protein prediction on apo-structures We further benchmarked our method on APO104/ HOLO104 using the optimized threshold parameters Four positive predictions on APO104 are illustrated in determined on DB179/NB3797. For a given target, any Figure 9A–D. In these examples, the target apo-forms and template with sequence identity >35% was excluded from their holo-forms exhibit large RMSD values ranging globe the template library and the statistical potential derivation from 3 to 19 A. Despite these signiﬁcant conformational in our tests. As shown in Figure 8A, about the same changes upon DNA binding, the apo-structures were number of APO104 and HOLO104 members hit at least successfully predicted as DNA-binding proteins. one template above the TM-score threshold of 0.52, 94 for The ﬁrst example is the bacteriophage integrase HOLO104 versus 95 for APO104. However, the distribu- protein, a tyrosine recombinase possessing two DBDs, tions of the best structural templates are somewhat the catalytic domain and the central domain (Figure 9A). diﬀerent. The holo target set hit more closely resembled The latter domain is missing in the apo-structure (47), but is present in the holo-structure (48). In the apo-to-holo templates than the apo set did. The latter has 28% less transition, dramatic movement occurred at the C-terminal templates above the TM-score cutoﬀ of 0.68 than the former. In particular, nine holo queries have one template region (residues 331–356) of the catalytic domain, which with a TM-score better than 0.88, but no apo-structure brought a crucial catalytic residue Tyr342 in contact with has a template with such a high level structural similarity. a scissile phosphate of DNA from a distance of 20 A away. Despite the relatively lower structural similarity to their The movement made the major contribution to the large templates, APO104 yielded only slightly fewer number of RMSD of 10 A, because the remainder (residues 170– global correct predictions as HOLO104 by DBD-Hunter. The 330) of the catalytic domain is virtually unchanged with an number of positive predictions is 57 for the holo set and 49 RMSD of 0.4 A. It is the static core region of the target TM for the apo set. These numbers correspond to a sensitivity that allows a hit to a template, the N-terminal domain of 55% for HOLO104 and 47% for APO104, compared of Flp recombinase (49). The target and the template have with the value of 58% observed for DB179. DNA-binding a high TM-score of 0.71, in spite of a low-sequence residues were further inferred from the top-ranked identity of 12%. Major DNA-binding sites, including the template for predicted DNA-binding proteins from the catalytic triad of Arg212-His308-Arg311, were correctly apo/holo sets (Figure 8B), except for target 2frhA that has identiﬁed as DNA-binding residues. Based on the strong a controversial DNA-binding site (see examples below). interfacial energy of 24, a positive prediction was made for the target apo-structure. On average, the predictions yield sensitivities of 68%/ The second example is the Max protein, a transcription 66%, speciﬁcities of 93%/93%, accuracies of 89%/87% and precisions of 67%/66% for HOLO104/APO104. factor from the basic/HLH/zipper (bHLH-Zip) family of Our method was compared with PSI-BLAST on DNA-binding proteins (Figure 9B). Members of this APO104. For a fair comparison, an E-value of 1E-8.5 family form a stable dimeric structure when complexed was chosen for PSI-BLAST such that it provided a similar with DNA, but they are notoriously diﬃcult to stabilize precision rate (82%) to DBD-Hunter (precision rate 84%) under DNA-free conditions. The NMR structures of on DB179/NB3797. Only 31 apo-structures were identiﬁed the apo-form were determined after cross-linking two Figure 8. Prediction of DNA-binding interactions for APO104 and HOLO104. (A) Distribution of the top TM-score ranked templates. Using the statistical potential energy threshold parameters optimized on the benchmark set DB179, DBD-Hunter predicted 48 and 57 targets of DNA-binding proteins for APO104 and HOLO104, respectively. For each predicted DNA-binding protein, DNA-binding residues were predicted. The performance measures of these predictions were shown in (B). The box plots are drawn as in Figure 6. 3988 Nucleic Acids Research, 2008, Vol. 36, No. 12 monomers at the C-termini and introducing two stabiliz- factor of activated T cell (55). Despite such a dramatic ing point mutations (50). As shown in Figure 9B, the apo- in-block movement of the C-terminal domain, p65 was structure closely resembles the holo-form (51), except for correctly classiﬁed as a DNA-binding protein because the ﬁrst 14 N-terminal amino acids of the basic region, most DNA-binding residues located in its N-terminal which are unfolded in the apo-structure but form a helix in domain were correctly identiﬁed through the template. the presence of DNA. Nevertheless, half of the 14 DNA- The last example, the protein SarA, is the most intriguing (Figure 9D). The apo-structure (56) of the binding residues are aligned with DNA-binding residues of the template from the sterol regulatory element binding single-domain transcription factor adopts a dramatically protein (52), and the apo-structure is correctly classiﬁed diﬀerent topology from its holo-structure (57). The as a DNA-binding protein. RMSD is 19 A between these two structures. A nota- global The third example is the p65 subunit (also known as ble diﬀerence is a winged HTH motif present in the apo- RelA) of nuclear factor-kB (Figure 9C). The p65 subunit structure but missing in the holo-form, which instead has consists of two b-sandwich domains connected by a a unique DNA-binding motif. However, it has been sug- ﬂexible 10 amino acid linker. In the DNA-bound form gested that the apo-structure represents the native form of of p65, the N-terminal domain provides most of the DNA- SarA and that the unique DNA-binding mode observed in binding residues, while the C-terminal domain interacts the holo-structure is due to crystallization artifacts (56). with the p50 subunit (not shown) to form a heterodimer In our test, the winged HTH motif of the apo-structure complex (53). The DNA-binding activity of p65 can be was predicted to be DNA-binding by three templates inhibited by IkBa, a protein recognizing p65 that induces (1qbjB, 1sfuA and 1cpgA). In particular, Arg90 is a domain rotation of p65 (54). As shown in Figure 9C, the predicted to be a DNA-binding residue. This is consis- N-terminal domains from the apo- and the holo-structures tent with the mutagenesis study (56), which shows that overlap, whereas the C-terminal domain undergoes about the residue is critical to the DNA-binding function of the a408 rotation around the interdomain linker. Similar SarA. Overall, our prediction provides evidence for the conformational changes have also been observed in the hypothesis that the apo-form structure is functionally alignment of the target to the template NFAT1, a nuclear relevant. Figure 9. Examples of DNA-binding protein prediction on APO104. (A–D) In each panel, the left snapshot shows the structural alignment of the apo-structure and its corresponding holo-structure, and the right snapshot shows the alignment of the target apo-structure versus its template. The apo-, holo- and template structures are colored in red, blue and green, respectively. In (B), all proteins are composed of two monomers. The monomer studied is shown in solid color, while the other monomer is dimmed. PDB codes are given in parentheses. Nucleic Acids Research, 2008, Vol. 36, No. 12 3989 Application to structural genomics targets endonuclease IV due to an altered Zn-binding site (58). DBD-Hunter identiﬁed an endonuclease IV template for Finally, we have applied our method to 1697 protein 1i60A with a high TM-score of 0.76 and predicts that the structures of unknown function determined by the target is non-DNA-binding based on a repulsive statistical structural genomics initiative. The optimized threshold potential energy. Since all but four DNA-binding targets parameters corresponding to a precision of 84% were predicted by DBD-Hunter have sequence identity <25% to employed for this application. A total of 37 targets pre- their templates, it is diﬃcult for PSI-BLAST to identify dicted to be DNA-binding proteins are listed in Table 3. these targets due to low-sequence similarity. In fact, only Fourteen of these targets were previously hypothesized to nine positive predictions are common to both methods. have a function associated with DNA binding, such as Among targets predicted by PSI-BLAST but missed by transcription factor activity. Three targets (1nogA, 1t6sA DBD-Hunter, only one target has a putative function and 1vbkA) have a putative function not related to DNA related to DNA binding. In principle, one can combine binding. These three predictions are probably false these two methods to improve sensitivity. positives. The putative function of the remaining 20 targets was not assigned. By comparison, PSI-BLAST predicted 29 targets as DNA-binding proteins using an E-value of DISCUSSION 1E-8.5, which corresponds to a similar precision of 82% in benchmark tests. Among PSI-BLAST predictions, eight The main goal of the current study is to develop a know- targets have a putative DNA-binding function and two ledge-based method for predicting DNA-binding proteins targets have a putative function not related to DNA and associated DNA-binding residues from structural binding. One (1i60A) of the latter two targets has the fold genomics targets. For this purpose, the method of endonuclease IV, a DNA repair enzyme, but it has had to satisfy three conditions: First, it must be capable been proposed to have a function other than that of of predicting DNA-binding proteins that have low or no Table 3. A list of structural genomics targets predicted as DNA-binding proteins from SG1697 Target Template TM-score RMSD SID Energy Putative function 1j27A 2bdpA 0.63 2.40 0.035 5.4 UK 1nogA 1sknP 0.58 3.28 0.043 15.9 NB 1s7oA 1gdtA 0.67 1.46 0.116 7.1 DB 1sfxA 1h0mD 0.59 2.76 0.155 22.7 DB 1t6sA 1u8rJ 0.65 2.25 0.14 7.3 NB 1tuaA 1jj4A 0.55 2.37 0.123 14.8 UK 1vbkA 1jj4A 0.63 2.21 0.188 5.0 NB 1wi9A 1qbjB 0.68 2.31 0.133 12.2 UK 1wj5A 1qbjB 0.70 2.16 0.177 16.0 UK 1x58A 1w0tA 0.87 1.22 0.275 3.8 DB 1xg7A 2bzfA 0.59 3.46 0.096 13.4 UK 1z7uA 1h0mD 0.61 2.51 0.138 15.6 DB 1zelA 1qbjB 0.61 2.65 0.183 12.1 UK 2da4A 1pufA 0.64 1.72 0.211 29.1 UK 2dceA 1qbjB 0.59 2.14 0.179 10.2 UK 2e1oA 1jkqC 0.55 2.91 0.143 21.4 UK 2eshA 1cgpA 0.62 1.56 0.137 19.4 UK 2esnA 1u8rJ 0.62 2.21 0.121 12.2 DB 2ethA 1u8rJ 0.71 1.84 0.186 17.3 DB 2f2eA 1sfuA 0.65 1.89 0.143 11.7 DB 2ﬁuA 1jj4A 0.66 2.50 0.057 4.9 UK 2fmlA 1qbjB 0.57 2.44 0.075 12.7 UK 2fnaA 1qbjB 0.66 1.83 0.204 5.0 UK 2fyxA 2a6oB 0.79 1.67 0.289 20.9 DB 2hytA 1jt0A 0.75 1.61 0.212 14.1 DB 2g7uA 1cgpA 0.69 1.72 0.20 15.2 DB 2iaiA 1jt0A 0.64 3.36 0.25 5.4 DB 2jn6A 1gdtA 0.70 2.16 0.14 10.7 UK 2nr3A 1sfuA 0.70 2.34 0.103 10.3 UK 2nx4A 1jt0A 0.76 2.07 0.143 2.6 DB 2od5A 1gdtA 0.58 2.36 0.122 9.7 DB 2p8tA 1h0mD 0.57 2.27 0.123 16.8 UK 2pg4A 1z9cF 0.72 2.05 0.181 8.5 UK 2qc0A 1sfuA 0.65 2.18 0.111 9.6 UK 2qvoA 1sfuA 0.72 1.94 0.043 10.5 UK 3b73A 1cgpA 0.65 1.54 0.204 23.7 UK 3bddA 1cgpA 0.65 1.48 0.259 17.4 DB RMSD and sequence identity were calculated for the structurally aligned region between the target and the template with TM- align. Targets are labeled according to their putative function annotated in their PDB records: DB (DNA-binding), NB (function not related to DNA-binding) and UK (unknown). 3990 Nucleic Acids Research, 2008, Vol. 36, No. 12 sequence similarity (<35% identity) to their templates. In previous machine-learning studies (16,17,19,20), If a target has high-sequence similarity (>40%) to any the sizes of the non-DNA-binding protein sets used for template, typically it can be detected using a sequence- training were small, typically ranging from 100 to 250 based method such as PSI-BLAST, which is computa- structures. This raises the concerns that the discriminatory tionally more eﬃcient than structure-based approaches. features may be over-trained and that the FPR may be Second, the method must have a very low FPR because under-estimated as a result. For example, we tested the SS only a small fraction of proteins bind DNA. Assuming method (20) on a much larger control dataset NB3797. that a method with a 10% FPR and 90% sensitivity is Indeed, we found that the FPR on NB3797 is much higher applied to a target set, 10% of which are DNA-binding than that on smaller size data sets. The previously proteins, these numbers translate into a precision rate of reported FPR of 2% increases to a FPR of 5.7% on the about 50%. That is, half of the predictions are false larger set. Since similar features such as the composition positives, which is generally unacceptable for systematic of amino acids and/or the charge/dipole distribution have application on thousands of genomics targets. Third, the also been used in the other studies (16,17,19), the FPRs method has to be validated on structures in the DNA-free reported in these studies should be viewed with caution. form, since all target structures with unknown DNA- A major concern with the use of structure-based methods binding function are solved without DNA present. And is whether discriminatory features derived from holo- the concern that DNA-binding proteins undergo con- structures are transferable to apo-form structures. Two formational changes during the apo-to-holo transition has previous studies have examined performance on small sets to be addressed. In the current study, we have demon- of apo-structures, 52 targets in ref. (20) and 11 targets in ref. strated that DBD-Hunter satisﬁes all three conditions. In (19), and reported similar performance on both holo and benchmark tests, it consistently outperforms the three apo sets. Here, we constructed much larger apo/holo sets other knowledge-based methods: the sequence-based composed of 104 targets for validation. We found that the method PSI-BLAST, the structural-based method TM- sensitivities of our method on these two sets are very close, align and the SS method, which uses both sequence and being just 8% lower on the apo set. The small diﬀerence can structural information. Furthermore, we applied DBD- be understood from structural comparison analysis. 89% Hunter to 1697 structural genomics targets and predicted of apo–holo pairs have an RMSD of <3A in their that 37 proteins bind DNA. structurally aligned region (typically >90% coverage), The current method employs two features, structural which is consistent with the suggestion that the conforma- similarity and the statistical potential energy, for the tional changes of DBDs are mostly localized (59). Notable purpose of discriminating DNA-binding proteins from conformational changes can be categorized into two major non-DNA-binding proteins. Since protein structures with types: (i) refolding at the DNA-binding interface, e.g. the similar function are more likely conserved than their basic region of the leucine-zipper-like protein Max sequences (10), this allows us to identify a target that has (Figure 9B), and (ii) domain reorientation of multiple- low-sequence similarity but high-structural similarity to a domain proteins, e.g. p65 of NF-kB (Figure 9C). The homologous template. In tests on DB179 and APO104, the conformational changes observed at the binding interface structural alignment procedure identiﬁes 60% more DNA- may cause diﬃculty for approaches using strict DNA- binding proteins than PSI-BLAST does. Six pairs of target/ binding motif comparisons. The HTH motif searching template examples from DB179 are given in Figure 7. method, for example, requires an RMSD of <1.6 A bet- Invariably, they have low-sequence identify (<17%) but ween the target and the template (60). Our method is less high-structural similarity (TM-score 0.62). In addition, restrictive because structural comparison is performed for the vast majority of negatives were ﬁltered out during the the entire DBD, the core region of which may have structural comparison procedure. In the test on NB3797, relatively small conformational changes. This is reﬂected 65% negatives were eliminated by structure alone. by the similar performance on APO/HOLO104 sets, a few To achieve high accuracy, however, structural similarity examples of which are provided in Figure 9. In one rare to known DNA-binding proteins is not enough. We note case, the transcription factor SarA adopts diﬀerent folds in that the one-third of non-DNA-binding proteins from the apo- and holo-forms (56,57). Surprisingly, a winged NB3797 have a signiﬁcant structural alignment to DNA- HTH DNA-binding motif was observed in the apo- binding proteins with a TM-score higher than 0.55. structure but not in the holo-structure. Our method To further reduce false positives, an interfacial potential correctly identiﬁed the DNA-binding region of the apo- has been introduced. The potentials are speciﬁc to DNA- structure, including an Arginine crucial to the DNA- binding proteins with an average Z-score of 3.2 in the binding function of SarA. The holo-structure, however, did randomized sequence test. By requiring that the target not yield a positive prediction. Our results support the view structure not only be structurally similar to a known DNA- that the apo-form of SarA more likely represents the native binding protein but that it also has a favorable interfacial conformation (56). potential, we reduced the number of false positives from One advantage of a template-based approach is that one 1327 to 19 in the test on NB3797, corresponding to an can infer functionally relevant details from the template. extremely low FPR of 0.5%. By comparison, FPRs ranging For example, speciﬁc DNA-binding sites can be trans- from 5% to 20% were reported in previous studies ferred from the template. In this respect, DBD-Hunter (16,17,19,20). Due to the reasons mentioned above, these achieves an average sensitivity of 66% and an average high FPRs limit the potential application of these methods accuracy of 87% on predicted DNA-binding proteins to structural genomics targets. in their DNA-free conformation forms (Figure 8B). Nucleic Acids Research, 2008, Vol. 36, No. 12 3991 2. Lee,D., Redfern,O. and Orengo,C. (2007) Predicting protein By comparison, machine-learning algorithms designed function from sequence and structure. Nat. Rev. Mol. Cell Biol., 8, speciﬁcally for DNA-binding site prediction provide an 995–1005. average accuracy ranging from 60% to 82% (21–23). 3. Watson,J.D., Laskowski,R.A. and Thornton,J.M. (2005) Predicting Worldwide structural genomics centers have deposited protein function from sequence and structural data. Curr. Opin. Struct. Biol., 15, 275–284. thousands of protein structures in the PDB. It is of great 4. Whisstock,J.C. and Lesk,A.M. (2003) Prediction of protein function importance to characterize the functions of these targets. from protein sequence and structure. Q. Rev. Biophys., 36, 307–340. With respect to DNA-binding protein prediction, only one 5. Altschul,S.F., Madden,T.L., Schaﬀer,A.A., Zhang,J.H., Zhang,Z., method has been applied to structural genomics targets so Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI- far, despite numerous methods proposed. In their report, BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Jones et al. predicted one DNA-binding protein from 30 6. Krogh,A., Brown,M., Mian,I.S., Sjolander,K. and Haussler,D. targets using a structural motif-based approach (60), (1994) Hidden markov models in computational biology – which is limited to DBDs with a HTH motif. In the applications to protein modeling. J. Mol. Biol., 235, 1501–1531. current study, we have applied our method to 1697 7. Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. structural genomics targets and identiﬁed 37 potential Protein Eng., 11, 739–747. DNA-binding proteins. To our knowledge, this is the ﬁrst 8. Holm,L. and Sander,C. (1993) Protein structure comparison by time a structural-based method for DNA-binding protein alignment of distance matrices. J. Mol. Biol., 233, 123–138. prediction has been systematically applied on a genome 9. Zhang,Y. and Skolnick,J. (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res., 33, scale. These predictions provide valuable clues for 2302–2309. assessing protein function experimentally, and it is of 10. Chothia,C. and Lesk,A.M. (1986) The relation between the great interest to conduct experimental validations of these divergence of sequence and structure in proteins. EMBO J., 5, predictions in the future. 823–826. Like all knowledge-based approaches, our method 11. Skolnick,J. and Fetrow,J.S. (2000) From genes to protein structure and function: novel applications of computational is limited by the completeness of the template library. approaches in the genomic era. Trends Biotechnol., 18, 34–39. It cannot predict DNA-binding proteins with novel struc- 12. Luscombe,N.M., Austin,S.E., Berman,H.M. and Thornton,J.M. tures or binding modes that are not included in the (2000) An overview of the structures of protein-DNA complexes. template library, which is the main disadvantage of the Genome Biol., 1, REVIEWS001. 13. Kono,H. and Sarai,A. (1999) Structure-based prediction of DNA current approach. target sites by regulatory proteins. Proteins Struct. Funct. Genet., 35, Future work entails the extension of the methodology 114–131. to the case when the structure of the protein is not yet 14. Luscombe,N.M., Laskowski,R.A. and Thornton,J.M. (2001) Amino solved but has to be predicted. Here, an unanswered acid-base interactions: a three-dimensional analysis of protein-DNA question is how good the predicted structure has to be to interactions at an atomic level. Nucleic Acids Res., 29, 2860–2874. 15. Mandel-Gutfreund,Y. and Margalit,H. (1998) Quantitative para- provide for the accurate prediction of DNA binding. meters for amino acid-base interaction: implications for prediction Another issue is to attempt to model the conformation of protein-DNA binding sites. Nucleic Acids Res., 26, 2306–2312. change on DNA binding. While this is not crucial for the 16. Ahmad,S. and Sarai,A. (2004) Moment-based prediction of DNA- majority of known DNA-binding proteins, it is important binding proteins. J. Mol. Biol., 341, 65–71. 17. Bhardwaj,N., Langlois,R.E., Zhao,G.J. and Lu,H. (2005) Kernel- for a signiﬁcant minority of cases. One promising based machine learning protocol for predicting DNA-binding approach is the extension of TASSER, a protein structure proteins. Nucleic Acids Res., 33, 6486–6493. prediction algorithm (61), to include DNA binding. Thus, 18. Shanahan,H.P., Garcia,M.A., Jones,S. and Thornton,J.M. (2004) while DBD-Hunter is a promising approach, additional Identifying DNA-binding proteins using structural motifs and the electrostatic potential. Nucleic Acids Res., 32, 4732–4741. extensions and improvements are required to increase its 19. Stawiski,E.W., Gregoret,L.M. and Mandel-Gutfreund,Y. (2003) range of applicability. Annotating nucleic acid-binding function based on protein structure. J. Mol. Biol., 326, 1065–1079. 20. Szilagyi,A. and Skolnick,J. (2006) Eﬃcient prediction of nucleic acid binding function from low-resolution protein structures. J. Mol. SUPPLEMENTARY DATA Biol., 358, 922–933. Supplementary Data are available at NAR Online. 21. Ahmad,S., Gromiha,M.M. and Sarai,A. (2004) Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics, 20, 477–486. ACKNOWLEDGEMENTS 22. Bhardwaj,N. and Lu,H. (2007) Residue-level prediction of DNA- binding sites and its application on DNA-binding protein predic- We thank Dr Shashi Pandit for stimulating discussions. tions. FEBS Lett., 581, 1058–1066. This work was supported in part by NIH grant GM- 23. Kuznetsov,I.B., Gou,Z.K., Li,R. and Hwang,S.W. (2006) Using evolutionary and structural information to predict DNA-binding 37408. Funding to pay the Open Access publication sites on DNA-binding proteins. Proteins Struct. Funct. Bioinform., charges for this article was provided by NIH. 64, 19–27. 24. Donald,J.E., Chen,W.W. and Shakhnovich,E.I. (2007) Energetics of Conﬂict of interest statement. None declared. protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047. 25. Liu,Z.J., Mao,F.L., Guo,J.T., Yan,B., Wang,P., Qu,Y.X. and Xu,Y. (2005) Quantitative evaluation of protein-DNA interactions using an optimized knowledge-based potential. Nucleic Acids Res., REFERENCES 33, 546–558. 1. Burley,S.K. (2000) An overview of structural genomics. Nat. Struct. 26. Robertson,T.A. and Varani,G. (2007) An all-atom, distance- Biol., 7, 932–934. dependent scoring function for the prediction of protein-DNA 3992 Nucleic Acids Research, 2008, Vol. 36, No. 12 interactions from structure. Proteins Struct. Funct. Bioinform., 66, 44. Garvie,C.W. and Phillips,S.E.V. (2000) Direct and indirect readout 359–374. in mutant Met repressor-operator complexes. Structure, 8, 905–914. 27. Myers,E.W. and Miller,W. (1988) Optimal alignments in linear 45. Bochkarev,A., Bochkareva,E., Frappier,L. and Edwards,A.M. space. Comput. Appl. Biosci., 4, 11–17. (1998) The 2.2 angstrom structure of a permanganate-sensitive 28. Hubbard,T.J.P., Ailey,B., Brenner,S.E., Murzin,A.G. and DNA site bound by the Epstein-Barr virus origin binding protein, Chothia,C. (1998) SCOP, structural classiﬁcation of proteins EBNA1. J. Mol. Biol., 284, 1273–1278. database: applications to evaluation of the eﬀectiveness of 46. Kim,S.S., Tam,J.K., Wang,A.F. and Hegde,R.S. (2000) The sequence alignment methods and statistics of protein structural structural basis of DNA target discrimination by papillomavirus E2 data. Acta Crystallogr. D Biol. Crystallogr., 54, 1147–1154. proteins. J. Biol. Chem., 275, 31245–31254. 29. Skolnick,J., Kihara,D. and Zhang,Y. (2004) Development and large 47. Kwon,H.J., Tirumalai,R., Landy,A. and Ellenberger,T. (1997) scale benchmark testing of the PROSPECTOR_3 threading Flexibility in DNA recombination: structure of the lambda algorithm. Proteins Struct. Funct. Bioinform., 56, 502–518. integrase catalytic core. Science, 276, 126–131. 30. Li,W.Z. and Godzik,A. (2006) CD-HIT: a fast program for 48. Aihara,H., Kwon,H.J., Nunes-Duby,S.E., Landy,A. and clustering and comparing large sets of protein or nucleotide Ellenberger,T. (2003) A conformational switch controls the DNA sequences. Bioinformatics, 22, 1658–1659. cleavage activity of lambda integrase. Mol. Cell, 12, 793. 31. Sippl,M.J. (1995) Knowledge-based potentials for proteins. Curr. 49. Conway,A.B., Chen,Y. and Rice,P.A. (2003) Structural plasticity Opin. Struct. Biol., 5, 229–235. of the Flp-Holliday junction complex. J. Mol. Biol., 326, 425–434. 32. Lu,H., Lu,L. and Skolnick,J. (2003) Development of uniﬁed 50. Sauve,S., Tremblay,L. and Lavigne,P. (2004) The NMR solution statistical potentials describing protein-protein interactions. Biophys. structure of a mutant of the max b/HLH/LZ free of DNA: insights J., 84, 1895–1901. into the speciﬁc and reversible DNA binding mechanism of dimeric 33. Matthews,B.W. (1975) Comparison of predicted and observed transcription factors. J. Mol. Biol., 342, 813–832. secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 51. Nair,S.K. and Burley,S.K. (2003) X-ray structures of Myc-Max and 405, 442–451. Mad-Max recognizing DNA: molecular bases of regulation by 34. Watkins,S., van Pouderoyen,G. and Sixma,T.K. (2004) Structural proto-oncogenic transcription factors. Cell, 112, 193–205. analysis of the bipartite DNA-binding domain of Tc3 transposase 52. Parraga,A., Bellsolell,L., Ferre-D’Amare,A.R. and Burley,S.K. bound to transposon DNA. Nucleic Acids Res., 32, 4306–4312. (1998) Co-crystal structure of sterol regulatory element binding 35. Court,R., Chapman,L., Fairall,L. and Rhodes,D. (2005) How the protein 1a at 2.3 angstrom resolution. Structure, 6, 661–672. human telomeric proteins TRF1 and TRF2 recognize telomeric 53. Chen,F.E., Huang,D.B., Chen,Y.Q. and Ghosh,G. (1998) Crystal DNA: a view from high-resolution crystal structures. EMBO Rep., structure of p50/p65 heterodimer of transcription factor NF-kappa 6, 39–45. B bound to DNA. Nature, 391, 410–413. 36. Schultz,S.C., Shields,G.C. and Steitz,T.A. (1991) Crystal structure 54. Huxford,T., Huang,D.B., Malek,S. and Ghosh,G. (1998) of a CAP-DNA Complex – the DNA is bent by 90 degrees. Science, The crystal structure of the I kappa B alpha/NF-kappa B 253, 1001–1007. complex reveals mechanisms of NF-kappa B inactivation. Cell, 37. Wilce,J.A., Vivian,J.P., Hastings,A.F., Otting,G., Folmer,R.H.A., 95, 759–770. Duggin,I.G., Wake,R.G. and Wilce,M.C.J. (2001) Structure of the 55. Giﬃn,M.J., Stroud,J.C., Bates,D.L., von Koenig,K.D., Hardin,J. RTP-DNA complex and the mechanism of polar replication fork and Chen,L. (2003) Structure of NFAT1 bound as a dimer to the arrest. Nat. Struct. Biol., 8, 206–210. HIV-1 LTR kappa B element. Nat. Struct. Biol., 10, 800–806. 38. Tahirov,T.H., Inoue-Bungo,T., Morii,H., Fujikawa,A., Sasaki,M., 56. Liu,Y.F., Manna,A.C., Pan,C.H., Kriksunov,I.A., Thiel,D.J., Kimura,K., Shiina,M., Sato,K., Kumasaka,T., Yamamoto,M. et al. Cheung,A.L. and Zhang,G.Y. (2006) Structural and function (2001) Structural analyses of DNA recognition by the AML1/ analyses of the global regulatory protein SarA from Staphylococcus Runx-1 Runt domain and its allosteric control by CBF beta. Cell, aureus. Proc. Natl Acad. Sci. USA, 103, 2392–2397. 104, 755–767. 57. Schumacher,M.A., Hurlburt,B.K. and Brennan,R.G. (2001) 39. Cho,Y.J., Gorina,S., Jeﬀrey,P.D. and Pavletich,N.P. (1994) Crystal Crystal structures of SarA, a pleiotropic regulator of virulence genes structure of a P53 tumor suppressor DNA complex – understanding in S-aureus. Nature, 409, 215–219. tumorigenic mutations. Science, 265, 346–355. 58. Zhang,R.G., Dementieva,I., Duke,N., Collart,F., Quaite- 40. Tanaka,T., Fukuda-Ishisaka,S., Hirama,M. and Otani,T. (2001) Randall,E., Alkire,R., Dieckman,L., Maltsev,N., Korolev,O. and Solution structures of C-1027 apoprotein and its complex with the Joachimiak,A. (2002) Crystal structure of Bacillus subtilis IolI aromatized chromophore. J. Mol. Biol., 309, 267–283. shows endonuclase IV fold with altered Zn binding. Proteins Struct. 41. Horton,J.R., Zhang,X., Maunus,R., Yang,Z., Wilson,G.G., Funct. Genet., 48, 423–426. Roberts,R.J. and Cheng,X.D. (2006) DNA nicking by HinP1I 59. Wright,P.E. and Dyson,H.J. (1999) Intrinsically unstructured endonuclease: bending, base ﬂipping and minor groove expansion. proteins: re-assessing the protein structure-function paradigm. Nucleic Acids Res., 34, 939–948. J. Mol. Biol., 293, 321–331. 42. Xu,Q.S., Roberts,R.J. and Guo,H.C. (2005) Two crystal forms of 60. Jones,S., Barker,J.A., Nobeli,I. and Thornton,J.M. (2003) Using the restriction enzyme MspI-DNA complex show the same novel structural motif templates to identify proteins with DNA binding structure. Protein Sci., 14, 2590–2600. function. Nucleic Acids Res., 31, 2811–2823. 43. Costa,M., Sola,M., del Solar,G., Eritja,R., Hernandez- 61. Zhang,Y. and Skolnick,J. (2004) Automated structure prediction Arriaga,A.M., Espinosa,M., Gomis-Ruth,F.X. and Coll,M. (2001) of weakly homologous proteins on a genomic scale. Proc. Natl Plasmid transcriptional repressor CopG oligomerises to render Acad. Sci. USA, 101, 7594–7599. helical superstructures unbound and in complexes with 62. Humphrey,W., Dalke,A. and Schulten,K. (1996) VMD: visual oligonucleotides. J. Mol. Biol., 310, 403–417. molecular dynamics. J. Mol. Graph., 14, 33–38. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/dbd-hunter-a-knowledge-based-method-for-the-prediction-of-dna-protein-bXC60s0GOo

Loading next page...

References (68)

A. Szilágyi, J. Skolnick (2006)
Efficient prediction of nucleic acid binding function from low-resolution protein structures.
Journal of molecular biology, 358 3
M. Giffin, J. Stroud, D. Bates, Konstanze Koenig, J. Hardin, Lin Chen (2003)
Structure of NFAT1 bound as a dimer to the HIV-1 LTR κB element
Nature Structural Biology, 10
H. Aihara, H. Kwon, S. Nunes-Düby, A. Landy, T. Ellenberger (2003)
A Conformational Switch Controls the DNA Cleavage Activity of λ Integrase
Molecular Cell, 12
H. Kono, A. Sarai (1999)
Structure‐based prediction of DNA target sites by regulatory proteins
Proteins: Structure, 35
J. Horton, Xing Zhang, R. Maunus, Zhe Yang, G. Wilson, R. Roberts, Xiaodong Cheng (2006)
DNA nicking by HinP1I endonuclease: bending, base flipping and minor groove expansion
Nucleic Acids Research, 34
F. Chen, De-bin Huang, Yong-qing Chen, G. Ghosh (1998)
Crystal structure of p50/p65 heterodimer of transcription factor NF-κB bound to DNA
Nature, 391
David Lee, Oliver Redfern, C. Orengo (2007)
Predicting protein function from sequence and structure
Nature Reviews Molecular Cell Biology, 8
W. Humphrey, A. Dalke, K. Schulten (1996)
VMD: visual molecular dynamics.
Journal of molecular graphics, 14 1
Shandar Ahmad, M. Gromiha, A. Sarai (2004)
Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information
Bioinformatics, 20 4
Q. Xu, R. Roberts, Hwai-Chen Guo (2005)
Two crystal forms of the restriction enzyme MspI–DNA complex show the same novel structure
Protein Science, 14
T. Tahirov, T. Inoue-Bungo, H. Morii, A. Fujikawa, M. Sasaki, K. Kimura, M. Shiina, Ko Sato, T. Kumasaka, Masaki Yamamoto, S. Ishii, K. Ogata (2001)
Structural Analyses of DNA Recognition by the AML1/Runx-1 Runt Domain and Its Allosteric Control by CBFβ
Cell, 104
N. Luscombe, R. Laskowski, J. Thornton (2001)
Amino acid?base interactions: a three-dimensional analysis of protein?DNA interactions at an atomic level
Nucleic acids research, 29 13
Pei-ji Chen, Z. Dong, S. Zhen (1998)
An exceptionally well-preserved theropod dinosaur from the Yixian Formation of China
Nature, 391
T. Tanaka, S. Fukuda-Ishisaka, M. Hirama*, T. Otani (2001)
Solution structures of C-1027 apoprotein and its complex with the aromatized chromophore.
Journal of molecular biology, 309 1
Yang Zhang, J. Skolnick (2004)
Automated structure prediction of weakly homologous proteins on a genomic scale.
Proceedings of the National Academy of Sciences of the United States of America, 101 20
T. Huxford, De-bin Huang, S. Malek, G. Ghosh (1998)
The Crystal Structure of the IκBα/NF-κB Complex Reveals Mechanisms of NF-κB Inactivation
Cell, 95
R. Zhang, I. Dementieva, N. Duke, F. Collart, E. Quaite-Randall, R. Alkire, L. Dieckman, N. Maltsev, O. Korolev, A. Joachimiak (2002)
Crystal structure of Bacillus subtilis ioli shows endonuclase IV fold with altered Zn binding
Proteins: Structure, 48
Yunje Cho, S. Gorina, P. Jeffrey, N. Pavletich (1994)
Crystal structure of a p53 tumor suppressor-DNA complex: understanding tumorigenic mutations.
Science, 265 5170
A. Párraga, L. Bellsolell, A. Ferré-D’Amaré, S. Burley (1998)
Co-crystal structure of sterol regulatory element binding protein 1a at 2.3 A resolution.
Structure, 6 5
B. Matthews (1975)
Comparison of the predicted and observed secondary structure of T4 phage lysozyme.
Biochimica et biophysica acta, 405 2
J. Skolnick, J. Fetrow (2000)
From genes to protein structure and function: novel applications of computational approaches in the genomic era.
Trends in biotechnology, 18 1
J. Wilce, J. Vivian, A. Hastings, G. Otting, R. Folmer, I. Duggin, R. Wake, M. Wilce (2001)
Structure of the RTP–DNA complex and the mechanism of polar replication fork arrest
Nature Structural Biology, 8
H. Kwon, R. Tirumalai, A. Landy, T. Ellenberger (1997)
Flexibility in DNA Recombination: Structure of the Lambda Integrase Catalytic Core
, 276
A. Conway, Yu Chen, P. Rice (2003)
Structural plasticity of the Flp-Holliday junction complex.
Journal of molecular biology, 326 2
interactions from structure. Proteins Struct. Funct. Bioinform
M. Schumacher, B. Hurlburt, R. Brennan (2001)
Crystal structures of SarA, a pleiotropic regulator of virulence genes in S. aureus
Nature, 409
N. Bhardwaj, Hui Lu (2007)
Residue‐level prediction of DNA‐binding sites and its application on DNA‐binding protein predictions
FEBS Letters, 581
Seung-Sup Kim, Jeffrey Tam, Ai-Fei Wang, R. Hegde (2000)
The Structural Basis of DNA Target Discrimination by Papillomavirus E2 Proteins*
The Journal of Biological Chemistry, 275
N. Bhardwaj, R. Langlois, Guijun Zhao, Hui Lu (2005)
Kernel-based machine learning protocol for predicting DNA-binding proteins
Nucleic Acids Research, 33
S. Nair, S. Burley (2003)
X-Ray Structures of Myc-Max and Mad-Max Recognizing DNA Molecular Bases of Regulation by Proto-Oncogenic Transcription Factors
Cell, 112
S. Sauvé, L. Tremblay, P. Lavigne (2004)
The NMR solution structure of a mutant of the Max b/HLH/LZ free of DNA: insights into the specific and reversible DNA binding mechanism of dimeric transcription factors.
Journal of molecular biology, 342 3
S. Watkins, G. Pouderoyen, T. Sixma (2004)
Structural analysis of the bipartite DNA-binding domain of Tc3 transposase bound to transposon DNA.
Nucleic acids research, 32 14
M. Costa, M. Solà, G. Solar, R. Eritja, A. Hernández-Arriaga, M. Espinosa, F. Gomis-Rüth, M. Coll (2001)
Plasmid transcriptional repressor CopG oligomerises to render helical superstructures unbound and in complexes with oligonucleotides.
Journal of molecular biology, 310 2
J. Watson, R. Laskowski, J. Thornton (2005)
Predicting protein function from sequence and structural data.
Current opinion in structural biology, 15 3
L. Holm, C. Sander (1993)
Protein structure comparison by alignment of distance matrices.
Journal of molecular biology, 233 1
Yingfang Liu, A. Manna, C. Pan, I. Kriksunov, D. Thiel, A. Cheung, Gongyi Zhang (2006)
Structural and function analyses of the global regulatory protein SarA from Staphylococcus aureus
Proceedings of the National Academy of Sciences of the United States of America, 103
E. Myers, W. Miller (1988)
Optimal alignments in linear space
Computer applications in the biosciences : CABIOS, 4 1
Timothy Robertson, G. Varani (2006)
An all‐atom, distance‐dependent scoring function for the prediction of protein–DNA interactions from structure
Proteins: Structure, 66
Yang Zhang, J. Skolnick (2005)
TM-align: a protein structure alignment algorithm based on the TM-score
Nucleic Acids Research, 33
M. Giffin, J. Stroud, D. Bates, Konstanze Koenig, J. Hardin, Lin Chen (2003)
Structure of NFAT1 bound as a dimer to the HIV-1 LTR kappa B element.
Nature structural biology, 10 10
P. Wright, H. Dyson (1999)
Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm.
Journal of molecular biology, 293 2
S. Burley (2000)
An overview of structural genomics
Nature Structural Biology, 7 Suppl 1
A. Bochkarev, A. Bochkarev, E. Bochkareva, L. Frappier, Aled Edwards (1998)
The 2.2 A structure of a permanganate-sensitive DNA site bound by the Epstein-Barr virus origin binding protein, EBNA1.
Journal of molecular biology, 284 5
I. Kuznetsov, Zhenkun Gou, Run Li, Seungwoo Hwang (2006)
Using evolutionary and structural information to predict DNA‐binding sites on DNA‐binding proteins
Proteins: Structure, 64
M. Sippl (1995)
Knowledge-based potentials for proteins.
Current opinion in structural biology, 5 2
J. Skolnick, D. Kihara, Yang Zhang (2004)
Development and large scale benchmark testing of the PROSPECTOR_3 threading algorithm
Proteins: Structure, 56
Susan Jones, Jonathan Barker, I. Nobeli, J. Thornton (2003)
Using structural motif templates to identify proteins with DNA binding function.
Nucleic acids research, 31 11
Hui Lu, Long Lu, J. Skolnick (2003)
Development of unified statistical potentials describing protein-protein interactions.
Biophysical journal, 84 3
C. Chothia, A. Lesk (1986)
The relation between the divergence of sequence and structure in proteins.
The EMBO Journal, 5
Tim Hubbard, Bart Ailey, Steven Brenner, A. Murzin, Cyrus Chothia (1998)
SCOP, Structural Classification of Proteins database: applications to evaluation of the effectiveness of sequence alignment methods and statistics of protein structural data.
Acta crystallographica. Section D, Biological crystallography, 54 Pt 6 Pt 1
Jason Donald, William Chen, E. Shakhnovich (2006)
Energetics of protein–DNA interactions
Nucleic Acids Research, 35
H. Shanahan, Mario Garcı́a, Susan Jones, J. Thornton (2004)
Identifying DNA-binding proteins using structural motifs and the electrostatic potential.
Nucleic acids research, 32 16
N. Luscombe, S. Austin, H. Berman, J. Thornton (2000)
An overview of the structures of protein-DNA complexes
Genome Biology, 1
C. Garvie, C. Garvie, S. Phillips (2000)
Direct and indirect readout in mutant Met repressor-operator complexes.
Structure, 8 9
S. Altschul, Thomas Madden, A. Schäffer, Jinghui Zhang, Zheng Zhang, W. Miller, D. Lipman (1997)
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic acids research, 25 17
I. Shindyalov, P. Bourne (1998)
Protein structure alignment by incremental combinatorial extension (CE) of the optimal path.
Protein engineering, 11 9
Y. Mandel-Gutfreund, H. Margalit (1998)
Quantitative parameters for amino acid-base interaction: implications for prediction of protein-DNA binding sites.
Nucleic acids research, 26 10
E. Stawiski, L. Gregoret, Y. Mandel-Gutfreund (2003)
Annotating nucleic acid-binding function based on protein structure.
Journal of molecular biology, 326 4
Weizhong Li, A. Godzik (2006)
Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences
Bioinformatics, 22 13
Shandar Ahmad, A. Sarai (2004)
Moment-based prediction of DNA-binding proteins.
Journal of molecular biology, 341 1
H. Aihara, H. Kwon, S. Nunes-Düby, A. Landy, T. Ellenberger (2003)
Erratum: A conformational switch controls the DNA cleavage activity of λ integrase (Molecular Cell (July 2003) 12 (187-198))
Molecular Cell, 12
J. Whisstock, A. Lesk (2003)
Prediction of protein function from protein sequence and structure
Quarterly Reviews of Biophysics, 36
Genome Biol
Anders Krogh, Michael Brown, I. Mian, Kimmen Sjölander, David Haussler (1993)
Hidden Markov models in computational biology. Applications to protein modeling.
Journal of molecular biology, 235 5
R. Court, L. Chapman, L. Fairall, D. Rhodes (2005)
How the human telomeric proteins TRF1 and TRF2 recognize telomeric DNA: a view from high‐resolution crystal structures
EMBO reports, 6
H. Aihara, H. Kwon, S. Nunes-Düby, A. Landy, T. Ellenberger (2003)
A conformational switch controls the DNA cleavage activity of lambda integrase.
Molecular cell, 12 1
SC Schultz, GC Shields, T. Steitz (1991)
Crystal structure of a CAP-DNA complex: the DNA is bent by 90 degrees
Science, 253
Zhijie Liu, Fenglou Mao, Jun-tao Guo, Bo Yan, Pengju Wang, Youxing Qu, Ying Xu (2005)
Quantitative evaluation of protein–DNA interactions using an optimized knowledge-based potential
Nucleic Acids Research, 33

Publisher: Oxford University Press
Copyright: © 2008 The Author(s)
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/gkn332
pmid: 18515839
Publisher site: See Article on Publisher Site

Abstract

3978–3992 Nucleic Acids Research, 2008, Vol. 36, No. 12 Published online 31 May 2008 doi:10.1093/nar/gkn332 DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions Mu Gao and Jeffrey Skolnick* Center for the Study of Systems Biology, School of Biology, Georgia Institute of Technology, 250 14th Street NW, Atlanta, GA 30318, USA Received April 4, 2008; Revised May 5, 2008; Accepted May 8, 2008 ABSTRACT targets already deposited in the PDB (http://targetdb. pdb.org/). Since many targets are representatives of pre- The structures of DNA–protein complexes have illu- viously uncharacterized protein families, the function of a minated the diversity of DNA–protein binding large number of these proteins is unknown. Identifying mechanisms shown by different protein families. their function is an important challenge. In recent years, This lack of generality could pose a great challenge many computational methods have been developed to for predicting DNA–protein interactions. To address assist in functional annotation (2–4). Compared to experi- this issue, we have developed a knowledge-based mental studies, computational methods have the advan- method, DNA-binding Domain Hunter (DBD-Hunter), tage of high eﬃciency and low cost. Most are based on the idea of functional inference through homology. for identifying DNA-binding proteins and associated While sequence comparison methods (5,6) are very power- binding sites. The method combines structural com- ful (7–9), they may oﬀer limited help with the task of parison and the evaluation of a statistical potential, assigning functions for structural genomics targets which we derive to describe interactions between because many have low-sequence similarity to previously DNA base pairs and protein residues. We demon- characterized proteins. Structure-based methods may strate that DBD-Hunter is an accurate method for provide additional clues to a protein’s function because predicting DNA-binding function of proteins, and that structure is better conserved than sequence (10). However, DNA-binding protein residues can be reliably inferred since a common fold may be shared by proteins with very from the corresponding templates if identified. In diﬀerent functions, it remains a challenge to infer protein benchmark tests on 4000 proteins, our method function on the basis of structure alone (11). achieved an accuracy of 98% and a precision of 84%, An area where protein structure could potentially be which significantly outperforms three previous useful is in the identiﬁcation of DNA-binding proteins. methods. We further validate the method on DNA- Such proteins play an essential role in a cell and are binding protein structures determined in DNA-free involved in transcription, replication, packaging, repair and rearrangement. It has been estimated that 2–3% of (apo) state. We show that the accuracy of our method prokaryotic proteins and 6–7% of eukaryotic proteins is only slightly affected on apo-structures compared bind DNA (12). To understand the basic rules, many to the performance on holo-structures cocrystallized eﬀorts have investigated the patterns of DNA–protein with DNA. Finally, we apply the method to 1700 interactions observed in the structures of DNA–protein structural genomics targets and predict that 37 tar- complexes (13–15). In contrast to the strict base pairing gets with previously unknown function are likely to rule observed in double-stranded DNA (dsDNA), it is be DNA-binding proteins. DBD-Hunter is freely avail- now clear that there is no simple code for protein–DNA able at http://cssb.biology.gatech.edu/skolnick/ recognition. Instead, numerous mechanisms are exhibited webservice/DBD-Hunter/. by diﬀerent protein families (12). Given the structure of a protein whose function is unknown, one wishes to answer the following questions: INTRODUCTION (i) Does this protein have a DNA-binding function? With the progress of structural genomics projects, an (ii) If so, where are its DNA-binding sites? (iii) What increasing number of protein structures have become speciﬁc DNA sequences, if any, does the protein recog- nize? To address the ﬁrst problem, several knowledge- available (1). As of early February 2008, a total of 160 792 based approaches have been developed (16–20). Shanahan proteins have been registered as targets by structural genomics centers worldwide, with the structures of 5396 et al. (18) used structural comparison to detect three types *To whom correspondence should be addressed. Tel: +1 404 407 8975; Fax: +1 404 385 7478; Email: [email protected] 2008 The Author(s) This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. Nucleic Acids Research, 2008, Vol. 36, No. 12 3979 of well-deﬁned DNA-binding structural motifs, helix-turn- all X-ray structures of protein–DNA complexes with helix (HTH), helix-hairpin-helix (HhH) and helix-loop- resolution better than 3.0 A. The resulting 1045 complex helix (HLH), which appear in 30% of 54 structural structures were further partitioned into chains and families based on an earlier classiﬁcation of DNA–protein analyzed. We only consider protein–DNA complexes complex structures (12). Sophisticated machine-learning that satisfy the following three conditions: (i) the protein methods have also been attempted (16,17,19,20). These has a minimum of 40 amino acids; (ii) the DNA molecule utilize techniques such as neural networks (16,19), logistic is dsDNA with at least six base pairs and (iii) the protein regression (20) and support vector machines (17). Features has at least ﬁve protein residues within 4.5 A of the DNA used by these methods include the composition of amino molecule. The analysis led to 1676 DNA-binding protein acids, the charge/dipole moment of the protein molecule chains. An all-against-all global sequence alignment was and the presence of positively charged surface patches. On performed on these protein chains using the ALIGN0 average, these studies reported sensitivities ranging from program (27) from the FASTA2 package. Pairwise 70% to 90% and speciﬁcities ranging from 65% to 95%. sequence identity is deﬁned as the ratio of the number of To address the second question, similar machine-learning identical residues over the length of the shorter sequence. methods have been developed to predict individual DNA- We used the number of DNA–protein contacts, structure binding residues on proteins (21–23), with an average resolution and the available literature to select only accuracy ranging from 65% to 80%. To address the third one representative among protein chains with larger problem, statistical models have been introduced to than 35% sequence identity, leading to a nonredundant characterize the speciﬁcity of DNA sequences for a given set such that any two protein chains have <35% sequence DNA-binding protein (13,24–26), with a few successful identity. SCOP annotations (28) and visual inspection examples reported. were used to identify the DNA-binding domain (DBD) Since DNA-binding proteins likely comprise only a small for each protein chain. The resulted 179 DNA-binding fraction of structural genomics targets, for practical protein domains and associated DNA chains are listed in applications, it is necessary to develop a method with Supplementary Table 1. high precision. Otherwise, the number of false positives NB3797. A control set of 3797 non-DNA-binding protein could easily outnumber the true positives, rending such an chains (NB3797) was created from a nonredundant set approach is impractical for automatic function assignment. All machine-learning methods mentioned above, however, of 7037 protein structures, which comprise the reported relatively low speciﬁcities for non-DNA-binding PROSPECTOR threading template library built from proteins (16,17,19,20). In addition, these rates were the May 2006 PDB release by clustering all PDB protein obtained on small sets of less than 250 structures. It is not structures using a 35% global sequence identity cutoﬀ and clear whether similar speciﬁcity would be obtained on a choosing one representative from each cluster (29). From much larger data set of thousands of structures, a scenario these 7037 chains, we select the 5636 proteins with SCOP relevant to proteomic-scale applications. To address these annotations. We further discarded any chain if its PDB issues, we describe a new method, DNA-binding Domain record contains a DNA molecule or its SCOP annotation Hunter (DBD-Hunter), for the prediction of DNA-binding contains the keyword ‘hypothetical’. Protein chains were proteins and associated DNA-binding sites. The method manually inspected to determine whether its SCOP super- uses both structural comparisons and a DNA–protein family, family and domain annotations contain the key- statistical potential to assess whether or not a given protein word ‘DNA’. Those whose function is associated with binds DNA. We demonstrate that DBD-Hunter achieves DNA-binding were removed. All ribosomal proteins were an extremely high speciﬁcity of 99.5% and a precision of also excluded. For each of the resulting 3911 protein 84% in benchmark tests, with a sensitivity of 47% obtained chains, the keyword ‘DNA’ was searched in the title, on unbound and a sensitivity of 58% on DNA bound abstract and keyword sections of its primary citation. protein structures. By way of illustration as to its appli- Positive hits were inspected by reading the literature to cability, we apply our method to 1697 structural genomics exclude DNA-binding proteins. The ﬁnal 3797 protein targets and predict that 37 previously unknown targets are chains compose NB3797. DNA-binding proteins. APO104/HOLO104. A total of 104 pairs of DNA- binding protein structures determined both in the absence METHODS and presence of DNA were constructed from the PDB May 2007 release. The PDB was queried to retrieve two Availability sets of proteins: the holo structure set consists of 759 All datasets listed below, the statistical potential para- proteins cocrystallized with a dsDNA molecule; the apo meters and a web-server implementation of DBD-Hunter set comprises 35 899 crystal/NMR structures determined are available at http://cssb.biology.gatech.edu/skolnick/ without any DNA. An all-against-all sequence alignment ﬁles/. between the two sets was performed following the same procedure described above. The alignment procedure Data sets resulted in 679 holo–apo pairs, which have sequence DB179. A set of 179 DNA–protein complex structures identity >35% between each pair. The 679 holo chains (DB179) were selected by the following procedure: were further culled by excluding redundant sequences The July 2007 release of the PDB was queried to retrieve with an identity cutoﬀ of 35%. One representative was 3980 Nucleic Acids Research, 2008, Vol. 36, No. 12 selected among proteins with pairwise sequence identity protein residue type and a reduced DNA atom is deﬁned as: >35%, using literature and structure resolution informa- obs tion for guidance. The ﬁnal results are 104 holo-structures e ¼ ln 1 exp and their corresponding apo-structures, denoted as HOLO104 and APO104, respectively. Most of these obs where N is the number of observed contacts for the ad holo–apo pairs have high-sequence identity, 100% for 62 exp pair and N is the number of expected contacts assuming exp pairs and >95% for 91 pairs. The remaining 13 pairs, no preferential interaction. The reference state, N , which have sequence identities ranging from 45% to 95% is deﬁned by the product of the total number of observed are composed of apo–holo homologs from the same obs contacts N and the mole fractions of a and d, namely total SCOP family. exp obs N ¼ N f f 2 total SG1697. A set of structural genomics targets was selected The statistical potential energy E (also named the from the Jan 2008 PDB release. The PDB was queried interfacial energy) of a DNA–protein complex structure with the classiﬁcation keyword ‘structural genomics’, is deﬁned as the sum of pair interactions for all protein– resulting in 1886 PDB entries. These were split into DNA contacts. protein chains, which were further clustered at a 90% The Z-score of a native complex structure is deﬁned as: sequence identity cutoﬀ with the program CD-HIT (30). From each cluster, we randomly select one representative. E E nat ave Z-score ¼ 3 These 1697 representatives compose the set SG1697. Statistical pair potential where E and s are the average and SD of the statistical ave potential energies of all random structures, and E is the nat To derive a statistical pair potential for describing DNA– statistical potential energy of the native complex. In speci- protein interactions, we consider a contact between a ﬁcity tests, random structures were obtained by replacing protein residue and a functional group of DNA deﬁned as interfacial DNA or protein residues with random nucleo- that when at least one heavy atom from the protein resi- tides or amino acids. We assume that a contact is formed due is within 4.5 A of at least one heavy atom from the in a random structure at the same location as observed corresponding DNA functional group. Four types of in the native structure. Since the imidazole group only functional groups were considered for DNA nucleotides appears in purines, we ignore any contact involving (Figure 1). Pyrimidines C and T have the phosphate (PP), a purine imidazole group if the purine is replaced by a the sugar (SU) and the pyrimidine (PY) groups. In addi- pyrimidine. tion, purines A and G have a fourth group, the imidazole (IM) group. Note that all DNA groups are residue DNA-binding interaction prediction protocol speciﬁc. To diﬀerentiate them, we add their nucleotide names as preﬁxes, e.g. A.PP represents the phosphate Using structural alignment and the statistical potential, group of an adenine. we developed a new method, DBD-Hunter, to predict A knowledge-based statistical pair potential was devel- DNA–protein interactions, given a target structure. First, oped from an analysis in the DNA–protein complex set the target structure is scanned against the template library DB179 (Supplementary Figure 1). The derivation of the DB179 for similar protein structures with the structural potential was based on the assumption that the frequencies alignment program TM-align (9). Only Ca backbone of observed pair interaction states follow a Boltzmann coordinates are used for the structural alignment and distribution (31,32). The pair interaction energy between for root mean squared deviation (RMSD) calculations. A TM-score >0.4 indicates signiﬁcant structural align- ment (9). To reduce the number of false positives, we employed the higher TM-score threshold of 0.55 for template selection (see below). For templates with a TM- score better than the threshold, the statistical potential energy between the target protein and the template DNA is calculated by evaluating contacts within the structurally aligned regions. The contact evaluation follows the similar procedure adopted in structure threading by replacing original template protein residues with corresponding aligned template residues (29). The templates are then ranked according to their interfacial energies. If the lowest interfacial energy is below a (to be determined) energy threshold, the target is predicted to be a DNA-binding Figure 1. Scheme of DNA nucleotide functional groups considered for protein. If no template is found in either the structural protein–DNA contact analysis. The phosphate, sugar, pyrimidine and alignment or satisfying the energy criterion, the target imidazole groups are colored in green, orange, red and blue, is classiﬁed as a non-DNA-binding protein. For proteins respectively. Note that the functional groups are base speciﬁc, e.g. predicted as a DNA-binding protein, we further infer the phosphate group of adenine is diﬀerent from the phosphate group of thymine. DNA-binding protein residues from their templates. Nucleic Acids Research, 2008, Vol. 36, No. 12 3981 Table 1. Optimization of TM-score and energy threshold parameters A DNA-binding protein residue is deﬁned as a residue ˚ for DBD-Hunter on DB179 and NB3797 with at least one heavy atom within 4.5 A of a DNA functional group. TM-score Energy TP FN FP TN Precision Max. threshold threshold MCC Optimization of parameters for DNA-binding protein range prediction 0.74–1.00 1.1 52 8 6 63 0.90 0.78 The DNA-binding protein prediction method requires two 0.62–0.74 4.8 43 23 10 368 0.81 0.69 threshold parameters: the TM-score threshold and the 0.58–0.62 9.5 4 23 2 568 0.67 0.30 0.55–0.58 12.3 4 28 1 968 0.80 0.31 statistical potential energy threshold. Here, we present two 0.52–0.55 2.8 16 34 188 1496 0.08 0.11 strategies for optimizing these two parameters by max- 0.49–0.52 3.0 17 31 321 2029 0.05 0.09 imizing the Matthews correlation coeﬃcient (MCC) (33) 0.46–0.49 14.1 4 34 20 2676 0.17 0.12 of predictions on DB179 and NB3797. In the ﬁrst strategy, 0.43–0.46 2.3 22 16 981 2070 0.02 0.06 we simply search for the best parameter pair that gives the 0.40–0.43 13.7 2 14 36 2189 0.05 0.07 highest MCC. Supplementary Figure 2 shows the contour The optimal energy threshold that gives the highest MCC of pre- representation of MCC in threshold space. The best MCC dictions is listed for each TM-score range. When TM-scores are below of 0.64 is given by a TM-score threshold of 0.62 and an 0.55, the numbers of false positives greatly exceed the number of true energy threshold of 4.8, corresponding to a sensitivity positives at the maximum MCCs. Therefore, we set the minimum TM- score threshold at 0.55. The optimized threshold values adopted in this of 0.49 and a speciﬁcity of 0.997. The high (>0.60) MCC study were represented in bold. region is located within the TM-score threshold range from 0.53 to 0.67 and the energy threshold range from threshold. Otherwise, the target is classiﬁed as non-DNA- 10 to 2.5. As the TM-score threshold increases, the binding. When applying PSI-BLAST (version 2.2.17), optimal energy threshold corresponding to the highest up to four rounds of scanning on the NCBI-NR non- MCC at a given TM-score threshold increases as well. redundant protein sequence library (the July 2007 release) This can be easily understood: since structures with higher were performed to derive a position-speciﬁc sequence similarity are more likely to share a similar function, the proﬁle for each target sequence. An inclusion E-value energy criterion can be softened as the level of structural threshold of 0.001 and the default values for other similarity increases and vice versa. The observation leads to the second optimization arguments were employed. Using this proﬁle, a last PSI- strategy: rather than using a constant energy threshold, BLAST run was performed on the DB179 sequence the optimal energy threshold can be dependent on the library. If a target hits a template with an E-value higher value of the TM-score. Speciﬁcally, we divided the TM- than the speciﬁed threshold, the target is classiﬁed as a score range from 0.4 to 1.0 into nine regions (Table 1). DNA-binding protein; otherwise, it is classiﬁed as a non- Starting from template hits within the top region, we select DNA-binding protein. The default parameters were the optimal energy threshold that gives the highest MCC employed for the SS method. During the benchmark of predictions on DB179 and NB3797. Positive targets tests on DB179, APO104 and HOLO104, homologs with under the optimal energy criteria were removed from both global sequence identity >35% were excluded from the sets, and we re-run predictions on the reduced target sets template library. for the next TM-score region of template hits. The process The prediction outcome can be classiﬁed and counted was repeated for all nine TM-score regions and generated for each method. The numbers of true positives, false an optimal energy threshold for each region. However, for positives, true negatives and false negatives are designated TM-scores below 0.55, the number of false positives as TP, FP, TN and FN, respectively. Performance greatly exceeds the number of true positives at the measures are deﬁned as the following: maximum MCCs, which are invariably low (<0.15). TP Therefore, the minimum TM-score threshold is set at Sensitivity ¼ Recall ¼ 0.55 to reduce false positives. The list of optimized ðTP þ FNÞ parameters is provided in Table 1. Using these parameters, FP FPR ¼ a sensitivity of 0.58 and a speciﬁcity of 0.995 were ðTN þ FPÞ achieved on the training set DB179 and NB3797, with a TN corresponding MCC of 0.69. The optimal parameters were Specificity ¼ ðTN þ FPÞ used in validation tests on APO104/HOLO104 and the application on SG1697. ðTP þ TNÞ Accuracy ¼ ðTP þ FN þ TN þ FPÞ Assessment of DNA-binding protein prediction methods TP Precision ¼ We compared our prediction method with three diﬀerent ðTP þ FPÞ approaches: TM-align (9), PSI-BLAST (5) and the method ðTP TN FP FNÞ proposed by Szilagyi and Skolnick (denoted as the SS MCC ¼ pﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ ðTP þ FNÞðTP þ FPÞðTN þ FPÞðTN þ FNÞ method) (20). The protein structures of DB179 were used as the template library for TM-align. When applying TM-align, a target is classiﬁed as a DNA-binding protein if it hits a template with a TM-score higher than a speciﬁed where FPR denotes false positive rate. 3982 Nucleic Acids Research, 2008, Vol. 36, No. 12 RESULTS We ﬁrst develop the statistical potential and examine its speciﬁcity to both protein sequences and DNA sequences. The statistical potential energy and structural similarity, two features used by our DNA–protein prediction method, were analyzed on DB179/NB3797. The performance of our method was assessed by leave-one-out cross-validation and compared to the three other methods described above. Conformational changes occurred during the apo-to-holo transition were subsequently studied for DNA-binding proteins. The performance of our method was tested on both apo- and holo-structure sets APO/HOLO104. Finally, a total of 1679 structural genomics targets were scanned for DNA-binding proteins as a real world test of the methodology. Statistical pair potential The pair potential parameters have been derived for 20 Figure 2. Distribution of the statistical potential energy for 179 DNA– amino acids and 14 nucleotide functional groups using the protein complexes in the self-consistent test. The insert shows the Boltzmann principle (Supplementary Figure 1). A total of potential energy calculated in a Jackknife test versus the energy calculated in the self-consistent test. 12 771 DNA–protein contacts observed in the nonredun- dant DNA–protein complex structure set DB179 have been considered. The positively charged amino acids Arg jackknife test is 94%, which is 3% lower than the self- and Lys are the most preferred contact partners by DNA consistent test. nucleotides. The result is expected due to the negative charge carried by DNA. The polar amino acids Asn, His, Specificity of the statistical potential Tyr, Gln, Thr and Ser are attracted to the DNA backbone phosphate and sugar groups, but are less preferred by base The requirement of favorable energy interaction of the groups. The hydrophobic residue Leu and positively native DNA–protein complex is a necessary, but not suﬀ- charged residue Glu have the most energetically unfavor- icient condition for characterizing speciﬁc recognitions able interactions with DNA nucleotides. In general, DNA between proteins and DNA. We further assess the speci- base groups have more speciﬁc interactions with amino ﬁcity of our potential parameter set by generating random acids than backbone groups. Imidazole groups, for DNA or protein sequences separately for each complex. example, are favored by only two to three amino acids. The DNA/protein speciﬁcity is measured by the Z-score One case of such favorable pairs is the guanine imidazole (Equation 3) of the native potential energy compared with group and Arg, which is expected because hydrogen energies calculated for random DNA/protein sequences. bonding between them has been frequently observed. By All energies were calculated with the individual jackknife comparison, most polar and positively charged amino parameter set corresponding to each target complex. acids are attracted to the phosphate and sugar groups. In the DNA speciﬁcity test, up to one million DNA A basic requirement for any good statistical potential is sequences were randomly generated for the interfacial the capability to characterize favorable energetic interac- DNA base pairs of each complex. Equal probabilities of tions, given a DNA–protein complex structure. To exam- 0.25 were assigned for the four types of nucleotides. For ine whether our potential meets this requirement, as structures with less than 10 DNA base pairs, we conducted shown in Figure 2, we performed both a self-consistent an exhaustive investigation of all possible combinations. test and a jackknife test on DB179. In the self-consistent As shown in Figure 3A, 109 proteins with speciﬁc DNA test, the interfacial energy E for a complex structure is recognition, e.g. transcription factors and restriction evaluated with the parameter set determined from all 179 endonucleases, have an average Z-score of 1.2, which complexes. The result shows that 97% of these complexes is one unit lower than the average Z-score of 0.2 for 70 have a favorable interfacial energy (E < 0). In the jack- proteins recognizing nonspeciﬁc DNA. The result demon- knife test, also termed leave-one-out cross-validation, the strates a modest Z-score diﬀerence between speciﬁc DNA target structure is excluded from the statistical potential recognition and nonspeciﬁc DNA recognition on average. derivation, and the interfacial energy for the target struc- In the protein speciﬁcity test, one million random ture is then calculated with the corresponding parameter protein sequences were generated for DNA-binding set. As shown in Figure 2, energies calculated from both protein residues of each complex. To avoid overrepresen- tests are closely correlated with a correlation coeﬃcient tation of rare amino acids, we assign the frequency of an >0.99. The average potential energy from the jackknife amino acid type observed in DB179 as the probability test is 24.2, which is 2.1 kT units higher than the of the corresponding amino acid in random sequences. average of 26.3 from the self-consistent test. The fraction As shown in Figure 3B, the mean and SD of the Z-score of complexes with favorable interfacial energy in the for native complex structures is 3.2 and 1.1. Only seven, Nucleic Acids Research, 2008, Vol. 36, No. 12 3983 Figure 3. (A) Distribution of native structure Z-scores among randomly generated DNA sequences. (B) Distribution of native structure Z-scores for randomly generated protein sequences. or 3.9%, of the complex structures have a Z-score higher energy value (E < 0). At an energy threshold of 4.8, the than 1.0. The result suggests that the statistical potential fraction of DNA-binding proteins satisfying the energy is reasonably speciﬁc to DNA-binding proteins. We can, criterion drops slightly to 49%, while only 0.003% (12) therefore, utilize the statistical potential as a feature to of non-DNA-binding protein’s top templates satisfy the discriminate DNA-binding proteins from non-DNA- same criteria. Use of the statistical potential dramatically binding proteins. reduces the number of false positive hits. We have not employed the Z-score of the target sequence/structure relative to the randomized target Characteristic features of DNA-binding proteins sequence aligned to the selected template, as we ﬁnd that For the purpose of discriminating DNA-binding from the performance is essentially the same as when the energy non-DNA-binding proteins, we consider two features: cutoﬀ is used. Since about 25% of DNA-binding residues structure similarity and statistical potential energy. In our are missed on average by the structural alignment and the approach, a target structure is scanned against the DNA sequence is that of the template, the average Z-score template library DB179 for similar structures with TM- of 2.1 for the top Z-score ranked template is not align. Using the TM-score as the structural similarity surprisingly larger (less negative) than that for the native metric, we identify templates that have a score higher than protein–DNA complex whose average is 3.2. Given the a given TM-score threshold. The statistical potential rather small range of Z-scores, this is the origin of the energy is then calculated for the target using the structural comparable performance as to whether an energy cutoﬀ alignment to a qualiﬁed template. The two features were or Z-score criterion is used. examined on DB179 and the non-DNA-binding set NB3797. In the test of DB179, the target structure was Assessment of DNA-binding protein prediction methods excluded from the template library for both the template scanning and the statistical potential derivation. By combining structural comparison with a statistical The distributions of the top TM-score-ranked template potential, we developed DBD-Hunter for DNA-binding for DB179 and NB3797 are shown in Figure 4A. About protein prediction (see Methods section for details). 93% of DNA-binding proteins and 70% of non-DNA- To assess the performance of our method, we compared binding proteins hit at least one template with a TM-score it with three other methods: the SS method, PSI-BLAST higher than a signiﬁcant value of 0.50. As one raises the and TM-align. Figure 5 shows the receiver operator value of the TM-score threshold, the fraction of non- characteristic (ROC) curves and precision-recall (PR) DNA-binding protein structures with a qualiﬁed template curves for benchmark tests on DB179 and NB3797. The decreases more rapidly than that of DNA-binding protein ROC and the PR curves of our method were obtained by structures. At a TM-score threshold of 0.62, only 10% of varying the energy threshold while ﬁxing the TM-score non-DNA-binding proteins have at least one template hit, threshold at 0.62. For the other three methods, the vari- whereas 68% of DNA-binding proteins satisfy this able used to obtain the ROC and PR curves are: the criterion. However, since the size of NB3797 is much threshold deﬁned in ref. (20) for SS, the E-value for PSI- larger than that of DB179, the absolute number of BLAST and the TM-score for TM-align. For comparison, positives from NB3797 is over three times the number of the results of DBD-Hunter using TM-score-dependent positives from DB179. optimized energy threshold, denoted as DBD-Hunter , opt To further reduce false positives, we use the statistical are also provided; the corresponding sensitivity of 58%, potential energy to re-rank the templates preselected from speciﬁcity of 99.5%, precision of 84% and MCC of 0.69 the structural alignment procedure. Figure 4B shows the are the best in our benchmark tests (Table 2). distribution of the top energy ranked template for DB179 Clearly, our method outperforms all other three and NB3797 at a TM-score threshold of 0.62. About 1.3% methods within the low FPR regime (<10%), which is (69) of non-DNA-binding proteins and 57% (102) of relevant for practical applications. The maximum MCCs DNA-binding proteins have a template with a favorable of the four methods are listed in Table 2. DBD-Hunter 3984 Nucleic Acids Research, 2008, Vol. 36, No. 12 Figure 4. Discriminatory feature analysis for DNA-binding and non-DNA-binding proteins. (A) Distribution of the highest TM-score-ranked template on DB179/NB3797. The numbers of template hit are given above the histogram bars. (B) Cumulative faction of the top energy-ranked template versus the statistical potential energy. Only templates higher than the TM-score threshold of 0.62 were considered. Figure 5. Performance comparison of methods for DNA-binding protein prediction. All tests were performed on DB179 and NB3797. The result obtained by DBD-Hunter using optimized threshold parameters is indicated by a star symbol. (A) ROC (sensitivity versus FPR) curves. (B)PR (precision versus sensitivity) curves. Table 2. Comparison of the maximum MCC by four DNA-binding The performance of the SS method is the worst among protein prediction methods on DB179/NB3797 these methods. We note that its FPR is much higher on NB3797 than previous reported FPRs on small control Method Max. MCC Sensitivity FPR Precision sets (20). The threshold used to obtain an FPR of 2% on smaller sets yield an FPR of 5.7% on NB3797. One DBD-Hunter 0.69 0.58 0.005 0.84 opt advantage of our method is that it delivers a high precision DBD-Hunter 0.64 0.49 0.003 0.87 TM-align 0.47 0.52 0.028 0.47 at a reasonable sensitivity. As shown in the PR plot PSI-BLAST 0.56 0.44 0.007 0.76 (Figure 5B), the precision of DBD-Hunter stays at a high SS 0.31 0.40 0.044 0.93 level above 88% for a sensitivity up to 50%. By comparison, none of the other three methods can achieve this level of precision at a sensitivity better than 30%. The achieves the highest maximum MCC of 0.69, compared high precision is important for application to targets on with 0.47 for TM-align, 0.56 for PSI-BLAST and 0.31 for a proteomic scale. SS. As shown in the ROC plot (Figure 5A), the sensitivity of our method jumps to 49% at a very low FPR of 0.3%, Prediction of DNA-binding residues on proteins then gradually increases to 60% at a FPR of 1.6% and Since DBD-Hunter identiﬁes a template for each target, it ﬁnally reaches a plateau of 68% at a FPR of 6.6%. The is attempting to infer DNA-binding sites from the 68% sensitivity limit is due to the TM-score threshold template, whose DNA-binding site is known. The most imposed. If one only considers structural similarity, inferior performance was obtained. For example, TM- straightforward way is to assign DNA-binding function to align gives a sensitivity of 50% with a FPR of 2.8%, which target residues aligned with DNA-binding residues of the is more than nine times the FPR of our method at the template. This approach was conducted on 103 proteins same sensitivity. PSI-BLAST is generally less sensitive predicted as DNA-binding proteins by DBD-Hunter using than the structure-based methods. At a permissive FPR the TM-score-dependent optimal thresholds. These pro- of 4%, PSI-BLAST recognizes about half of the targets. teins include 42 enzymes, 48 transcription factors and Nucleic Acids Research, 2008, Vol. 36, No. 12 3985 residues (15 versus 10 for both of the other two cases) and only one false positive. The second example involves the target, the DBD of catabolite gene activator (36), and the template, the DBD of replication terminator protein (Figure 7B) (37). They share a similar structure, the winged helix DBD, which is a common DNA-binding motif. In fact, the target hits 10 templates with a TM-score higher than 0.62. The top energy-ranked template has the lowest TM-score among these 10 templates, but it made the highest number of correct predictions for DNA-binding residues (12 of 14). In the third example, we examine the target, acute myeloid leukemia 1 protein RUNT domain (38), and the template, the DBD of p53 tumor suppressor (Figure 7C) (39). They closely resemble each other with a b-sandwich fold. Although the DNA-binding sites are located in a largely disordered region composed of two loops and two b-strands, the target was successfully predicted through the template. We note that the same template was hit by Figure 6. Performance on the prediction of DNA-binding residues. A 15 non-DNA-binding proteins above the TM-score total of 103 DNA-binding proteins predicted by DBD-Hunter were threshold of 0.62. Fourteen of these negative cases are examined. The lower, middle and upper quartiles of each box are 25th, correctly classiﬁed as non-DNA-binding by the energy 50th and 75th percentile, respectively. Whiskers extend to a distance of up to 1.5 times the interquartile range. Outliers and averages are criterion, because they exhibit repulsive energies. The only represented by circles and squares, respectively. exception, actinoxantin (PDB code 1acx_), belongs to an antitumour antibiotic chrompoprotein family, whose 13 other types of DNA-binding proteins. For each target members recruit chromophores that cleave DNA sub- structure, we make a binary prediction (DNA binding or strates (40). Although it is not clear whether actinoxantin non-DNA binding) on each residue aligned with the top itself binds to DNA, our prediction suggests that energy-ranked template. Performance measures, sensitiv- actinoxantin binds DNA and that this leads to subsequent ity, speciﬁcity, accuracy and precision were calculated for DNA cleavage by the chromophore. each target structure. The box plot of the results is shown Two restriction endonucleases, HinP1I (41) and MspI in Figure 6. On average, a sensitivity of 72%, a speciﬁcity (42), are presented in the fourth example. Both consist of two domains, aligned with an RMSD of 3.3 A, the largest of 93%, an accuracy of 90% and a precision of 71% were among these examples. The two enzymes extensively obtained. For 81% of the target structures, we achieved interact with DNA; there are 47 DNA-binding residues a precision >60%. The results imply that the closely in the target HinP1I and 36 in the template MspI. Our related target-template pairs were correctly identiﬁed in method successfully identiﬁed 22 DNA-binding residues most cases. and produced nine false positives. In the ﬁfth example, we investigate the target, tran- Examples of DNA-binding protein prediction scriptional repressor CopG (43), and the template, the DBD of methionine repressor MetJ (44). A DNA-binding Six examples of successful predictions by our method are motif, the so-called ribbon-helix-helix motif, is selected. illustrated in Figure 7A–F. In these examples, the The interfacial energy of 7.8 is relatively weak, mainly sequence identity between a target and its template due to the small number of DNA-binding residues ranges from 9% to 17%. The lack of sequence similarity involved. All seven DNA-binding residues of the target makes it diﬃcult for the sequence-based PSI-BLAST are correctly predicted. method to identify these templates. In fact, none of them In the last example, the target, the DBD of Epstein-Barr was hit by PSI-BLAST for the corresponding targets. nuclear antigen 1 (45), hits the template, the DBD of In the ﬁrst example (Figure 7A), the target, the bipartite human papillomavirus-18 E2 (46). The two virus proteins DBD of Tc3 transposase Tc3A (34), consists of two sub- share a low-sequence identity of 10%, yet have high- domains that belong to the homeodomain-like super- structural similarity with a TM-score of 0.75. Their family deﬁned in SCOP. The target hits three templates structure is a ferredoxin-like fold, which was found in above a TM-score threshold of 0.62. As expected, they all many non-DNA-binding proteins. In fact, 41 non-DB share a homeodomain-like structure with a classic HTH proteins from NB3797 hit the same template. All, but one, DNA-binding motif. Each template yielded an interfacial of these false hits were eliminated on the basis of the energy strong enough for making a positive prediction, interfacial energy. despite the fact that only one sub-domain of the target was aligned with the template. The best energy-ranked Conformational changes between apo- and holo-forms template, telomeric protein TRF1 DBD (35), has the lowest TM-score of 0.64 among these three templates, but For any structure-based method for DNA-binding protein it generated the most correct predictions of DNA-binding prediction, it is necessary to examine its performance 3986 Nucleic Acids Research, 2008, Vol. 36, No. 12 Figure 7. Examples of DNA-binding protein predictions on DB179. (A–F) Structural alignment of the target structure and the template in cartoon representations. In each panel, the left snapshot shows the overall alignment, together with the cocrystallized DNA molecules. The color codes for protein and DNA representations are red and purple for the target, and green and cyan for the template, respectively. The right snapshot highlights DNA-binding residues of both the target and the template in the same color code as the left snapshot. Non-DNA-binding residues of the target were dimmed in gray. For a clear view of the binding interface, the two snapshots were taken from diﬀerent orientations. In parentheses, each structure was labeled in the format of xxxxX, where xxxx is the four-digit PDB code and X is the chain identiﬁer of the protein. If the PDB record contains no chain identiﬁer, X is replaced with an underscore. Sequence identity (SID), TM-score (TMS), RMSD and the statistical potential energy E are provided at the bottom of each panel. Graphic images were made with the program VMD (62). on structures determined in the absence of DNA. The conformational changes in the structural aligned regions reason is that the conformational changes occurring on identiﬁed by TM-align. As shown in Supplementary DNA binding may aﬀect the accuracy of the method. To Figure 3, the majority (70%) of pairs have a RMSD global ˚ ˚ address this issue, we have collected 104 pairs of apo- and <3A, but a few (14%) have an RMSD >5A. The global holo-form DNA-binding proteins (APO104/HOLO104) latter are mainly due to ﬂexible termini or relative domain and analyzed their conformational changes. Two RMSD movement of multiple-domain proteins (see examples metrics were calculated: RMSD measures the overall below). If we consider structural alignment instead, the global conformational changes by superposing the two forms in corresponding RMSD is <5A for all pairs and is within TM the sequence aligned regions; and RMSD measures the 3A for 89% of pairs. The corresponding coverage of the TM Nucleic Acids Research, 2008, Vol. 36, No. 12 3987 structural alignment is usually high, better than 90% of the correctly as DNA-binding proteins by PSI-BLAST. DBD- shorter chain for 87% of the pairs (Supplementary Figure 3 Hunter, therefore, is about 60% more sensitive than PSI- insert). The results reveal that apo-to-holo conformational BLAST on APO104. A much more permissive E-value of changes are mostly localized with a RMSD <3A for more 0.001 generates 45 hits for PSI-BLAST, which is still 10% than 70% of DNA-binding proteins. less than the correct predictions made by our method. Prediction of DNA binding using apo-structures Examples of DNA-binding protein prediction on apo-structures We further benchmarked our method on APO104/ HOLO104 using the optimized threshold parameters Four positive predictions on APO104 are illustrated in determined on DB179/NB3797. For a given target, any Figure 9A–D. In these examples, the target apo-forms and template with sequence identity >35% was excluded from their holo-forms exhibit large RMSD values ranging globe the template library and the statistical potential derivation from 3 to 19 A. Despite these signiﬁcant conformational in our tests. As shown in Figure 8A, about the same changes upon DNA binding, the apo-structures were number of APO104 and HOLO104 members hit at least successfully predicted as DNA-binding proteins. one template above the TM-score threshold of 0.52, 94 for The ﬁrst example is the bacteriophage integrase HOLO104 versus 95 for APO104. However, the distribu- protein, a tyrosine recombinase possessing two DBDs, tions of the best structural templates are somewhat the catalytic domain and the central domain (Figure 9A). diﬀerent. The holo target set hit more closely resembled The latter domain is missing in the apo-structure (47), but is present in the holo-structure (48). In the apo-to-holo templates than the apo set did. The latter has 28% less transition, dramatic movement occurred at the C-terminal templates above the TM-score cutoﬀ of 0.68 than the former. In particular, nine holo queries have one template region (residues 331–356) of the catalytic domain, which with a TM-score better than 0.88, but no apo-structure brought a crucial catalytic residue Tyr342 in contact with has a template with such a high level structural similarity. a scissile phosphate of DNA from a distance of 20 A away. Despite the relatively lower structural similarity to their The movement made the major contribution to the large templates, APO104 yielded only slightly fewer number of RMSD of 10 A, because the remainder (residues 170– global correct predictions as HOLO104 by DBD-Hunter. The 330) of the catalytic domain is virtually unchanged with an number of positive predictions is 57 for the holo set and 49 RMSD of 0.4 A. It is the static core region of the target TM for the apo set. These numbers correspond to a sensitivity that allows a hit to a template, the N-terminal domain of 55% for HOLO104 and 47% for APO104, compared of Flp recombinase (49). The target and the template have with the value of 58% observed for DB179. DNA-binding a high TM-score of 0.71, in spite of a low-sequence residues were further inferred from the top-ranked identity of 12%. Major DNA-binding sites, including the template for predicted DNA-binding proteins from the catalytic triad of Arg212-His308-Arg311, were correctly apo/holo sets (Figure 8B), except for target 2frhA that has identiﬁed as DNA-binding residues. Based on the strong a controversial DNA-binding site (see examples below). interfacial energy of 24, a positive prediction was made for the target apo-structure. On average, the predictions yield sensitivities of 68%/ The second example is the Max protein, a transcription 66%, speciﬁcities of 93%/93%, accuracies of 89%/87% and precisions of 67%/66% for HOLO104/APO104. factor from the basic/HLH/zipper (bHLH-Zip) family of Our method was compared with PSI-BLAST on DNA-binding proteins (Figure 9B). Members of this APO104. For a fair comparison, an E-value of 1E-8.5 family form a stable dimeric structure when complexed was chosen for PSI-BLAST such that it provided a similar with DNA, but they are notoriously diﬃcult to stabilize precision rate (82%) to DBD-Hunter (precision rate 84%) under DNA-free conditions. The NMR structures of on DB179/NB3797. Only 31 apo-structures were identiﬁed the apo-form were determined after cross-linking two Figure 8. Prediction of DNA-binding interactions for APO104 and HOLO104. (A) Distribution of the top TM-score ranked templates. Using the statistical potential energy threshold parameters optimized on the benchmark set DB179, DBD-Hunter predicted 48 and 57 targets of DNA-binding proteins for APO104 and HOLO104, respectively. For each predicted DNA-binding protein, DNA-binding residues were predicted. The performance measures of these predictions were shown in (B). The box plots are drawn as in Figure 6. 3988 Nucleic Acids Research, 2008, Vol. 36, No. 12 monomers at the C-termini and introducing two stabiliz- factor of activated T cell (55). Despite such a dramatic ing point mutations (50). As shown in Figure 9B, the apo- in-block movement of the C-terminal domain, p65 was structure closely resembles the holo-form (51), except for correctly classiﬁed as a DNA-binding protein because the ﬁrst 14 N-terminal amino acids of the basic region, most DNA-binding residues located in its N-terminal which are unfolded in the apo-structure but form a helix in domain were correctly identiﬁed through the template. the presence of DNA. Nevertheless, half of the 14 DNA- The last example, the protein SarA, is the most intriguing (Figure 9D). The apo-structure (56) of the binding residues are aligned with DNA-binding residues of the template from the sterol regulatory element binding single-domain transcription factor adopts a dramatically protein (52), and the apo-structure is correctly classiﬁed diﬀerent topology from its holo-structure (57). The as a DNA-binding protein. RMSD is 19 A between these two structures. A nota- global The third example is the p65 subunit (also known as ble diﬀerence is a winged HTH motif present in the apo- RelA) of nuclear factor-kB (Figure 9C). The p65 subunit structure but missing in the holo-form, which instead has consists of two b-sandwich domains connected by a a unique DNA-binding motif. However, it has been sug- ﬂexible 10 amino acid linker. In the DNA-bound form gested that the apo-structure represents the native form of of p65, the N-terminal domain provides most of the DNA- SarA and that the unique DNA-binding mode observed in binding residues, while the C-terminal domain interacts the holo-structure is due to crystallization artifacts (56). with the p50 subunit (not shown) to form a heterodimer In our test, the winged HTH motif of the apo-structure complex (53). The DNA-binding activity of p65 can be was predicted to be DNA-binding by three templates inhibited by IkBa, a protein recognizing p65 that induces (1qbjB, 1sfuA and 1cpgA). In particular, Arg90 is a domain rotation of p65 (54). As shown in Figure 9C, the predicted to be a DNA-binding residue. This is consis- N-terminal domains from the apo- and the holo-structures tent with the mutagenesis study (56), which shows that overlap, whereas the C-terminal domain undergoes about the residue is critical to the DNA-binding function of the a408 rotation around the interdomain linker. Similar SarA. Overall, our prediction provides evidence for the conformational changes have also been observed in the hypothesis that the apo-form structure is functionally alignment of the target to the template NFAT1, a nuclear relevant. Figure 9. Examples of DNA-binding protein prediction on APO104. (A–D) In each panel, the left snapshot shows the structural alignment of the apo-structure and its corresponding holo-structure, and the right snapshot shows the alignment of the target apo-structure versus its template. The apo-, holo- and template structures are colored in red, blue and green, respectively. In (B), all proteins are composed of two monomers. The monomer studied is shown in solid color, while the other monomer is dimmed. PDB codes are given in parentheses. Nucleic Acids Research, 2008, Vol. 36, No. 12 3989 Application to structural genomics targets endonuclease IV due to an altered Zn-binding site (58). DBD-Hunter identiﬁed an endonuclease IV template for Finally, we have applied our method to 1697 protein 1i60A with a high TM-score of 0.76 and predicts that the structures of unknown function determined by the target is non-DNA-binding based on a repulsive statistical structural genomics initiative. The optimized threshold potential energy. Since all but four DNA-binding targets parameters corresponding to a precision of 84% were predicted by DBD-Hunter have sequence identity <25% to employed for this application. A total of 37 targets pre- their templates, it is diﬃcult for PSI-BLAST to identify dicted to be DNA-binding proteins are listed in Table 3. these targets due to low-sequence similarity. In fact, only Fourteen of these targets were previously hypothesized to nine positive predictions are common to both methods. have a function associated with DNA binding, such as Among targets predicted by PSI-BLAST but missed by transcription factor activity. Three targets (1nogA, 1t6sA DBD-Hunter, only one target has a putative function and 1vbkA) have a putative function not related to DNA related to DNA binding. In principle, one can combine binding. These three predictions are probably false these two methods to improve sensitivity. positives. The putative function of the remaining 20 targets was not assigned. By comparison, PSI-BLAST predicted 29 targets as DNA-binding proteins using an E-value of DISCUSSION 1E-8.5, which corresponds to a similar precision of 82% in benchmark tests. Among PSI-BLAST predictions, eight The main goal of the current study is to develop a know- targets have a putative DNA-binding function and two ledge-based method for predicting DNA-binding proteins targets have a putative function not related to DNA and associated DNA-binding residues from structural binding. One (1i60A) of the latter two targets has the fold genomics targets. For this purpose, the method of endonuclease IV, a DNA repair enzyme, but it has had to satisfy three conditions: First, it must be capable been proposed to have a function other than that of of predicting DNA-binding proteins that have low or no Table 3. A list of structural genomics targets predicted as DNA-binding proteins from SG1697 Target Template TM-score RMSD SID Energy Putative function 1j27A 2bdpA 0.63 2.40 0.035 5.4 UK 1nogA 1sknP 0.58 3.28 0.043 15.9 NB 1s7oA 1gdtA 0.67 1.46 0.116 7.1 DB 1sfxA 1h0mD 0.59 2.76 0.155 22.7 DB 1t6sA 1u8rJ 0.65 2.25 0.14 7.3 NB 1tuaA 1jj4A 0.55 2.37 0.123 14.8 UK 1vbkA 1jj4A 0.63 2.21 0.188 5.0 NB 1wi9A 1qbjB 0.68 2.31 0.133 12.2 UK 1wj5A 1qbjB 0.70 2.16 0.177 16.0 UK 1x58A 1w0tA 0.87 1.22 0.275 3.8 DB 1xg7A 2bzfA 0.59 3.46 0.096 13.4 UK 1z7uA 1h0mD 0.61 2.51 0.138 15.6 DB 1zelA 1qbjB 0.61 2.65 0.183 12.1 UK 2da4A 1pufA 0.64 1.72 0.211 29.1 UK 2dceA 1qbjB 0.59 2.14 0.179 10.2 UK 2e1oA 1jkqC 0.55 2.91 0.143 21.4 UK 2eshA 1cgpA 0.62 1.56 0.137 19.4 UK 2esnA 1u8rJ 0.62 2.21 0.121 12.2 DB 2ethA 1u8rJ 0.71 1.84 0.186 17.3 DB 2f2eA 1sfuA 0.65 1.89 0.143 11.7 DB 2ﬁuA 1jj4A 0.66 2.50 0.057 4.9 UK 2fmlA 1qbjB 0.57 2.44 0.075 12.7 UK 2fnaA 1qbjB 0.66 1.83 0.204 5.0 UK 2fyxA 2a6oB 0.79 1.67 0.289 20.9 DB 2hytA 1jt0A 0.75 1.61 0.212 14.1 DB 2g7uA 1cgpA 0.69 1.72 0.20 15.2 DB 2iaiA 1jt0A 0.64 3.36 0.25 5.4 DB 2jn6A 1gdtA 0.70 2.16 0.14 10.7 UK 2nr3A 1sfuA 0.70 2.34 0.103 10.3 UK 2nx4A 1jt0A 0.76 2.07 0.143 2.6 DB 2od5A 1gdtA 0.58 2.36 0.122 9.7 DB 2p8tA 1h0mD 0.57 2.27 0.123 16.8 UK 2pg4A 1z9cF 0.72 2.05 0.181 8.5 UK 2qc0A 1sfuA 0.65 2.18 0.111 9.6 UK 2qvoA 1sfuA 0.72 1.94 0.043 10.5 UK 3b73A 1cgpA 0.65 1.54 0.204 23.7 UK 3bddA 1cgpA 0.65 1.48 0.259 17.4 DB RMSD and sequence identity were calculated for the structurally aligned region between the target and the template with TM- align. Targets are labeled according to their putative function annotated in their PDB records: DB (DNA-binding), NB (function not related to DNA-binding) and UK (unknown). 3990 Nucleic Acids Research, 2008, Vol. 36, No. 12 sequence similarity (<35% identity) to their templates. In previous machine-learning studies (16,17,19,20), If a target has high-sequence similarity (>40%) to any the sizes of the non-DNA-binding protein sets used for template, typically it can be detected using a sequence- training were small, typically ranging from 100 to 250 based method such as PSI-BLAST, which is computa- structures. This raises the concerns that the discriminatory tionally more eﬃcient than structure-based approaches. features may be over-trained and that the FPR may be Second, the method must have a very low FPR because under-estimated as a result. For example, we tested the SS only a small fraction of proteins bind DNA. Assuming method (20) on a much larger control dataset NB3797. that a method with a 10% FPR and 90% sensitivity is Indeed, we found that the FPR on NB3797 is much higher applied to a target set, 10% of which are DNA-binding than that on smaller size data sets. The previously proteins, these numbers translate into a precision rate of reported FPR of 2% increases to a FPR of 5.7% on the about 50%. That is, half of the predictions are false larger set. Since similar features such as the composition positives, which is generally unacceptable for systematic of amino acids and/or the charge/dipole distribution have application on thousands of genomics targets. Third, the also been used in the other studies (16,17,19), the FPRs method has to be validated on structures in the DNA-free reported in these studies should be viewed with caution. form, since all target structures with unknown DNA- A major concern with the use of structure-based methods binding function are solved without DNA present. And is whether discriminatory features derived from holo- the concern that DNA-binding proteins undergo con- structures are transferable to apo-form structures. Two formational changes during the apo-to-holo transition has previous studies have examined performance on small sets to be addressed. In the current study, we have demon- of apo-structures, 52 targets in ref. (20) and 11 targets in ref. strated that DBD-Hunter satisﬁes all three conditions. In (19), and reported similar performance on both holo and benchmark tests, it consistently outperforms the three apo sets. Here, we constructed much larger apo/holo sets other knowledge-based methods: the sequence-based composed of 104 targets for validation. We found that the method PSI-BLAST, the structural-based method TM- sensitivities of our method on these two sets are very close, align and the SS method, which uses both sequence and being just 8% lower on the apo set. The small diﬀerence can structural information. Furthermore, we applied DBD- be understood from structural comparison analysis. 89% Hunter to 1697 structural genomics targets and predicted of apo–holo pairs have an RMSD of <3A in their that 37 proteins bind DNA. structurally aligned region (typically >90% coverage), The current method employs two features, structural which is consistent with the suggestion that the conforma- similarity and the statistical potential energy, for the tional changes of DBDs are mostly localized (59). Notable purpose of discriminating DNA-binding proteins from conformational changes can be categorized into two major non-DNA-binding proteins. Since protein structures with types: (i) refolding at the DNA-binding interface, e.g. the similar function are more likely conserved than their basic region of the leucine-zipper-like protein Max sequences (10), this allows us to identify a target that has (Figure 9B), and (ii) domain reorientation of multiple- low-sequence similarity but high-structural similarity to a domain proteins, e.g. p65 of NF-kB (Figure 9C). The homologous template. In tests on DB179 and APO104, the conformational changes observed at the binding interface structural alignment procedure identiﬁes 60% more DNA- may cause diﬃculty for approaches using strict DNA- binding proteins than PSI-BLAST does. Six pairs of target/ binding motif comparisons. The HTH motif searching template examples from DB179 are given in Figure 7. method, for example, requires an RMSD of <1.6 A bet- Invariably, they have low-sequence identify (<17%) but ween the target and the template (60). Our method is less high-structural similarity (TM-score 0.62). In addition, restrictive because structural comparison is performed for the vast majority of negatives were ﬁltered out during the the entire DBD, the core region of which may have structural comparison procedure. In the test on NB3797, relatively small conformational changes. This is reﬂected 65% negatives were eliminated by structure alone. by the similar performance on APO/HOLO104 sets, a few To achieve high accuracy, however, structural similarity examples of which are provided in Figure 9. In one rare to known DNA-binding proteins is not enough. We note case, the transcription factor SarA adopts diﬀerent folds in that the one-third of non-DNA-binding proteins from the apo- and holo-forms (56,57). Surprisingly, a winged NB3797 have a signiﬁcant structural alignment to DNA- HTH DNA-binding motif was observed in the apo- binding proteins with a TM-score higher than 0.55. structure but not in the holo-structure. Our method To further reduce false positives, an interfacial potential correctly identiﬁed the DNA-binding region of the apo- has been introduced. The potentials are speciﬁc to DNA- structure, including an Arginine crucial to the DNA- binding proteins with an average Z-score of 3.2 in the binding function of SarA. The holo-structure, however, did randomized sequence test. By requiring that the target not yield a positive prediction. Our results support the view structure not only be structurally similar to a known DNA- that the apo-form of SarA more likely represents the native binding protein but that it also has a favorable interfacial conformation (56). potential, we reduced the number of false positives from One advantage of a template-based approach is that one 1327 to 19 in the test on NB3797, corresponding to an can infer functionally relevant details from the template. extremely low FPR of 0.5%. By comparison, FPRs ranging For example, speciﬁc DNA-binding sites can be trans- from 5% to 20% were reported in previous studies ferred from the template. In this respect, DBD-Hunter (16,17,19,20). Due to the reasons mentioned above, these achieves an average sensitivity of 66% and an average high FPRs limit the potential application of these methods accuracy of 87% on predicted DNA-binding proteins to structural genomics targets. in their DNA-free conformation forms (Figure 8B). Nucleic Acids Research, 2008, Vol. 36, No. 12 3991 2. Lee,D., Redfern,O. and Orengo,C. (2007) Predicting protein By comparison, machine-learning algorithms designed function from sequence and structure. Nat. Rev. Mol. Cell Biol., 8, speciﬁcally for DNA-binding site prediction provide an 995–1005. average accuracy ranging from 60% to 82% (21–23). 3. Watson,J.D., Laskowski,R.A. and Thornton,J.M. (2005) Predicting Worldwide structural genomics centers have deposited protein function from sequence and structural data. Curr. Opin. Struct. Biol., 15, 275–284. thousands of protein structures in the PDB. It is of great 4. Whisstock,J.C. and Lesk,A.M. (2003) Prediction of protein function importance to characterize the functions of these targets. from protein sequence and structure. Q. Rev. Biophys., 36, 307–340. With respect to DNA-binding protein prediction, only one 5. Altschul,S.F., Madden,T.L., Schaﬀer,A.A., Zhang,J.H., Zhang,Z., method has been applied to structural genomics targets so Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI- far, despite numerous methods proposed. In their report, BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402. Jones et al. predicted one DNA-binding protein from 30 6. Krogh,A., Brown,M., Mian,I.S., Sjolander,K. and Haussler,D. targets using a structural motif-based approach (60), (1994) Hidden markov models in computational biology – which is limited to DBDs with a HTH motif. In the applications to protein modeling. J. Mol. Biol., 235, 1501–1531. current study, we have applied our method to 1697 7. Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. structural genomics targets and identiﬁed 37 potential Protein Eng., 11, 739–747. DNA-binding proteins. To our knowledge, this is the ﬁrst 8. Holm,L. and Sander,C. (1993) Protein structure comparison by time a structural-based method for DNA-binding protein alignment of distance matrices. J. Mol. Biol., 233, 123–138. prediction has been systematically applied on a genome 9. Zhang,Y. and Skolnick,J. (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res., 33, scale. These predictions provide valuable clues for 2302–2309. assessing protein function experimentally, and it is of 10. Chothia,C. and Lesk,A.M. (1986) The relation between the great interest to conduct experimental validations of these divergence of sequence and structure in proteins. EMBO J., 5, predictions in the future. 823–826. Like all knowledge-based approaches, our method 11. Skolnick,J. and Fetrow,J.S. (2000) From genes to protein structure and function: novel applications of computational is limited by the completeness of the template library. approaches in the genomic era. Trends Biotechnol., 18, 34–39. It cannot predict DNA-binding proteins with novel struc- 12. Luscombe,N.M., Austin,S.E., Berman,H.M. and Thornton,J.M. tures or binding modes that are not included in the (2000) An overview of the structures of protein-DNA complexes. template library, which is the main disadvantage of the Genome Biol., 1, REVIEWS001. 13. Kono,H. and Sarai,A. (1999) Structure-based prediction of DNA current approach. target sites by regulatory proteins. Proteins Struct. Funct. Genet., 35, Future work entails the extension of the methodology 114–131. to the case when the structure of the protein is not yet 14. Luscombe,N.M., Laskowski,R.A. and Thornton,J.M. (2001) Amino solved but has to be predicted. Here, an unanswered acid-base interactions: a three-dimensional analysis of protein-DNA question is how good the predicted structure has to be to interactions at an atomic level. Nucleic Acids Res., 29, 2860–2874. 15. Mandel-Gutfreund,Y. and Margalit,H. (1998) Quantitative para- provide for the accurate prediction of DNA binding. meters for amino acid-base interaction: implications for prediction Another issue is to attempt to model the conformation of protein-DNA binding sites. Nucleic Acids Res., 26, 2306–2312. change on DNA binding. While this is not crucial for the 16. Ahmad,S. and Sarai,A. (2004) Moment-based prediction of DNA- majority of known DNA-binding proteins, it is important binding proteins. J. Mol. Biol., 341, 65–71. 17. Bhardwaj,N., Langlois,R.E., Zhao,G.J. and Lu,H. (2005) Kernel- for a signiﬁcant minority of cases. One promising based machine learning protocol for predicting DNA-binding approach is the extension of TASSER, a protein structure proteins. Nucleic Acids Res., 33, 6486–6493. prediction algorithm (61), to include DNA binding. Thus, 18. Shanahan,H.P., Garcia,M.A., Jones,S. and Thornton,J.M. (2004) while DBD-Hunter is a promising approach, additional Identifying DNA-binding proteins using structural motifs and the electrostatic potential. Nucleic Acids Res., 32, 4732–4741. extensions and improvements are required to increase its 19. Stawiski,E.W., Gregoret,L.M. and Mandel-Gutfreund,Y. (2003) range of applicability. Annotating nucleic acid-binding function based on protein structure. J. Mol. Biol., 326, 1065–1079. 20. Szilagyi,A. and Skolnick,J. (2006) Eﬃcient prediction of nucleic acid binding function from low-resolution protein structures. J. Mol. SUPPLEMENTARY DATA Biol., 358, 922–933. Supplementary Data are available at NAR Online. 21. Ahmad,S., Gromiha,M.M. and Sarai,A. (2004) Analysis and prediction of DNA-binding proteins and their binding residues based on composition, sequence and structural information. Bioinformatics, 20, 477–486. ACKNOWLEDGEMENTS 22. Bhardwaj,N. and Lu,H. (2007) Residue-level prediction of DNA- binding sites and its application on DNA-binding protein predic- We thank Dr Shashi Pandit for stimulating discussions. tions. FEBS Lett., 581, 1058–1066. This work was supported in part by NIH grant GM- 23. Kuznetsov,I.B., Gou,Z.K., Li,R. and Hwang,S.W. (2006) Using evolutionary and structural information to predict DNA-binding 37408. Funding to pay the Open Access publication sites on DNA-binding proteins. Proteins Struct. Funct. Bioinform., charges for this article was provided by NIH. 64, 19–27. 24. Donald,J.E., Chen,W.W. and Shakhnovich,E.I. (2007) Energetics of Conﬂict of interest statement. None declared. protein-DNA interactions. Nucleic Acids Res., 35, 1039–1047. 25. Liu,Z.J., Mao,F.L., Guo,J.T., Yan,B., Wang,P., Qu,Y.X. and Xu,Y. (2005) Quantitative evaluation of protein-DNA interactions using an optimized knowledge-based potential. Nucleic Acids Res., REFERENCES 33, 546–558. 1. Burley,S.K. (2000) An overview of structural genomics. Nat. Struct. 26. Robertson,T.A. and Varani,G. (2007) An all-atom, distance- Biol., 7, 932–934. dependent scoring function for the prediction of protein-DNA 3992 Nucleic Acids Research, 2008, Vol. 36, No. 12 interactions from structure. Proteins Struct. Funct. Bioinform., 66, 44. Garvie,C.W. and Phillips,S.E.V. (2000) Direct and indirect readout 359–374. in mutant Met repressor-operator complexes. Structure, 8, 905–914. 27. Myers,E.W. and Miller,W. (1988) Optimal alignments in linear 45. Bochkarev,A., Bochkareva,E., Frappier,L. and Edwards,A.M. space. Comput. Appl. Biosci., 4, 11–17. (1998) The 2.2 angstrom structure of a permanganate-sensitive 28. Hubbard,T.J.P., Ailey,B., Brenner,S.E., Murzin,A.G. and DNA site bound by the Epstein-Barr virus origin binding protein, Chothia,C. (1998) SCOP, structural classiﬁcation of proteins EBNA1. J. Mol. Biol., 284, 1273–1278. database: applications to evaluation of the eﬀectiveness of 46. Kim,S.S., Tam,J.K., Wang,A.F. and Hegde,R.S. (2000) The sequence alignment methods and statistics of protein structural structural basis of DNA target discrimination by papillomavirus E2 data. Acta Crystallogr. D Biol. Crystallogr., 54, 1147–1154. proteins. J. Biol. Chem., 275, 31245–31254. 29. Skolnick,J., Kihara,D. and Zhang,Y. (2004) Development and large 47. Kwon,H.J., Tirumalai,R., Landy,A. and Ellenberger,T. (1997) scale benchmark testing of the PROSPECTOR_3 threading Flexibility in DNA recombination: structure of the lambda algorithm. Proteins Struct. Funct. Bioinform., 56, 502–518. integrase catalytic core. Science, 276, 126–131. 30. Li,W.Z. and Godzik,A. (2006) CD-HIT: a fast program for 48. Aihara,H., Kwon,H.J., Nunes-Duby,S.E., Landy,A. and clustering and comparing large sets of protein or nucleotide Ellenberger,T. (2003) A conformational switch controls the DNA sequences. Bioinformatics, 22, 1658–1659. cleavage activity of lambda integrase. Mol. Cell, 12, 793. 31. Sippl,M.J. (1995) Knowledge-based potentials for proteins. Curr. 49. Conway,A.B., Chen,Y. and Rice,P.A. (2003) Structural plasticity Opin. Struct. Biol., 5, 229–235. of the Flp-Holliday junction complex. J. Mol. Biol., 326, 425–434. 32. Lu,H., Lu,L. and Skolnick,J. (2003) Development of uniﬁed 50. Sauve,S., Tremblay,L. and Lavigne,P. (2004) The NMR solution statistical potentials describing protein-protein interactions. Biophys. structure of a mutant of the max b/HLH/LZ free of DNA: insights J., 84, 1895–1901. into the speciﬁc and reversible DNA binding mechanism of dimeric 33. Matthews,B.W. (1975) Comparison of predicted and observed transcription factors. J. Mol. Biol., 342, 813–832. secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 51. Nair,S.K. and Burley,S.K. (2003) X-ray structures of Myc-Max and 405, 442–451. Mad-Max recognizing DNA: molecular bases of regulation by 34. Watkins,S., van Pouderoyen,G. and Sixma,T.K. (2004) Structural proto-oncogenic transcription factors. Cell, 112, 193–205. analysis of the bipartite DNA-binding domain of Tc3 transposase 52. Parraga,A., Bellsolell,L., Ferre-D’Amare,A.R. and Burley,S.K. bound to transposon DNA. Nucleic Acids Res., 32, 4306–4312. (1998) Co-crystal structure of sterol regulatory element binding 35. Court,R., Chapman,L., Fairall,L. and Rhodes,D. (2005) How the protein 1a at 2.3 angstrom resolution. Structure, 6, 661–672. human telomeric proteins TRF1 and TRF2 recognize telomeric 53. Chen,F.E., Huang,D.B., Chen,Y.Q. and Ghosh,G. (1998) Crystal DNA: a view from high-resolution crystal structures. EMBO Rep., structure of p50/p65 heterodimer of transcription factor NF-kappa 6, 39–45. B bound to DNA. Nature, 391, 410–413. 36. Schultz,S.C., Shields,G.C. and Steitz,T.A. (1991) Crystal structure 54. Huxford,T., Huang,D.B., Malek,S. and Ghosh,G. (1998) of a CAP-DNA Complex – the DNA is bent by 90 degrees. Science, The crystal structure of the I kappa B alpha/NF-kappa B 253, 1001–1007. complex reveals mechanisms of NF-kappa B inactivation. Cell, 37. Wilce,J.A., Vivian,J.P., Hastings,A.F., Otting,G., Folmer,R.H.A., 95, 759–770. Duggin,I.G., Wake,R.G. and Wilce,M.C.J. (2001) Structure of the 55. Giﬃn,M.J., Stroud,J.C., Bates,D.L., von Koenig,K.D., Hardin,J. RTP-DNA complex and the mechanism of polar replication fork and Chen,L. (2003) Structure of NFAT1 bound as a dimer to the arrest. Nat. Struct. Biol., 8, 206–210. HIV-1 LTR kappa B element. Nat. Struct. Biol., 10, 800–806. 38. Tahirov,T.H., Inoue-Bungo,T., Morii,H., Fujikawa,A., Sasaki,M., 56. Liu,Y.F., Manna,A.C., Pan,C.H., Kriksunov,I.A., Thiel,D.J., Kimura,K., Shiina,M., Sato,K., Kumasaka,T., Yamamoto,M. et al. Cheung,A.L. and Zhang,G.Y. (2006) Structural and function (2001) Structural analyses of DNA recognition by the AML1/ analyses of the global regulatory protein SarA from Staphylococcus Runx-1 Runt domain and its allosteric control by CBF beta. Cell, aureus. Proc. Natl Acad. Sci. USA, 103, 2392–2397. 104, 755–767. 57. Schumacher,M.A., Hurlburt,B.K. and Brennan,R.G. (2001) 39. Cho,Y.J., Gorina,S., Jeﬀrey,P.D. and Pavletich,N.P. (1994) Crystal Crystal structures of SarA, a pleiotropic regulator of virulence genes structure of a P53 tumor suppressor DNA complex – understanding in S-aureus. Nature, 409, 215–219. tumorigenic mutations. Science, 265, 346–355. 58. Zhang,R.G., Dementieva,I., Duke,N., Collart,F., Quaite- 40. Tanaka,T., Fukuda-Ishisaka,S., Hirama,M. and Otani,T. (2001) Randall,E., Alkire,R., Dieckman,L., Maltsev,N., Korolev,O. and Solution structures of C-1027 apoprotein and its complex with the Joachimiak,A. (2002) Crystal structure of Bacillus subtilis IolI aromatized chromophore. J. Mol. Biol., 309, 267–283. shows endonuclase IV fold with altered Zn binding. Proteins Struct. 41. Horton,J.R., Zhang,X., Maunus,R., Yang,Z., Wilson,G.G., Funct. Genet., 48, 423–426. Roberts,R.J. and Cheng,X.D. (2006) DNA nicking by HinP1I 59. Wright,P.E. and Dyson,H.J. (1999) Intrinsically unstructured endonuclease: bending, base ﬂipping and minor groove expansion. proteins: re-assessing the protein structure-function paradigm. Nucleic Acids Res., 34, 939–948. J. Mol. Biol., 293, 321–331. 42. Xu,Q.S., Roberts,R.J. and Guo,H.C. (2005) Two crystal forms of 60. Jones,S., Barker,J.A., Nobeli,I. and Thornton,J.M. (2003) Using the restriction enzyme MspI-DNA complex show the same novel structural motif templates to identify proteins with DNA binding structure. Protein Sci., 14, 2590–2600. function. Nucleic Acids Res., 31, 2811–2823. 43. Costa,M., Sola,M., del Solar,G., Eritja,R., Hernandez- 61. Zhang,Y. and Skolnick,J. (2004) Automated structure prediction Arriaga,A.M., Espinosa,M., Gomis-Ruth,F.X. and Coll,M. (2001) of weakly homologous proteins on a genomic scale. Proc. Natl Plasmid transcriptional repressor CopG oligomerises to render Acad. Sci. USA, 101, 7594–7599. helical superstructures unbound and in complexes with 62. Humphrey,W., Dalke,A. and Schulten,K. (1996) VMD: visual oligonucleotides. J. Mol. Biol., 310, 403–417. molecular dynamics. J. Mol. Graph., 14, 33–38.

Journal

Nucleic Acids Research – Oxford University Press

Published: Jul 31, 2008

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions

DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions

Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 7-Day Trial for You or Your Team.

Learn More →

DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions

DBD-Hunter: a knowledge-based method for the prediction of DNA–protein interactions

References (68)

Abstract

Journal

Recommended Articles

There are no references for this article.

Our policy towards the use of cookies