UniPROBE: an online database of protein binding microarray data on protein–DNA interactions

Daniel E. Newburger; Martha L. Bulyk

doi:10.1093/nar/gkn660

Nucleic Acids Research, 2009, Vol. 37, Database issue D77–D82
Published online 8 October 2008

Daniel E. Newburger (1) and Martha L. Bulyk (1,2,3,*)

(1) Division of Genetics, Department of Medicine, (2) Department of Pathology, Brigham and Women's Hospital and Harvard Medical School and (3) Harvard-MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115, USA

Received August 15, 2008; Revised September 18, 2008; Accepted September 21, 2008

*To whom correspondence should be addressed. Tel: +1 617 525 4725; Fax: +1 617 525 4705; Email: [email protected]

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The UniPROBE (Universal PBM Resource for Oligonucleotide Binding Evaluation) database hosts data generated by universal protein binding microarray (PBM) technology on the in vitro DNA-binding specificities of proteins. This initial release of the UniPROBE database provides a centralized resource for accessing comprehensive PBM data on the preferences of proteins for all possible sequence variants ('words') of length k ('k-mers'), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In total, the database hosts DNA-binding data for over 175 nonredundant proteins from a diverse collection of organisms, including the prokaryote Vibrio harveyi, the eukaryotic malarial parasite Plasmodium falciparum, the parasitic Apicomplexan Cryptosporidium parvum, the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, mouse and human. Current web tools include a text-based search, a function for assessing motif similarity between user-entered data and database PWMs, and a function for locating putative binding sites along user-entered nucleotide sequences. The UniPROBE database is available at http://thebrain.bwh.harvard.edu/uniprobe/.

INTRODUCTION

The characterization of transcription factors' (TFs') DNA-binding specificities represents a critical step towards understanding the regulation of gene expression and elucidating the biophysical properties governing protein–DNA interactions. The study of DNA-binding specificities therefore has profound implications for the analysis and prediction of the regulatory networks that govern intracellular function, responses to external stimuli, differentiation and development in an organism. Despite recent advances in this field, the vast majority of TFs in most major model organisms and pathogens remain either uncharacterized or poorly described (1).

The development of universal (2) protein binding microarray (PBM) technology (3) (Figure 1) offers a new avenue for the exploration of protein–DNA binding specificity. Universal PBMs provide an efficient and comprehensive method for in vitro interrogation of DNA-binding preferences. PBM technology complements other currently available technologies, such as chromatin immunoprecipitation coupled with either microarray readout (4–7) or high-throughput sequencing (8–10), that identify genomic regions bound in vivo.

Universal PBMs achieve comprehensive, high-resolution determination of proteins' DNA-binding preferences by measuring the binding preferences of a protein over all possible k-mers of a given length (2,11). Currently employed custom array designs contain a set of 60-bp DNA probes that encompass all possible permutations of either 9 (Bulyk Lab, unpublished data) or 10 bp (12), depending upon the microarray design (2,12). In addition to covering all contiguous 9-mers or 10-mers, these array designs also offer an extensive set of gapped permutations that provide coverage of binding sites of greater length. Together, these data can be synthesized to produce high-confidence measurements of the relative preferences of a protein for all possible sequence variants belonging to a wide range of k-mer patterns typically found in TF binding site motifs (2,12). PBM enrichment scores from the PBM signal intensity data are typically calculated for each of the more than 2.3 million 8-mers (i.e. binding site 'words' with eight informative nucleotide positions, including all contiguous 8-mers and a large collection of gapped 8-mers). These 8-mers encompass the full affinity range of DNA binding preferences, from the most preferentially bound k-mers to low-affinity k-mers and nonspecifically associated sequences (2).

The TRANSFAC (13) and JASPAR (14) databases contain hundreds of matrices constructed from DNA binding site data from various data types; indeed, a given position weight matrix (PWM) in TRANSFAC frequently is derived from binding sequence data compiled from multiple experimental methods, which include lower throughput approaches such as gel retardation (i.e. electrophoretic mobility shift assays), DNase I footprinting, immunoprecipitation, supershift assays and methylation protection and higher throughput approaches such as in vitro selection (15) (SELEX). The PAZAR database (16) is a meta-database that contains TF binding site data. It contains PWMs from the JASPAR core database and in vivo TF binding site data and cis-regulatory module information from various sources, including other databases. A review of databases of cis-regulatory modules is beyond the scope of this article.

The universal PBM technology has several key advantages over in vitro selection approaches, such as SAGE–SELEX (17). SELEX does have the capability to interrogate sequences spanning a wide range of affinities, but it requires a significant increase in cost and labor to achieve the necessary depth of sequencing. Moreover, SELEX data have limited sensitivity because one cannot distinguish DNA binding site sequence variants missing from the collected data from those that are truly not bound by the given TF. In a survey of all SELEX datasets in the 2006 JASPAR database, we found that the median total number of binding site sequences in a JASPAR SELEX dataset is just 28, while the median number of nonredundant binding site sequences is just 11. In contrast, the DNA binding profiles obtained from our PBM experiments provide information on the direct binding preferences of a given protein over all k-mer DNA binding sequence variants, measured in vitro using the universal PBM technology; the number of sequence variants examined is limited only by the number of features on the microarray. In addition to the k-mer binding profiles, these procedures also provide DNA binding sequence PWMs derived from the k-mer data using our Seed-and-Wobble algorithm (2).

The UniPROBE database hosts the high-resolution DNA binding profiles obtained from PBM experiments on known and predicted TFs (2,3,18–21). The database currently contains DNA binding profiles for many proteins not included in similar databases such as JASPAR (14) and TRANSFAC (13), and it offers several tools for searching the database and analyzing user-defined binding profiles or DNA sequences. The resources and analysis tools offered by the UniPROBE database promise to facilitate previously untenable, downstream genomic analyses, and we anticipate that it will represent an important genomic resource as additional PBM data are compiled.

Figure 1. Universal PBM schema. Universal PBMs containing all possible 10-mers within 60-mer probes are first synthesized as single-stranded oligonucleotide arrays, to which a common primer is annealed and extended in order to biochemically convert the single-stranded array to a double-stranded DNA (dsDNA) array (these steps are not shown in the figure) (2). The dsDNA array is then bound by protein, stained with a fluorophore-conjugated antibody, and scanned; the quantified array data are then normalized by the relative amounts of DNA in each spot, and used to calculate k-mer binding data (2). PWMs can be calculated either from the k-mer binding data using our Seed-and-Wobble algorithm (2) or from the 60-mer probe data using other motif finding algorithms (34).

DATABASE CONSTRUCTION

The UniPROBE database is managed by a MySQL relational database that provides the back-end for user queries and facilitates the data retrieval necessary for the site's analysis tools. All HTML pages are dynamically generated by PHP scripts hosted on an Apache server, and several JavaScript libraries provide interactive interfaces that facilitate site navigation and form accessibility. The Apache server also hosts all downloadable data, available by HTTP connection.

DATABASE CONTENT

The UniPROBE database hosts the results of PBM experiments, subsequent computational analyses performed on these data, and protein annotations. The site currently hosts PBM data for over 175 nonredundant proteins from a wide range of organisms, including the prokaryote Vibrio harveyi, the eukaryotic malarial parasite Plasmodium falciparum, the parasitic Apicomplexan Cryptosporidium parvum, the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, mouse and human (2,18,20,21). These data already encompass the majority of mouse homeodomain TFs and will soon include more than 100 additional mouse proteins (labs of Bulyk, M.L. and Hughes, T.R., unpublished results), nearly 90 additional S. cerevisiae proteins (labs of Bulyk, M.L. and LaBaer, J., unpublished results), and over 20 additional C. elegans proteins (labs of Bulyk, M.L. and Walhout, A.J., unpublished results). The UniPROBE database will additionally host data for Drosophila melanogaster TFs from ongoing projects in the Bulyk laboratory, and we anticipate the addition of several datasets from other laboratories using this microarray technology.

For each DNA-binding protein, the server holds several different data types, including: (i) unprocessed 60-mer probe signal intensity data; (ii) normalized probe intensities; (iii) TF-binding DNA profile representations and (iv) publication-specific data. The unprocessed (or 'raw') Agilent array data include information on probe position, 60-mer probe sequence, and Cy3 (DNA, from incorporated Cy3-dUTP) and Alexa 488 (protein, from Alexa 488 conjugated anti-GST antibody) signal intensities, the latter of which are necessary for accurately assessing relative DNA binding (2). The normalized probe intensities are derived from the raw data after adjusting for relative DNA concentrations at each spot and for spatial nonuniformities within the microarray. The in-house software used for normalization and subsequent binding profile generation will soon be available for download as the Universal PBM Analysis Suite (11).

The database offers several different binding profile representations for assessing TF specificity and affinity. First, we use our Seed-and-Wobble algorithm (2) to generate PWM motifs, which represent the observed probability of finding a given nucleotide in a given position within a TF's DNA target site. PWMs currently serve as one of the primary methods for quantitative representation of DNA binding site motifs (13,14,22), and they are useful for creating graphical sequence logos (23,24) and for performing sequence analysis using any of several published software tools (22,25). Graphical sequence logos of the Seed-and-Wobble motifs, generated using the algorithms defined by enoLOGOS (24), are also present in the UniPROBE database.

Second, we provide two k-mer-based DNA binding profiles for each TF. The universal PBM designs facilitate k-mer binding profile construction because they allow for full coverage of all 8-mers of width 12 or less. The k-mer profiles have several advantages over the traditional PWM model. For example, comprehensive coverage of ungapped and gapped 8-bp sequence variants can provide insight into nucleotide interdependence within DNA binding site sequences, whereas mononucleotide independence is implicit in traditional PWMs (26,27). The database's first k-mer-based binding profile consists of the median signal intensities and PBM enrichment scores associated with each contiguous 8-mer, where enrichment scores are calculated using a variant of the Wilcoxon–Mann–Whitney statistic and range from 0.5 for the most favored k-mers to −0.5 for the most disfavored k-mers. The second k-mer-based binding profile includes the top-scoring gapped 8-mer patterns (up to 10 positions) as determined by a 0.25 enrichment score cutoff; we used this threshold to avoid excessive file size for the gapped pattern profile and note that the Universal PBM Analysis Suite (11) can be used to generate full profiles for all gapped 8-mer patterns up to 12-nt positions in length.

In addition to these PBM data files, the UniPROBE database also provides relevant experimental information and factor-specific annotations. Experimental features include protein expression method, sequence and construct information (i.e. full-length protein or DNA-binding domain only). Factor annotations include functional descriptions and links to a variety of protein and gene reference databases. The website structure and interface section subsequently discusses these external annotations in further detail.

WEBSITE STRUCTURE AND INTERFACE

The Browse page of the UniPROBE database site presents a table containing each hosted TF, that protein's structural class, and the publication with which the protein's PBM data are associated. The entries are accompanied by brief descriptions of protein function retrieved from IHOP (28) or from a species-specific database such as SGD (29) or WormBase (30). The table then presents a link to a zipped file containing all factor-associated PBM data and a link to a Details page containing further factor annotations and a display of relevant features from the PBM experiments.

The Details page (Figure 2) described above has several components. The first section (Figure 2A) provides additional annotations for the factor of interest, including unique gene or protein accession numbers (if available as provided by the species-specific database), gene synonyms, DNA-binding domain amino acid sequence (if available) and links to databases such as IHOP (28), RefSeq (31), UniProt (32) and JASPAR (14). The second section (Figure 2B) displays the PWM and the matrix motif logo derived from Seed-and-Wobble analysis of PBM k-mer data. Although this section does not directly display k-mer data, it provides links below the motif logo to the data files containing complete k-mer data, normalized 60-mer probe data, and raw probe data. The final section of the Details page (Figure 2C) displays a table that presents the experimental conditions and protein sequences used to produce each PBM dataset for the TF of interest.

In addition to the Browse and Details pages, several other pages facilitate the download of specific materials. The Downloads page distributes ZIP files containing all instances of specific data types (i.e. all PWMs or all raw data) in the entire database, along with ZIP files of the data associated with each given publication. The Downloads page also provides links to the core SQL tables used by the database and documentation for these tables. As an alternative method of accessing UniPROBE files, one can use the Apache HTTP Server index to browse for files or use the Explore page to view the directory structure in a Microsoft Explorer style interface.

Most pages in the database website also provide forms for performing text searches on the database and for interrogating PBM-derived binding profiles. The following section describes these tools in detail.

Figure 2. Details Page for the Mus musculus TF Hdx. This page includes (A) gene and protein annotations for Hdx, (B) PBM-derived motif data for the factor and (C) PBM experimental information for the Hdx data.

SEARCH AND ANALYSIS TOOLS

Several tools for conducting database searches and performing analyses on TF k-mer binding profiles (Figure 3) enhance the database's utility. A simple search field in the top right corner of the site's horizontal navigation bar provides a modified full-text MySQL search across species names, gene names, synonyms and annotations. Under the Advanced Search navigation bar, a customizable text search allows the user to enter multiple key words within specific database fields for a higher precision query. These terms may be linked by AND or OR Boolean operators by selecting the Match All or Match Any radio buttons, respectively (Figure 3A). The Browse page presents the match results (Figure 3B) for both search methods in the same table format used by the default Browse display, which provides basic annotations for each matching protein and links to download or view the associated PBM data.

The Advanced Search bar also contains two analysis tools available for online use. The first tool, which uses the Tomtom program (33) from the Meta-MEME suite (22), provides a platform for comparing standard motif representations against the PBM-derived PWMs in the database (Figure 3C). The user may enter or upload up to 20 binding site representations in one of the following formats: frequency matrix, count matrix, Meta-MEME 3.x motif or IUPAC motif. As additional options, the user can restrict the query to PWMs from a particular species, specify a maximum similarity threshold, set a minimum matrix overlap or choose the comparison algorithm. Available algorithms include Euclidean distance, Pearson correlation, Kullback–Leibler divergence and the Sandelin–Wasserman function, which are described in detail in the documentation for the Tomtom program (33). Upon query submission, the tool returns a table of statistics describing the best scoring alignment between a given pair of motifs, a Tomtom E-value (33) that quantifies their similarity and a graphical alignment of the two motifs' logos (Figure 3D). This table also provides links to each matching factor's Details page for further investigation of the PBM data.

The second analysis tool uses contiguous 8-mer PBM enrichment score data to scan user-supplied input DNA sequences for putative TF binding sites (Figure 3E). To perform this DNA scan, the user must first upload or enter a FASTA file containing up to 30 DNA sequences, each of a 10-kb maximum permissible length. The user then specifies the species of interest and an enrichment score threshold, and the tool scans the input DNA sequence using an 8-bp sliding window to detect whether any TFs from the species of interest have enrichment scores greater than the user's threshold for that particular sequence. The website displays the results of the scan both as an HTML table available for plain-text download and as a simple graphic indicating binding site position and TF identity (Figure 3F). This tool may be useful not only for generating hypotheses about putative regulatory interactions but also for minimizing the unintentional creation of new binding sites for unrelated factors when designing site-directed mutagenesis experiments.

Figure 3. Database search tools and formatted query results. Search options include (A) a text-based search, (B) a tool for comparing standard motif representations against PBM-derived motifs in the database using the Tomtom program (33) from the Meta-MEME suite (22) and (C) a tool for scanning FASTA-formatted nucleotide sequences for matches to TF 8-mer binding profiles in the UniPROBE database. The Database Browser table formats the search results for viewing by (D) highlighting text search term matches, (E) presenting a graphical view of motif alignment and (F) illustrating 8-mer binding site matches along the input sequence (x-axis).

FUTURE DIRECTIONS

The upcoming publication of several large PBM datasets of yeast, fly and mouse TFs will contribute significantly to the breadth of coverage in the database. We encourage users of the database to register at http://thebrain.bwh.harvard.edu/uniprobe/register.php to receive updates concerning the addition of new datasets and changes to the database interface or analysis tools. We also encourage other labs generating universal PBM data to contact us by email if they wish to add their data to the UniPROBE database following the acceptance of their data for publication. The development of several additional tools may also enhance the website, and they may include a download manager, a local BLASTP function for identifying matches in our database to a user-specified query protein, and a DNA-binding preference prediction tool for user-specified query proteins.

AVAILABILITY AND LICENSE

All data hosted by the UniPROBE database are freely available for distribution at the database website. The sequences of the 60-mer DNA probes synthesized on our custom-designed universal arrays are available under the terms of the academic research use license described at http://thebrain.bwh.harvard.edu/uniprobe/academic-license.php. All pages have been tested under Firefox 2.0, Firefox 3.0 and Internet Explorer 7.

ACKNOWLEDGEMENTS

We thank A. Philippakis and F. S. He for the survey of SELEX datasets in the 2006 JASPAR database, I. Adzhubey for technical assistance, M. Berger for helpful discussions and S. Gisselbrecht, R. P. McCord, A. Gehrke, and A. Aboukhalil for critical reading of the article.

FUNDING

National Institutes of Health (R01 HG003985 to M.L.B.). Funding for open access charge: National Institutes of Health (R01 HG003985 to M.L.B.).

Conflict of interest statement. None declared.

REFERENCES

1. Bulyk,M.L. (2006) DNA microarray technologies for measuring protein–DNA interactions. Curr. Opin. Biotechnol., 17, 422–430.
2. Berger,M.F., Philippakis,A.A., Qureshi,A.M., He,F.S., Estep,P.W. 3rd and Bulyk,M.L. (2006) Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol., 24, 1429–1435.
3. Mukherjee,S., Berger,M.F., Jona,G., Wang,X.S., Muzzey,D., Snyder,M., Young,R.A. and Bulyk,M.L. (2004) Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat. Genet., 36, 1331–1339.
4. Reid,J.L., Iyer,V.R., Brown,P.O. and Struhl,K. (2000) Coordinate regulation of yeast ribosomal protein genes is associated with targeted recruitment of Esa1 histone acetylase. Mol. Cell, 6, 1297–1307.
5. Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., Simon,I., Zeitlinger,J., Schreiber,J., Hannett,N., Kanin,E. et al. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306–2309.
6. Iyer,V.R., Horak,C.E., Scafe,C.S., Botstein,D., Snyder,M. and Brown,P.O. (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409, 533–538.
7. Lieb,J.D., Liu,X., Botstein,D. and Brown,P.O. (2001) Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat. Genet., 28, 327–334.
8. Wei,C.L., Wu,Q., Vega,V.B., Chiu,K.P., Ng,P., Zhang,T., Shahab,A., Yong,H.C., Fu,Y., Weng,Z. et al. (2006) A global map of p53 transcription-factor binding sites in the human genome. Cell, 124, 207–219.
9. Johnson,D.S., Mortazavi,A., Myers,R.M. and Wold,B. (2007) Genome-wide mapping of in vivo protein–DNA interactions. Science, 316, 1497–1502.
10. Robertson,G., Hirst,M., Bainbridge,M., Bilenky,M., Zhao,Y., Zeng,T., Euskirchen,G., Bernier,B., Varhol,R., Delaney,A. et al. (2007) Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods, 4, 651–657.
11. Berger,M. and Bulyk,M. Universal protein binding microarrays for the comprehensive characterization of the DNA binding specificities of transcription factors. Nat. Protoc. (in press).
12. Philippakis,A.A., Qureshi,A.M., Berger,M.F. and Bulyk,M.L. (2008) Design of compact, universal DNA Microarrays for protein binding microarray experiments. (Presented at RECOMB 2007 conference) J. Comput. Biol., 15, 655–665.
13. Matys,V., Fricke,E., Geffers,R., Gossling,E., Haubrock,M., Hehl,R., Hornischer,K., Karas,D., Kel,A.E., Kel-Margoulis,O.V. et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.
14. Bryne,J.C., Valen,E., Tang,M.H., Marstrand,T., Winther,O., da Piedade,I., Krogh,A., Lenhard,B. and Sandelin,A. (2008) JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res., 36, D102–D106.
15. Oliphant,A.R., Brandl,C.J. and Struhl,K. (1989) Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell Biol., 9, 2944–2949.
16. Portales-Casamar,E., Kirov,S., Lim,J., Lithwick,S., Swanson,M.I., Ticoll,A., Snoddy,J. and Wasserman,W.W. (2007) PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol., 8, R207.
17. Roulet,E., Busso,S., Camargo,A.A., Simpson,A.J., Mermod,N. and Bucher,P. (2002) High-throughput SELEX SAGE method for quantitative modeling of transcription-factor binding sites. Nat. Biotechnol., 20, 831–835.
18. Berger,M.F., Badis,G., Gehrke,A.R., Talukder,S., Philippakis,A.A., Pena-Castillo,L., Alleyne,T.M., Mnaimneh,S., Botvinnik,O.B., Chan,E.T. et al. (2008) Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell, 133, 1266–1276.
19. Choi,Y., Qin,Y., Berger,M.F., Ballow,D.J., Bulyk,M.L. and Rajkovic,A. (2007) Microarray analyses of newborn mouse ovaries lacking Nobox. Biol. Reprod., 77, 312–319.
20. De Silva,E.K., Gehrke,A.R., Olszewski,K., Leon,I., Chahal,J.S., Bulyk,M.L. and Llinas,M. (2008) Specific DNA-binding by Apicomplexan AP2 transcription factors. Proc. Natl Acad. Sci. USA, 105, 8393–8398.
21. Pompeani,A.J., Irgon,J.J., Berger,M.F., Bulyk,M.L., Wingreen,N.S. and Bassler,B.L. (2008) The Vibrio harveyi master quorum-sensing regulator, LuxR, a TetR-type protein is both an activator and a repressor: DNA recognition and binding specificity at target promoters. Mol. Microbiol., 70, 76–88.
22. Grundy,W.N., Bailey,T.L., Elkan,C.P. and Baker,M.E. (1997) Meta-MEME: motif-based hidden Markov models of protein families. Comput. Appl. Biosci., 13, 397–406.
23. Crooks,G.E., Hon,G., Chandonia,J.M. and Brenner,S.E. (2004) WebLogo: a sequence logo generator. Genome Res., 14, 1188–1190.
24. Workman,C.T., Yin,Y., Corcoran,D.L., Ideker,T., Stormo,G.D. and Benos,P.V. (2005) enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res., 33, W389–W392.
25. Warner,J.B., Philippakis,A.A., Jaeger,S.A., He,F.S., Lin,J. and Bulyk,M.L. (2008) Systematic identification of mammalian regulatory motifs' target genes and functions. Nat. Methods, 5, 347–353.
26. Bulyk,M.L., Johnson,P.L. and Church,G.M. (2002) Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res., 30, 1255–1261.
27. Man,T.K. and Stormo,G.D. (2001) Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res., 29, 2471–2478.
28. Hoffmann,R. and Valencia,A. (2004) A gene network for navigating the literature. Nat. Genet., 36, 664.
29. Hong,E.L., Balakrishnan,R., Dong,Q., Christie,K.R., Park,J., Binkley,G., Costanzo,M.C., Dwight,S.S., Engel,S.R., Fisk,D.G. et al. (2008) Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res., 36, D577–D581.
30. Rogers,A., Antoshechkin,I., Bieri,T., Blasiar,D., Bastiani,C., Canaran,P., Chan,J., Chen,W.J., Davis,P., Fernandes,J. et al. (2008) WormBase 2007. Nucleic Acids Res., 36, D612–D617.
31. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35, D61–D65.
32. The UniProt Consortium. (2008) The universal protein resource (UniProt). Nucleic Acids Res., 36, D190–D195.
33. Gupta,S., Stamatoyannopoulos,J.A., Bailey,T.L. and Noble,W.S. (2007) Quantifying similarity between motifs. Genome Biol., 8, R24.
34. Huber,B.R. and Bulyk,M.L. (2006) Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data. BMC Bioinformatics, 7, 229.
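Supplementary sketch for the 8-mer counts mentioned in the INTRODUCTION. On double-stranded DNA a word and its reverse complement describe the same site, so the number of distinct contiguous 8-mers is (4^8 + 4^4)/2 = 32,896 rather than 4^8 = 65,536; the remainder of the more than 2.3 million 8-mers are gapped patterns. The short Python sketch below checks that arithmetic by enumeration. The function names are illustrative only, and whether the UniPROBE files collapse reverse complements in exactly this way is an assumption.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_nonredundant_kmers(k: int) -> int:
    """Count contiguous k-mers after merging reverse-complement pairs."""
    return len({canonical("".join(p)) for p in product("ACGT", repeat=k)})


if __name__ == "__main__":
    print(count_nonredundant_kmers(8))
```

For even k the count equals (4^k + 4^(k/2))/2, because the 4^(k/2) palindromic words are their own reverse complements.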
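Supplementary sketch for the enrichment scores described under DATABASE CONTENT. The text states only that the scores use a variant of the Wilcoxon–Mann–Whitney statistic and run from 0.5 to −0.5. The Python sketch below shows one simple way to get a score on that scale: compute the plain rank-sum statistic for probes that contain a k-mer (foreground) against all other probes (background), convert it to an area under the ROC curve, and subtract 0.5. This is an assumption-laden illustration, not the variant used to build the database; the function name and the input layout are hypothetical.

```python
def enrichment_score(foreground, background):
    """Rescaled Wilcoxon-Mann-Whitney statistic in [-0.5, 0.5].

    foreground: intensities of probes containing the k-mer
    background: intensities of all other probes
    (Plain rank-sum version; the published variant differs in detail.)
    """
    nf, nb = len(foreground), len(background)
    if nf == 0 or nb == 0:
        raise ValueError("both groups need at least one intensity")

    # Pool the intensities, remembering which group each came from.
    labelled = sorted(
        [(x, True) for x in foreground] + [(x, False) for x in background],
        key=lambda t: t[0],
    )

    # Sum the foreground ranks (1 = lowest), averaging ranks over ties.
    rank_sum_fg = 0.0
    i = 0
    while i < len(labelled):
        j = i
        while j < len(labelled) and labelled[j][0] == labelled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum_fg += avg_rank * sum(1 for t in labelled[i:j] if t[1])
        i = j

    # Mann-Whitney U for the foreground, scaled to an AUC in [0, 1].
    u = rank_sum_fg - nf * (nf + 1) / 2.0
    auc = u / (nf * nb)
    return auc - 0.5
```

A k-mer whose probes are all brighter than every background probe scores 0.5, one whose probes are all dimmer scores −0.5, and a k-mer with no preference scores near 0.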
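Supplementary sketch for the DNA scan described under SEARCH AND ANALYSIS TOOLS. The text describes an 8-bp sliding window that reports every TF whose enrichment score for the window exceeds a user-chosen threshold. The Python sketch below implements that idea under stated assumptions: the score tables are held as a dictionary of dictionaries (a hypothetical layout, not the database schema), a window matches a table entry in either orientation, and windows containing ambiguity codes are skipped. The function and variable names are illustrative only.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def scan_sequence(sequence, escores, threshold, k=8):
    """Slide a k-bp window along a sequence and collect binding-site hits.

    escores: {tf_name: {kmer: enrichment_score}} (assumed layout)
    Returns a list of (position, window, tf_name, score) tuples for
    every window whose score is strictly greater than the threshold.
    """
    sequence = sequence.upper()
    hits = []
    for pos in range(len(sequence) - k + 1):
        window = sequence[pos:pos + k]
        if set(window) - set("ACGT"):
            continue  # skip windows with N or other ambiguity codes
        rc = reverse_complement(window)
        for tf, table in escores.items():
            # A k-mer and its reverse complement are the same site.
            score = table.get(window, table.get(rc))
            if score is not None and score > threshold:
                hits.append((pos, window, tf, score))
    return hits


if __name__ == "__main__":
    # Toy score tables; the TF names and values are made up.
    toy = {"TF_A": {"TAATTAAC": 0.48}, "TF_B": {"GGGGCCCC": 0.30}}
    for hit in scan_sequence("ccTAATTAACgg", toy, threshold=0.45):
        print(hit)
```

Reporting the window position alongside the TF name gives the information needed for both a results table and a position plot along the input sequence.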
The Universal Protein Resource (UniProt)
Nucleic Acids Research, 33
Anthony Rogers, Igor Antoshechkin, Tamberlyn Bieri, Darin Blasiar, C. Bastiani, Payan Canaran, Juancarlos Chan, Wen Chen, Paul Davis, Jolene Fernandes, Tristan Fiedler, Michael Han, Todd Harris, Ranjana Kishore, R. Lee, Sheldon McKay, Hans-Michael Müller, Cecilia Nakamura, Philip Ozersky, Andrei Petcherski, Gary Schindelman, Erich Schwarz, William Spooner, Mary Tuli, Kimberly Auken, Daniel Wang, Xiaodong Wang, Gary Williams, Karen Yook, Richard Durbin, Lincoln Stein, John Spieth, Paul Sternberg (2007)
WormBase 2007
Nucleic Acids Research, 36
K. Pruitt, T. Tatusova, D. Maglott (2004)
NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins
Nucleic Acids Research, 33
Eurie Hong, R. Balakrishnan, Q. Dong, K. Christie, Julie Park, G. Binkley, M. Costanzo, S. Dwight, S. Engel, D. Fisk, J. Hirschman, B. Hitz, Cynthia Krieger, M. Livstone, S. Miyasato, R. Nash, R. Oughtred, M. Skrzypek, S. Weng, E. Wong, Kathy Zhu, K. Dolinski, D. Botstein, J. Cherry (2007)
Gene Ontology annotations at SGD: new data sources and annotation methods
Nucleic Acids Research, 36
M. Berger, M. Bulyk (2009)
Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors
Nature Protocols, 4
Erandi Silva, Andrew Gehrke, Kellen Olszewski, I. Leon, Jasdave Chahal, M. Bulyk, M. Llinás (2008)
Specific DNA-binding by Apicomplexan AP2 transcription factors
Proceedings of the National Academy of Sciences, 105
(2008)
The universal protein resource (UniProt)
Nucleic Acids Res, 36
R. Hoffmann, A. Valencia (2004)
A gene network for navigating the literature
Nature Genetics, 36
B. Huber, M. Bulyk (2006)
Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data
BMC Bioinformatics, 7
A. Pompeani, J. Irgon, M. Berger, M. Bulyk, N. Wingreen, B. Bassler (2008)
The Vibrio harveyi master quorum-sensing regulator, LuxR, a TetR-type protein is both an activator and a repressor: DNA recognition and binding specificity at target promoters
Molecular Microbiology, 70
A. Philippakis, A. Qureshi, M. Berger, M. Bulyk (2007)
Design of Compact, Universal DNA Microarrays for Protein Binding Microarray Experiments
Journal of computational biology : a journal of computational molecular cell biology, 15 7
Sonali Mukherjee, M. Berger, G. Jona, Xun Wang, D. Muzzey, M. Snyder, R. Young, M. Bulyk (2004)
Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays
Nature Genetics, 36
W. Grundy, T. Bailey, C. Elkan, M. Baker (1997)
Meta-MEME: motif-based hidden Markov models of protein families
Computer applications in the biosciences : CABIOS, 13 4
M. Berger, A. Philippakis, A. Qureshi, F. He, Preston Estep, M. Bulyk (2006)
Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities
Nature Biotechnology, 24

Publisher: Oxford University Press
Copyright: © 2008 The Author(s)
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/gkn660
pmid: 18842628

Abstract

Published online 8 October 2008. Nucleic Acids Research, 2009, Vol. 37, Database issue D77–D82. doi:10.1093/nar/gkn660

UniPROBE: an online database of protein binding microarray data on protein–DNA interactions

Daniel E. Newburger¹ and Martha L. Bulyk¹,²,³,*

¹Division of Genetics, Department of Medicine, ²Department of Pathology, Brigham and Women’s Hospital and Harvard Medical School and ³Harvard-MIT Division of Health Sciences and Technology (HST), Harvard Medical School, Boston, MA 02115, USA

Received August 15, 2008; Revised September 18, 2008; Accepted September 21, 2008

*To whom correspondence should be addressed. Tel: +1 617 525 4725; Fax: +1 617 525 4705; Email: [email protected]

© 2008 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The UniPROBE (Universal PBM Resource for Oligonucleotide Binding Evaluation) database hosts data generated by universal protein binding microarray (PBM) technology on the in vitro DNA-binding specificities of proteins. This initial release of the UniPROBE database provides a centralized resource for accessing comprehensive PBM data on the preferences of proteins for all possible sequence variants (‘words’) of length k (‘k-mers’), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In total, the database hosts DNA-binding data for over 175 nonredundant proteins from a diverse collection of organisms, including the prokaryote Vibrio harveyi, the eukaryotic malarial parasite Plasmodium falciparum, the parasitic Apicomplexan Cryptosporidium parvum, the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, mouse and human. Current web tools include a text-based search, a function for assessing motif similarity between user-entered data and database PWMs, and a function for locating putative binding sites along user-entered nucleotide sequences. The UniPROBE database is available at http://thebrain.bwh.harvard.edu/uniprobe/.

INTRODUCTION

The characterization of transcription factors’ (TFs’) DNA-binding specificities represents a critical step towards understanding the regulation of gene expression and elucidating the biophysical properties governing protein–DNA interactions. The study of DNA-binding specificities therefore has profound implications for the analysis and prediction of the regulatory networks that govern intracellular function, responses to external stimuli, differentiation and development in an organism. Despite recent advances in this field, the vast majority of TFs in most major model organisms and pathogens remain either uncharacterized or poorly described (1). The development of universal (2) protein binding microarray (PBM) technology (3) (Figure 1) offers a new avenue for the exploration of protein–DNA binding specificity. Universal PBMs provide an efficient and comprehensive method for in vitro interrogation of DNA-binding preferences. PBM technology complements other currently available technologies, such as chromatin immunoprecipitation coupled with either microarray readout (4–7) or high-throughput sequencing (8–10), that identify genomic regions bound in vivo.

Universal PBMs achieve comprehensive, high-resolution determination of proteins’ DNA-binding preferences by measuring the binding preferences of a protein over all possible k-mers of a given length (2,11). Currently employed custom array designs contain a set of 60-bp DNA probes that encompass all possible permutations of either 9 (Bulyk Lab, unpublished data) or 10 bp (12), depending upon the microarray design (2,12). In addition to covering all contiguous 9-mers or 10-mers, these array designs also offer an extensive set of gapped permutations that provide coverage of binding sites of greater length. Together, these data can be synthesized to produce high-confidence measurements of the relative preferences of a protein for all possible sequence variants belonging to a wide range of k-mer patterns typically found in TF binding site motifs (2,12). PBM enrichment scores from the PBM signal intensity data are typically calculated for each of the more than 2.3 million 8-mers (i.e. binding site ‘words’ with eight informative nucleotide positions, including all contiguous 8-mers and a large collection of gapped 8-mers). These 8-mers encompass the full affinity range of DNA binding preferences, from the most preferentially bound k-mers to low-affinity k-mers and nonspecifically associated sequences (2).

The TRANSFAC (13) and JASPAR (14) databases contain hundreds of matrices constructed from DNA binding site data from various data types; indeed, a given position weight matrix (PWM) in TRANSFAC frequently is derived from binding sequence data compiled from multiple experimental methods, which include lower throughput approaches such as gel retardation (i.e. electrophoretic mobility shift assays), DNase I footprinting, immunoprecipitation, supershift assays and methylation protection and higher throughput approaches such as in vitro selection (15) (SELEX). The PAZAR database (16) is a meta-database that contains TF binding site data. It contains PWMs from the JASPAR core database and in vivo TF binding site data and cis-regulatory module information from various sources, including other databases. A review of databases of cis-regulatory modules is beyond the scope of this article.

The universal PBM technology has several key advantages over in vitro selection approaches, such as SAGE–SELEX (17). SELEX does have the capability to interrogate sequences spanning a wide range of affinities, but it requires a significant increase in cost and labor to achieve the necessary depth of sequencing. Moreover, SELEX data have limited sensitivity because one cannot distinguish DNA binding site sequence variants missing from the collected data from those that are truly not bound by the given TF. In a survey of all SELEX datasets in the 2006 JASPAR database, we found that the median total number of binding site sequences in a JASPAR SELEX dataset is just 28, while the median number of nonredundant binding site sequences is just 11. In contrast, the DNA binding profiles obtained from our PBM experiments provide information on the direct binding preferences of a given protein over all k-mer DNA binding sequence variants, measured in vitro using the universal PBM technology; the number of sequence variants examined is limited only by the number of features on the microarray. In addition to the k-mer binding profiles, these procedures also provide DNA binding sequence PWMs derived from the k-mer data using our Seed-and-Wobble algorithm (2).

Figure 1. Universal PBM schema. Universal PBMs containing all possible 10-mers within 60-mer probes are first synthesized as single-stranded oligonucleotide arrays, to which a common primer is annealed and extended in order to biochemically convert the single-stranded array to a double-stranded DNA (dsDNA) array (these steps are not shown in the figure) (2). The dsDNA array is then bound by protein, stained with a fluorophore-conjugated antibody, and scanned; the quantified array data are then normalized by the relative amounts of DNA in each spot, and used to calculate k-mer binding data (2). PWMs can be calculated either from the k-mer binding data using our Seed-and-Wobble algorithm (2) or from the 60-mer probe data using other motif finding algorithms (34).

The UniPROBE database hosts the high-resolution DNA binding profiles obtained from PBM experiments on known and predicted TFs (2,3,18–21). The database currently contains DNA binding profiles for many proteins not included in similar databases such as JASPAR (14) and TRANSFAC (13), and it offers several tools for searching the database and analyzing user-defined binding profiles or DNA sequences. The resources and analysis tools offered by the UniPROBE database promise to facilitate previously untenable, downstream genomic analyses, and we anticipate that it will represent an important genomic resource as additional PBM data are compiled.

DATABASE CONSTRUCTION

The UniPROBE database is managed by a MySQL relational database that provides the back-end for user queries and facilitates the data retrieval necessary for the site’s analysis tools. All HTML pages are dynamically generated by PHP scripts hosted on an Apache server, and several JavaScript libraries provide interactive interfaces that facilitate site navigation and form accessibility. The Apache server also hosts all downloadable data, available by HTTP connection.

DATABASE CONTENT

The UniPROBE database hosts the results of PBM experiments, subsequent computational analyses performed on these data, and protein annotations. The site currently hosts PBM data for over 175 nonredundant proteins from a wide range of organisms, including the prokaryote Vibrio harveyi, the eukaryotic malarial parasite Plasmodium falciparum, the parasitic Apicomplexan Cryptosporidium parvum, the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, mouse and human (2,18,20,21). These data already encompass the majority of mouse homeodomain TFs and will soon include more than 100 additional mouse proteins (labs of Bulyk, M.L. and Hughes, T.R., unpublished results), nearly 90 additional S. cerevisiae proteins (labs of Bulyk, M.L. and LaBaer, J., unpublished results), and over 20 additional C. elegans proteins (labs of Bulyk, M.L. and Walhout, A.J., unpublished results). The UniPROBE database will additionally host data for Drosophila melanogaster TFs from ongoing projects in the Bulyk laboratory, and we anticipate the addition of several datasets from other laboratories using this microarray technology.

For each DNA-binding protein, the server holds several different data types, including: (i) unprocessed 60-mer probe signal intensity data; (ii) normalized probe intensities; (iii) TF-binding DNA profile representations and (iv) publication-specific data. The unprocessed (or ‘raw’) Agilent array data include information on probe position, 60-mer probe sequence, and Cy3 (DNA, from incorporated Cy3-dUTP) and Alexa 488 (protein, from Alexa 488 conjugated anti-GST antibody) signal intensities, the latter of which are necessary for accurately assessing relative DNA binding (2). The normalized probe intensities are derived from the raw data after adjusting for relative DNA concentrations at each spot and for spatial nonuniformities within the microarray. The in-house software used for normalization and subsequent binding profile generation will soon be available for download as the Universal PBM Analysis Suite (11).

The database offers several different binding profile representations for assessing TF specificity and affinity. First, we use our Seed-and-Wobble algorithm (2) to generate PWM motifs, which represent the observed probability of finding a given nucleotide in a given position within a TF’s DNA target site. PWMs currently serve as one of the primary methods for quantitative representation of DNA binding site motifs (13,14,22), and they are useful for creating graphical sequence logos (23,24) and for performing sequence analysis using any of several published software tools (22,25). Graphical sequence logos of the Seed-and-Wobble motifs, generated using the algorithms defined by enoLOGOS (24), are also present in the UniPROBE database.

Second, we provide two k-mer-based DNA binding profiles for each TF. The universal PBM designs facilitate k-mer binding profile construction because they allow for full coverage of all 8-mers of width 12 or less. The k-mer profiles have several advantages over the traditional PWM model. For example, comprehensive coverage of ungapped and gapped 8-bp sequence variants can provide insight into nucleotide interdependence within DNA binding site sequences, whereas mononucleotide independence is implicit in traditional PWMs (26,27). The database’s first k-mer-based binding profile consists of the median signal intensities and PBM enrichment scores associated with each contiguous 8-mer, where enrichment scores are calculated using a variant of the Wilcoxon–Mann–Whitney statistic and range from 0.5 for the most favored k-mers to −0.5 for the most disfavored k-mers. The second k-mer-based binding profile includes the top-scoring gapped 8-mer patterns (up to 10 positions) as determined by a 0.25 enrichment score cutoff; we used this threshold to avoid excessive file size for the gapped pattern profile and note that the Universal PBM Analysis Suite (11) can be used to generate full profiles for all gapped 8-mer patterns up to 12-nt positions in length.

In addition to these PBM data files, the UniPROBE database also provides relevant experimental information and factor-specific annotations. Experimental features include protein expression method, sequence and construct information (i.e. full-length protein or DNA-binding domain only). Factor annotations include functional descriptions and links to a variety of protein and gene reference databases. The website structure and interface section subsequently discusses these external annotations in further detail.

WEBSITE STRUCTURE AND INTERFACE

The Browse page of the UniPROBE database site presents a table containing each hosted TF, that protein’s structural class, and the publication with which the protein’s PBM data are associated. The entries are accompanied by brief descriptions of protein function retrieved from IHOP (28) or from a species-specific database such as SGD (29) or WormBase (30). The table then presents a link to a zipped file containing all factor-associated PBM data and a link to a Details page containing further factor annotations and a display of relevant features from the PBM experiments.

The Details page (Figure 2) described above has several components. The first section (Figure 2A) provides additional annotations for the factor of interest, including unique gene or protein accession numbers (if available as provided by the species-specific database), gene synonyms, DNA-binding domain amino acid sequence (if available) and links to databases such as IHOP (28), RefSeq (31), UniProt (32) and JASPAR (14). The second section (Figure 2B) displays the PWM and the matrix motif logo derived from Seed-and-Wobble analysis of PBM k-mer data. Although it does not directly display k-mer data, links to the data files containing complete k-mer data, normalized 60-mer probe data, and raw probe data are provided below the motif logo. The final section of the Details page (Figure 2C) displays a table that presents the experimental conditions and protein sequences used to produce each PBM dataset for the TF of interest.

Figure 2. Details Page for the Mus musculus TF Hdx. This page includes (A) gene and protein annotations for Hdx, (B) PBM-derived motif data for the factor and (C) PBM experimental information for the Hdx data.

In addition to the Browse and Details pages, several other pages facilitate the download of specific materials. The Downloads page distributes ZIP files containing all instances of specific data types (i.e. all PWMs or all raw data) in the entire database, along with ZIP files of the data associated with each given publication. The Downloads page also provides links to the core SQL tables used by the database and documentation for these tables. As an alternative method of accessing UniPROBE files, one can use the Apache HTTP Server index to browse for files or use the Explore page to view the directory structure in a Microsoft Explorer style interface.

Most pages in the database website also provide forms for performing text searches on the database and for interrogating PBM-derived binding profiles. The following section describes these tools in detail.

SEARCH AND ANALYSIS TOOLS

Several tools for conducting database searches and performing analyses on TF k-mer binding profiles (Figure 3) enhance the database’s utility. A simple search field in the top right corner of the site’s horizontal navigation bar provides a modified full-text MySQL search across species names, gene names, synonyms and annotations. Under the Advanced Search navigation bar, a customizable text search allows the user to enter multiple key words within specific database fields for a higher precision query. These terms may be linked by AND or OR Boolean operators by selecting the Match All or Match Any radio buttons, respectively (Figure 3A). The Browse page presents the match results (Figure 3B) for both search methods in the same table format used by the default Browse display, which provides basic annotations for each matching protein and links to download or view the associated PBM data.

The Advanced Search bar also contains two analysis tools available for online use. The first tool, which uses the Tomtom program (33) from the Meta-MEME suite (22), provides a platform for comparing standard motif representations against the PBM-derived PWMs in the database (Figure 3C). The user may enter or upload up to 20 binding site representations in one of the following formats: frequency matrix, count matrix, Meta-MEME 3.x motif or IUPAC motif. As additional options, the user can restrict the query to PWMs from a particular species, specify a maximum similarity threshold, set a minimum matrix overlap or choose the comparison algorithm. Available algorithms include Euclidean distance, Pearson correlation, Kullback–Leibler divergence and the Sandelin–Wasserman function, which are described in detail in the documentation for the Tomtom program (33). Upon query submission, the tool returns a table of statistics describing the best scoring alignment between a given pair of motifs, a Tomtom E-value (33) that quantifies their similarity and a graphical alignment of the two motifs’ logos (Figure 3D). This table also provides links to each matching factor’s Details page for further investigation of the PBM data.

Figure 3. Database search tools and formatted query results. Search options include (A) a text-based search, (B) a tool for comparing standard motif representations against PBM-derived motifs in the database using the Tomtom program (33) from the Meta-MEME suite (22) and (C) a tool for scanning FASTA-formatted nucleotide sequences for matches to TF 8-mer binding profiles in the UniPROBE database. The Database Browser table formats the search results for viewing by (D) highlighting text search term matches, (E) presenting a graphical view of motif alignment and (F) illustrating 8-mer binding site matches along the input sequence (x-axis).

The second analysis tool uses contiguous 8-mer PBM enrichment score data to scan user-supplied input DNA sequences for putative TF binding sites (Figure 3E). To perform this DNA scan, the user must first upload or enter a FASTA file containing up to 30 DNA sequences, each of a 10-kb maximum permissible length. The user then specifies the species of interest and an enrichment score threshold, and the tool scans the input DNA sequence using an 8-bp sliding window to detect whether any TFs from the species of interest have enrichment scores greater than the user’s threshold for that particular sequence. The website displays the results of the scan both as an HTML table available for plain-text download and as a simple graphic indicating binding site position and TF identity (Figure 3F). This tool may be useful not only for generating hypotheses about putative regulatory interactions but also for minimizing the unintentional creation of new binding sites for unrelated factors when designing site-directed mutagenesis experiments.

FUTURE DIRECTIONS

The upcoming publication of several large PBM datasets of yeast, fly and mouse TFs will contribute significantly to the breadth of coverage in the database. We encourage users of the database to register at http://thebrain.bwh.harvard.edu/uniprobe/register.php to receive updates concerning the addition of new datasets and changes to the database interface or analysis tools. We also encourage other labs generating universal PBM data to contact us by email if they wish to add their data to the UniPROBE database following the acceptance of their data for publication. The development of several additional tools may also enhance the website, and they may include a download manager, a local BLASTP function for identifying matches in our database to a user-specified query protein, and a DNA-binding preference prediction tool for user-specified query proteins.

AVAILABILITY AND LICENSE

All data hosted by the UniPROBE database are freely available for distribution at the database website. The sequences of the 60-mer DNA probes synthesized on our custom-designed universal arrays are available under the terms of the academic research use license described at http://thebrain.bwh.harvard.edu/uniprobe/academic-license.php. All pages have been tested under Firefox 2.0, Firefox 3.0 and Internet Explorer 7.

ACKNOWLEDGEMENTS

We thank A. Philippakis and F. S. He for the survey of SELEX datasets in the 2006 JASPAR database, I. Adzhubey for technical assistance, M. Berger for helpful discussions and S. Gisselbrecht, R. P. McCord, A. Gehrke, and A. Aboukhalil for critical reading of the article.

FUNDING

National Institutes of Health (R01 HG003985 to M.L.B.). Funding for open access charge: National Institutes of Health (R01 HG003985 to M.L.B.).

Conflict of interest statement. None declared.

REFERENCES

1. Bulyk,M.L. (2006) DNA microarray technologies for measuring protein–DNA interactions. Curr. Opin. Biotechnol., 17, 422–430.
2. Berger,M.F., Philippakis,A.A., Qureshi,A.M., He,F.S., Estep,P.W. 3rd and Bulyk,M.L. (2006) Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol., 24, 1429–1435.
3. Mukherjee,S., Berger,M.F., Jona,G., Wang,X.S., Muzzey,D., Snyder,M., Young,R.A. and Bulyk,M.L. (2004) Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat. Genet., 36, 1331–1339.
4. Reid,J.L., Iyer,V.R., Brown,P.O. and Struhl,K. (2000) Coordinate regulation of yeast ribosomal protein genes is associated with targeted recruitment of Esa1 histone acetylase. Mol. Cell, 6, 1297–1307.
5. Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., Simon,I., Zeitlinger,J., Schreiber,J., Hannett,N., Kanin,E. et al. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306–2309.
6. Iyer,V.R., Horak,C.E., Scafe,C.S., Botstein,D., Snyder,M. and Brown,P.O. (2001) Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature, 409, 533–538.
7. Lieb,J.D., Liu,X., Botstein,D. and Brown,P.O. (2001) Promoter-specific binding of Rap1 revealed by genome-wide maps of protein–DNA association. Nat. Genet., 28, 327–334.
8. Wei,C.L., Wu,Q., Vega,V.B., Chiu,K.P., Ng,P., Zhang,T., Shahab,A., Yong,H.C., Fu,Y., Weng,Z. et al. (2006) A global map of p53 transcription-factor binding sites in the human genome. Cell, 124, 207–219.
9. Johnson,D.S., Mortazavi,A., Myers,R.M. and Wold,B. (2007) Genome-wide mapping of in vivo protein–DNA interactions. Science, 316, 1497–1502.
10. Robertson,G., Hirst,M., Bainbridge,M., Bilenky,M., Zhao,Y., Zeng,T., Euskirchen,G., Bernier,B., Varhol,R., Delaney,A. et al. (2007) Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods, 4, 651–657.
11. Berger,M. and Bulyk,M. Universal protein binding microarrays for the comprehensive characterization of the DNA binding specificities of transcription factors. Nat. Protoc. (in press).
12. Philippakis,A.A., Qureshi,A.M., Berger,M.F. and Bulyk,M.L. (2008) Design of compact, universal DNA microarrays for protein binding microarray experiments. (Presented at RECOMB 2007 conference) J. Comput. Biol., 15, 655–665.
13. Matys,V., Fricke,E., Geffers,R., Gößling,E., Haubrock,M., Hehl,R., Hornischer,K., Karas,D., Kel,A.E., Kel-Margoulis,O.V. et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.
14. Bryne,J.C., Valen,E., Tang,M.H., Marstrand,T., Winther,O., da Piedade,I., Krogh,A., Lenhard,B. and Sandelin,A. (2008) JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update. Nucleic Acids Res., 36, D102–D106.
15. Oliphant,A.R., Brandl,C.J. and Struhl,K. (1989) Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell Biol., 9, 2944–2949.
16. Portales-Casamar,E., Kirov,S., Lim,J., Lithwick,S., Swanson,M.I., Ticoll,A., Snoddy,J. and Wasserman,W.W. (2007) PAZAR: a framework for collection and dissemination of cis-regulatory sequence annotation. Genome Biol., 8, R207.
17. Roulet,E., Busso,S., Camargo,A.A., Simpson,A.J., Mermod,N. and Bucher,P. (2002) High-throughput SELEX–SAGE method for quantitative modeling of transcription-factor binding sites. Nat. Biotechnol., 20, 831–835.
18. Berger,M.F., Badis,G., Gehrke,A.R., Talukder,S., Philippakis,A.A., Peña-Castillo,L., Alleyne,T.M., Mnaimneh,S., Botvinnik,O.B., Chan,E.T. et al. (2008) Variation in homeodomain DNA binding revealed by high-resolution analysis of sequence preferences. Cell, 133, 1266–1276.
19. Choi,Y., Qin,Y., Berger,M.F., Ballow,D.J., Bulyk,M.L. and Rajkovic,A. (2007) Microarray analyses of newborn mouse ovaries lacking Nobox. Biol. Reprod., 77, 312–319.
20. De Silva,E.K., Gehrke,A.R., Olszewski,K., Leon,I., Chahal,J.S., Bulyk,M.L. and Llinás,M. (2008) Specific DNA-binding by Apicomplexan AP2 transcription factors. Proc. Natl Acad. Sci. USA, 105, 8393–8398.
21. Pompeani,A.J., Irgon,J.J., Berger,M.F., Bulyk,M.L., Wingreen,N.S. and Bassler,B.L. (2008) The Vibrio harveyi master quorum-sensing regulator, LuxR, a TetR-type protein is both an activator and a repressor: DNA recognition and binding specificity at target promoters. Mol. Microbiol., 70, 76–88.
22. Grundy,W.N., Bailey,T.L., Elkan,C.P. and Baker,M.E. (1997) Meta-MEME: motif-based hidden Markov models of protein families. Comput. Appl. Biosci., 13, 397–406.
23. Crooks,G.E., Hon,G., Chandonia,J.M. and Brenner,S.E. (2004) WebLogo: a sequence logo generator. Genome Res., 14, 1188–1190.
24. Workman,C.T., Yin,Y., Corcoran,D.L., Ideker,T., Stormo,G.D. and Benos,P.V. (2005) enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res., 33, W389–W392.
25. Warner,J.B., Philippakis,A.A., Jaeger,S.A., He,F.S., Lin,J. and Bulyk,M.L. (2008) Systematic identification of mammalian regulatory motifs’ target genes and functions. Nat. Methods, 5, 347–353.
26. Bulyk,M.L., Johnson,P.L. and Church,G.M. (2002) Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors. Nucleic Acids Res., 30, 1255–1261.
27. Man,T.K. and Stormo,G.D. (2001) Non-independence of Mnt repressor-operator interaction determined by a new quantitative multiple fluorescence relative affinity (QuMFRA) assay. Nucleic Acids Res., 29, 2471–2478.
28. Hoffmann,R. and Valencia,A. (2004) A gene network for navigating the literature. Nat. Genet., 36, 664.
29. Hong,E.L., Balakrishnan,R., Dong,Q., Christie,K.R., Park,J., Binkley,G., Costanzo,M.C., Dwight,S.S., Engel,S.R., Fisk,D.G. et al. (2008) Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res., 36, D577–D581.
30. Rogers,A., Antoshechkin,I., Bieri,T., Blasiar,D., Bastiani,C., Canaran,P., Chan,J., Chen,W.J., Davis,P., Fernandes,J. et al. (2008) WormBase 2007. Nucleic Acids Res., 36, D612–D617.
31. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35, D61–D65.
32. The UniProt Consortium. (2008) The universal protein resource (UniProt). Nucleic Acids Res., 36, D190–D195.
33. Gupta,S., Stamatoyannopoulos,J.A., Bailey,T.L. and Noble,W.S. (2007) Quantifying similarity between motifs. Genome Biol., 8, R24.
34. Huber,B.R. and Bulyk,M.L. (2006) Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data. BMC Bioinformatics, 7, 229.

Journal

Nucleic Acids Research – Oxford University Press

Published: Jan 7, 2009
