Unveiling combinatorial regulation through the combination of ChIP information and in silico cis-regulatory module detection

Hong Sun; Tias Guns; Ana Carolina Fierro; Lieven Thorrez; Siegfried Nijssen; Kathleen Marchal

doi:10.1093/nar/gks237

Nucleic Acids Research, 2012, Vol. 40, No. 12, e90 (2012-07-15). Published online 15 March 2012.

Hong Sun (1,2,3), Tias Guns (4), Ana Carolina Fierro (1), Lieven Thorrez (2,5), Siegfried Nijssen (4) and Kathleen Marchal (1,6,*)

(1) Department of Microbial and Molecular Systems, (2) Department of Electrical Engineering, Katholieke Universiteit Leuven, (3) IBBT-K.U.Leuven Future Health Department, Kasteelpark Arenberg 10, box 2446, (4) Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, (5) Interdepartmental Stem Cell Institute, Katholieke Universiteit Leuven, O&N I Herestraat 49, 3000 Leuven and (6) Department of Plant Biotechnology and Bioinformatics, Ghent University, VIB, Technologiepark 927, 9052 Gent, Belgium

Received April 19, 2011; Revised February 28, 2012; Accepted February 29, 2012

*To whom correspondence should be addressed. Tel: +32 486909943; Fax: +32 16 321963; Email: [email protected]

© The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Computationally retrieving biologically relevant cis-regulatory modules (CRMs) is not straightforward. Because of the large number of candidates and the imperfection of the screening methods, many spurious CRMs are detected that score as high as the biologically true ones. Using ChIP-information not only reduces the regions in which the binding sites of the assayed transcription factor (TF) should be located, but also allows restricting the valid CRMs to those that contain the assayed TF (here referred to as applying CRM detection in a query-based mode). In this study, we show that exploiting ChIP-information in a query-based way makes in silico CRM detection a much more feasible endeavor. To be able to handle the large datasets, the query-based setting and other specificities proper to CRM detection on ChIP-Seq based data, we developed a novel powerful CRM detection method 'CPModule'. By applying it on a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells, we demonstrate how our tool can recover combinatorial regulation of five known TFs that are key in the self-renewal of mouse embryonic stem cells. Additionally, we make a number of new predictions on combinatorial regulation of these five key TFs with other TFs documented in TRANSFAC.

INTRODUCTION

In eukaryotes, transcriptional regulation is mediated by the concerted action of different transcription factors (TFs) (1). Searching for cis-acting regulatory modules (CRMs), or combinations of motifs that often co-occur in a set of coregulated sequences, helps in unraveling the mode of combinatorial regulation. CRM detection is customarily applied on a set of intergenic regions located upstream of coexpressed genes; such genes are for example identified by microarray experiments. Except for de novo methods (2,3), most CRM detection methods rely on a motif screening step. In this step, all sites that match given motifs of TFs are located in the selected sequences (4,5). Subsequently, a combinatorial search is performed to identify a set of motifs (called a CRM) that occur frequently in the given set of intergenic sequences. Usually, a score is assigned to each of the obtained CRMs that assesses their statistical significance in a set of background sequences (4,5). Some methods apply a highly structured definition in which a CRM consists of combinations of motifs that need to occur in a specific order, with specific orientations, and within certain distances (6–8). Although the overrepresentation of a structured CRM in a gene set is likely to be biologically relevant (9,10), the degree to which biologically relevant CRMs are structured is still largely unknown (11). Therefore, most methods rely on less constrained CRM models, hereafter referred to as unstructured CRM detection methods (4,5,12–16).

The combinatorial search underlying CRM detection is highly complex. Adding to this complexity is the fact that often the regions containing the binding sites are not well delineated [e.g. when starting from a coexpressed gene set, the intergenic sequences typically range several 1000s of base pairs upstream of the transcription start site (TSS)]. Such long intergenic sequences reduce the signal/noise ratio to such an extent that in silico CRM detection becomes ineffective. Hence the search for combinatorial regulation is often limited to the proximal promoter region, whereas at least some of the sites responsible for the observed expression behavior might also be located in regions distant from the TSS of the given genes (for instance in enhancers). Nowadays chromatin-immunoprecipitation (ChIP)-based techniques are becoming increasingly popular for the genome-wide identification of TF binding sites (17,18). Such techniques make it feasible to locate, at least for the assayed TF, the approximate binding regions. Using ChIP-bound sequences thus largely reduces the regions in which the binding sites of the assayed TF should be located (typically 500 bp instead of thousands of bp) (19,20) and does not restrict the search for CRMs to the proximal promoter region (7,21–23).

In addition to these obvious advantages, using ChIP-information also allows for a query-based search strategy. Instead of searching for all possible CRMs in the input set, as is traditionally done, we can limit our search to those CRMs that contain the ChIP-assayed TF (7,8). Incorporating knowledge of the assayed TF during CRM detection in such a query-based way allows predicting complex CRMs in which the assayed TF is involved. In this work, we studied the extent to which using ChIP-derived information can help in increasing the performance of CRM detection, compared to the application of CRM detection in a traditional non-query-based setting. To this end, we developed a novel powerful CRM detection method 'CPModule', an unstructured CRM detection method based on a constraint programming for itemset mining framework (24). Besides handling the specific challenges of CRM detection on ChIP-Seq based data, CPModule can be used in both a query-based and a non-query-based mode. Its exhaustive search strategy allows assessing the total number of valid CRMs that are present in the input set and the degree to which a CRM of interest gets prioritized among the total number of candidates. Applying CPModule on a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells (25) showed that using a query-based setting is in most cases the only sensible way to perform CRM detection. Besides recovering well described benchmark CRMs, we also make several novel predictions on the combinatorial regulation of the five key regulators involved in the process of self-renewal with other TFs documented in TRANSFAC (26).

MATERIALS AND METHODS

Motif screening and filtering (Figure 1A)

Motif screening

The motif models used for screening are position weight matrices (PWMs) from TRANSFAC (26). To remove redundancy among the PWMs, each of them was compared to all the other PWMs using the program MotifComparison (27). MotifComparison calculates the Kullback–Leibler distance between matrices; PWMs showing a mutual distance <0.1 were grouped. All PWMs in a group were removed except for one representative. This resulted in a final list of 516 PWMs. Motif screening was performed with the tool Clover (28). Given a PWM of W nucleotides and a sequence, it calculates a score for each subsequence of length W. A threshold on the score determines whether the subsequence is considered a potential binding site of the motif, also called a hit or a motif site. The default threshold of 6.0 was used to define a stringent screening, resulting in few hits per motif and sequence on average, while a threshold of 3.0 corresponds to a non-stringent screening resulting in many more hits.

Filtering based on nucleosome occupancy

We filtered potential motif sites by removing sites that exceed a given value for the nucleosome occupancy score (NuOS). To calculate the nucleosome occupancy score of a potential motif site, we first assigned a nucleosome occupancy probability to each base pair position, using the prediction model 'NuPoP' of Xi et al. 2010 (29). The nucleosome occupancy score was then calculated as the geometric mean of the nucleosome occupancy probabilities at all positions of the potential motif site (30).

To determine the optimal threshold values for the NuOS, we tested the effect of different filtering thresholds on their ability to: (i) reduce the number of motif site predictions per region and per TF, while (ii) not compromising too much the sensitivity of recovering true binding sites (see Supplementary File S1). Based on these tests, predicted motif sites located in regions with a low probability of nucleosome occupancy (<10%) were retained when filtering based on the NuOS was used.

CRM detection using constraint programming for itemset mining (Figure 1B)

CPModule uses as input the motif sites located in the input sequences by motif screening. The result of the screening and filtering step is, for each motif $M$ and sequence $S$, a set of motif hits $MH(M,S) = \{(l_1,r_1), \ldots, (l_n,r_n)\}$. Motif $M$ has a hit at $(l_i,r_i)$ with $1 \le l_i \le r_i \le |S|$; here $(l_i,r_i)$ is an interval between start position $l_i$ and stop position $r_i$ on the sequence $S$, and $|S|$ corresponds to the length of sequence $S$.

The combinatorial search problem of finding a motif set is solved by using the constraint programming (CP) for itemset mining framework (24). The core of CPModule enumerates all possible motif sets, where a motif set $\mathcal{M} = \{M_1, \ldots, M_n\}$ is defined as a subset of all screened motifs. A CRM is a motif set $\mathcal{M}$ that is valid given a set of domain-specific constraints (more details provided below): (i) the frequency constraint, which requires that the motif set occurs in a predefined minimal number of sequences $S$ from the input set; (ii) the proximity constraint, which requires that hits of motifs in a set should occur in each other's proximity. The maximal distance of the region in which hits should co-occur is specified by the user and controls the level of proximity; (iii) the redundancy constraint, which requires that the motif set $\mathcal{M}$ must be non-redundant with respect to its supersets. Optionally, (iv) a query constraint ensures that a given query motif must be part of the motif set found (Figure 1B).

Figure 1. CPModule analysis flow. (A) The input consists of a library of PWMs and a set of sequences.
In the first step, prior to the actual CRM detection, a screening with public motif databases is performed. Here, we combine standard PWM screening with filtering based on nucleosome occupancy. Motif sites displaying a high nucleosome occupancy are filtered out (indicated as the transparent shapes in A). (B) The second step consists of the actual combinatorial search. Here, we use a constraint programming for itemset mining approach to enumerate all valid motif sets, i.e. combinations of motifs (i) that occur frequently in the input set (frequency constraint). Only valid motif sets will be considered (indicated in regular boxes), while invalid ones will not (indicated in dashed boxes); (ii) of which the motif sites contributing to the motif set occur in each other's proximity (proximity constraint). Only valid motif sets will be considered (indicated in regular boxes), while invalid ones will not (indicated in dashed boxes); (iii) that are non-redundant (redundancy constraint). The motif sets in the dashed box are redundant with the motif set in the regular box and will not be considered; (iv) that contain a query-motif (query-based constraint), which corresponds in this work to the motif of the ChIP-assayed TF. Valid motif sets are indicated in regular boxes. (C) Valid motif sets or CRMs are finally assigned a P-value that expresses their specificity for the input set.

More formally, a motif set $\mathcal{M} = \{M_1, \ldots, M_n\}$ will be considered as a potential CRM in a sequence $S$ if and only if its set of hit regions (HR) on that sequence $S$ is not empty, where the set of hit regions consists of those ranges $(l, l+\delta)$ of base pairs in which all selected motifs $M$ are present, i.e. they have at least one hit with interval $(l',r')$ in that range ($\delta$ denotes the user-specified maximal distance):

$$HR(\mathcal{M},S) = \{(l, l+\delta) : 1 \le l \le |S|,\ \forall M \in \mathcal{M} : \exists (l',r') \in MH(M,S) : l \le l' < r' \le l+\delta\} \tag{1}$$

Given a set of sequences $\mathcal{S}$, the subset of sequences in which the set of hit regions is not empty is denoted by $\varphi(\mathcal{M},\mathcal{S})$.

To find the motif sets that fulfill the above requirements, we use a general and principled approach based on 'constraint programming'. In constraint programming, problems are modeled as constraint satisfaction problems in terms of constraints on variables. The constraint specification is then solved by a general purpose, yet highly efficient algorithm. Consequently, we need to encode our problem as a constraint satisfaction problem. We do so as follows. Motif sets are represented by Boolean variables: there is a Boolean variable $\tilde{M}_i$ for every possible motif, indicating whether this motif is part of the motif set $\mathcal{M}$. If a certain $\tilde{M}_i = 1$, then we say that the motif is in the motif set; otherwise the motif is not in the set. Furthermore, we have a Boolean variable $\tilde{S}_j$ for every genomic sequence, indicating whether the motif set will be considered as a potential CRM in that sequence, i.e. whether $S_j \in \varphi(\mathcal{M},\mathcal{S})$. Lastly, we define a Boolean variable $\widetilde{\mathrm{seqM}}_{ij}$ for every motif $i$ and every sequence $j$. The variable $\widetilde{\mathrm{seqM}}_{ij}$ indicates whether motif $M_i$ is in the proximity of all motifs in motif set $\mathcal{M}$ on sequence $j$.

The 'constraints' are imposed on these variables as follows:

Proximity constraint

The proximity constraint couples the $\widetilde{\mathrm{seqM}}_{ij}$ variables to the variables representing the motifs. Formally, we define the $\widetilde{\mathrm{seqM}}_{ij}$ variables as follows for every motif on every genomic sequence:

$$\forall i,j : \widetilde{\mathrm{seqM}}_{ij} = 1 \leftrightarrow \big(\exists (l,r) \in HR(\mathcal{M},S_j) : \exists (l',r') \in MH(M_i,S_j) : l \le l' \le r' \le r\big) \tag{2}$$

In other words, if in a particular genomic sequence a hit of a particular motif is within a hit region (HR) of the motif set, this motif's variable for that sequence must be 1. Observe that $\widetilde{\mathrm{seqM}}_{ij} = 1$ will hold for all motifs in the motif set $\mathcal{M}$, for all sequences that are in $\varphi(\mathcal{M},\mathcal{S})$; however, there may be additional motifs that have hits in the proximity of regions in $HR(\mathcal{M},S_j)$.

Frequency constraint

The constraint that imposes a minimum size on $\varphi(\mathcal{M},\mathcal{S})$ is formalized as $\sum_j \tilde{S}_j \ge \text{min frequency}$. Here, the $\tilde{S}_j$ variables are defined in terms of the $\widetilde{\mathrm{seqM}}_{ij}$ variables, so as to ensure that sequences are only counted ($\tilde{S}_j = 1$) if all selected motifs ($\forall i : \tilde{M}_i = 1$) occur within each other's proximity in that sequence ($\widetilde{\mathrm{seqM}}_{ij} = 1$):

$$\forall j : \tilde{S}_j = 1 \leftrightarrow \big(\forall i : \tilde{M}_i = 1 \rightarrow \widetilde{\mathrm{seqM}}_{ij} = 1\big) \tag{3}$$

Redundancy constraint

The redundancy constraint requires that we cannot add a motif to a motif set without losing one sequence in its corresponding sequence set $\varphi(\mathcal{M},\mathcal{S})$. This can be enforced as follows on the Boolean variables:

$$\forall i : \big(\forall j : \tilde{S}_j = 1 \rightarrow \widetilde{\mathrm{seqM}}_{ij} = 1\big) \rightarrow \tilde{M}_i = 1, \tag{4}$$

stating that a motif must be part of the set ($\tilde{M}_i = 1$) if on all selected sequences ($\forall j : \tilde{S}_j = 1$) the motif is within the proximity of the others ($\widetilde{\mathrm{seqM}}_{ij} = 1$).

Query-based constraint

The query-based constraint requires that each motif set contains at least a given motif. This can be enforced by requiring that the corresponding Boolean variable $\tilde{M}_i$ satisfies the constraint $\tilde{M}_i = 1$. Note that the proximity constraint will ensure that only motif sets with hits close enough to this given motif will be considered.

This combination of constraints is solved by a constraint programming system by means of a depth-first backtracking search. The search strategy alternates between branching, in which a variable is assigned a value from its domain (Boolean value), and propagation, the process of using a constraint to remove values from the domain of variables. The search strategy is similar to strategies that have been used in itemset mining (24). The main difference with traditional itemset mining is the inclusion of proximity constraints and the inclusion of a redundancy constraint for this type of data. The advantage of using an existing CP system (31) is that additional constraints can be added in a modular and straightforward way, preventing the reimplementation of the itemset mining strategy from scratch. For more details on the implementation of the different constraints we refer to ref. (32).

Genome-wide enrichment score calculation and ranking (Figure 1C)

To assess the significance of the found motif sets (CRMs), we calculate for each an enrichment score (P-value), adapting the strategy proposed in Gallo et al. (33). The main modification is that we use a set of background sequences in the calculation of this score. These background sequences are used to estimate the proportion $p$ of sequences in the whole genome that contain the motif set. We compare the number of observed input sequences that contain a particular motif set, $|\varphi(\mathcal{M},\mathcal{S})|$, with the number of sequences that is expected to contain this motif set. The latter is estimated by counting the number of background sequences containing the motif set, $|\varphi(\mathcal{M},\mathcal{S}_{background})|$. This set is obtained after applying exactly the same screening and filtering strategy on the background sequences as was applied on the input sequences. Based on this estimate of the probability to observe a valid motif set in a random set, we calculate a genome-wide enrichment score (P-value) by means of a cumulative binomial distribution:

$$P\text{-value}(\mathcal{M}) = \sum_{i=|\varphi(\mathcal{M},\mathcal{S})|}^{|\mathcal{S}|} \binom{|\mathcal{S}|}{i}\, p^{i} (1-p)^{|\mathcal{S}|-i} \tag{5}$$

where $p = |\varphi(\mathcal{M},\mathcal{S}_{background})| / |\mathcal{S}_{background}|$; $\mathcal{S}$ is the set of input sequences; $|\mathcal{S}|$ is the number of input sequences; $\mathcal{S}_{background}$ is the set of background sequences.

The background sequences were derived by sampling a large number of intergenic sequences from the mouse genome [Version mm9, NCBI Build 37, UCSC database (34)] (2000 background sequences for the synthetic data, 5000 background sequences for each of the ChIP-Seq assays). To exclude the possibility that the composition of the background set would influence the estimated background occurrences of the CRMs, we compiled background sets consisting of either putative promoter sequences, that is sequences located upstream of a gene's transcription start site, or sequences located in putative enhancer regions, that is sequences corresponding to regions bound by the enhancer binding factor CTCF [downloaded from ENCODE (35)]. In our experiments, the composition of the background sets did not influence our final ranking. All results presented in the article use a background set based on proximal promoter regions.

When dealing with ChIP-Seq data, where each sequence is a region around an assayed transcription factor site, we selected the background sequences such that each sequence contains at least one motif site of the assayed TF (which does not overlap with a ChIP-bound region). In this setting, the number of background sequences that qualifies is variable for each data set. To have an equal number of sequences for each background set, we randomly sampled for each data set 5000 sequences from the set of qualifying background sequences (5310 was the maximal number of sequences that could be obtained for the data set with the smallest cognate background set). Note that with this strategy we approximate the P-value calculation in a conservative way, as we cannot exclude that the background contains sequences with true sites of the assessed TF that remained unbound under the assessed conditions.

The motif sets are ranked according to their enrichment score [P-value($\mathcal{M}$), Equation (5)]. Note that with this ranking, for two redundant motif sets that occur in the same sequences, the smaller motif set will never score better than the larger one, i.e. if $\mathcal{M}_1 \subset \mathcal{M}_2$ with $\varphi(\mathcal{M}_1,\mathcal{S}) = \varphi(\mathcal{M}_2,\mathcal{S})$, motif set $\mathcal{M}_2$ will never score worse than $\mathcal{M}_1$. This motivates the use of the redundancy constraint during the combinatorial search, as it removes from a set of redundant motif sets only the least interesting sets, which would get a very low rank in any case.

Benchmarking on synthetic data

We first use a synthetic data set to compare CPModule with different CRM detection tools. The synthetic data is retrieved from Xie et al. (36). The data consists of 22 genomic sequences, each 1000 base pairs in length. In 20 sequences, sites sampled from the TRANSFAC PWMs of OCT4, SOX2 and FOXD3, respectively, were inserted in a region of at most 164 bp (so the CRM encompasses maximally 164 bp). Each PWM was sampled three times per sequence. The last two sequences had no sites inserted.

CPModule was compared with related tools for unstructured CRM detection such as ModuleSearcher, obtained from the author (37) on 6 July 2008; Compo, obtained from the author (14) on 14 August 2010; Cister and Cluster-Buster, downloaded from the authors' website (15,16) on 22 June 2010 and 21 June 2010, respectively; all methods were given the non-redundant motif list described in section 'Motif screening'.

CPModule was run with a frequency threshold of 60% and a proximity threshold of 165 bp (as this was the maximum distance used when generating the data). Nevertheless, CPModule was shown not to be very sensitive to the exact value of the proximity threshold (see Supplementary File S2). For the other CRM detection tools, we similarly used the best parameter values according to the characteristics of the synthetic data (length of the sequences, the distance between two insertion sites, and the maximum size of CRM) and default values otherwise. Supplementary Table S1 lists the non-default parameters for the used tools.

We evaluated the performance of the different CRM tools using the motif (mCC) and nucleotide correlation coefficients (nCC) (5). At the motif level (mCC), a predicted motif for a sequence is a true positive (TP) if that motif was indeed part of the CRM on that sequence, otherwise it is a false positive (FP). If a motif was not predicted to belong to a CRM on that sequence, although it should have been according to the benchmark, it is counted as a false negative (FN), otherwise as a true negative (TN). As the motif level evaluation does not take the predicted sites into account, a solution is also evaluated at the nucleotide level (nCC): for every nucleotide we verify whether it was predicted to be part of the CRM and whether it should have been predicted or not, again resulting in TP, FP, FN and TN counts. These counts are aggregated over all sequences to obtain the total counts of the predicted CRM. Based on these counts, a correlation coefficient (CC) is defined as follows:

$$CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP+FN)(TN+FP)(TP+FP)(TN+FN)}} \tag{6}$$

The value of this coefficient ranges from −1 to +1. A score of +1 indicates that a prediction corresponds to the correct answer. Random predictions will generally result in CC values close to zero. Ideally, a CRM should score well at both the motif and nucleotide level.

CRM detection on real data

The real data set was derived from genome-wide ChIP data obtained with DNA sequencing (ChIP-Seq) for the KLF4, NANOG, OCT4, SOX2 and STAT3 transcription factors, as described by Chen et al. (25). For each transcription factor, the input set consists of 100 sequences, each corresponding to 500 bp centered around one of the top 100 ChIP-binding peaks of the assayed TF. Binding peaks were taken from the GEO file GSE11431 (38).

Screening is performed using the 516 TRANSFAC PWMs described above (section 'Motif screening'). However, as a KLF4 PWM was missing in TRANSFAC, we added to our list the PWM described by Whitington et al. (39). Whitington et al. (39) constructed the KLF4 PWM using de novo motif detection on a set of sequences involved in the development of mouse embryonic stem cells, derived from a ChIP-chip experiment by Jiang et al. (40) independent from the one used in this study.

We applied our method using three different screening results: (i) high-stringency screening; (ii) low-stringency screening; and (iii) low-stringency screening in combination with NuOS filtering. Proximity thresholds for CPModule were varied stepwise as mentioned below (see 'Results' section). The frequency threshold was set at 60% unless mentioned otherwise.

RESULTS

CPModule: CRM detection based on constraint programming for itemset mining

Algorithmic design

In contrast to what is often assumed for CRM detection on coexpressed genes, we do not expect that all sequences derived from a ChIP-Seq experiment contain the same CRM. Indeed, in the same list of ChIP-bound sequences, different CRMs might be present depending on which other TFs are needed next to the ChIP-assayed one to mediate coregulation of certain subsets of the genes in the list.

Since the same CRM need not be present in all sequences, one cannot simply take the intersection of motifs that appear in the sequences. Instead, all possible combinations of motifs have to be considered as candidate CRMs and validated against the data. This makes CRM detection a combinatorially and computationally hard problem. Furthermore, it is not known in advance which TFs are part of the CRM, hence one would like to consider all TFs having a known motif model. Using a large number of candidate motifs makes the problem more difficult, as 100 candidate motifs means there are $2^{100}$ candidate motif sets; the problem is even more severe when using the entire TRANSFAC database, which contains more than 500 motif models. Additionally, ChIP-Seq derived datasets are large in the number of sequences (usually hundreds of peaks are detected for the assayed TF). For each candidate motif set, all these sequences will have to be processed. The fact that each motif can have multiple hits on a single sequence complicates things even further. Despite the fact that several methods for CRM detection have been developed in the past (4), the aforementioned computational issues are still challenging for CRM tools that find complex CRMs consisting of an arbitrary number of TFs.

To deal with these computational issues, we developed CPModule, a CRM detection method based on constraint programming (CP) for itemset mining (24) (for a full description of the method see 'Materials and Methods' section). CPModule searches for combinations of motifs (cis-acting regulatory modules) that are sufficiently specific for a given input set. Using a library of PWMs, a set of coregulated sequences is screened and filtered to obtain a list of hits per motif and sequence (Figure 1A). CPModule then uses this input list to enumerate all possible combinations of motifs (motif sets) that meet a set of predefined constraints (Figure 1B). These constraints define what we consider biologically relevant CRMs. First, a CRM should occur in a minimal number of input sequences (frequency constraint, Figure 1B) to be sufficiently specific for the input set, but it does not necessarily have to cover all sequences. Second, we assume that motif sites of a CRM are more likely to reflect true combinatorial regulation when they occur in each other's proximity on a single sequence than when they are scattered over long genomic distances. Therefore, the motif sites composing a CRM should occur within a maximal genomic distance from each other (proximity constraint, Figure 1B). This is not guaranteed to always be the case, which is compensated by the frequency constraint that does not require all sequences to contain the CRM. Because motifs can have multiple binding sites on the same sequence, we observed that several of the CRMs found were redundant. We consider a found CRM to be redundant to another one if it contains a subset of the motifs of the other CRM and occurs in exactly the same sequences. In this case, the smaller CRM will have a lower statistical score anyway; hence we can discard such CRMs from consideration. To make the search more effective, we avoid redundant CRMs during search (redundancy constraint, Figure 1B). Finally, in ChIP-Seq assayed data, one can use the fact that at least the assayed TF should be part of the CRM. For this purpose, we add a query-based option in which a 'query motif' can be provided that has to be part of the CRM. Using this knowledge before and during search (query-based constraint, Figure 1B) can make the problem computationally much more feasible.

When using the constraint programming for itemset mining framework with the above constraints, the result is a large collection of CRMs found to fulfill those constraints. We explicitly chose not to statistically evaluate the CRMs during search as this can quickly become computationally intensive, especially when evaluating the specificity of a CRM using a large collection of background sequences. Instead, we calculate an enrichment score (P-value) in a post-processing step and rank the CRMs according to this score (see Figure 1C and 'Materials and Methods' section). The added benefit is that our system does not return one CRM as being the most significant CRM; rather, it returns an ordered list which a domain expert can choose from.

Benchmarking CPModule

Before applying our method on a real ChIP-Seq data set we compared its performance with that of a number of well performing CRM tools, namely ModuleSearcher (37), Compo (14), Cister (16) and Cluster-Buster (15).

Cister (16) and Cluster-Buster (15) are representative single-sequence tools. They scan each sequence individually, searching for potential CRMs that best match a predefined structure as imposed by model parameters (here a hidden Markov model). In contrast to multiple-sequence tools, such as CPModule, they do not explicitly test whether the detected CRMs are specific for the input set as a whole. Nevertheless, these tools are very good at detecting CRMs in individual sequences. Because they treat sequences individually, they are computationally more efficient than multiple-sequence tools. However, they cannot take advantage of the fact that sequences are coregulated.

Like CPModule, ModuleSearcher (37) and Compo (14) are multiple-sequence tools. Both of them come with their own motif screening tool. ModuleSearcher searches for motif sets by using a genetic algorithm, a heuristic search method that maintains a pool of solutions which are modified to find ever better solutions. This type of search is more ad hoc and gives no guarantees that the best CRMs are found. Compo on the other hand uses techniques from itemset mining, as does CPModule. However, they differ in a number of ways: Compo is a specialized algorithm while CPModule uses a generic constraint-based methodology, which allows incorporating extra constraints such as the redundancy constraint and the query-based constraint in a principled way. Additionally, while Compo has a strong focus on multi-parameter search and optimization, CPModule does exhaustive search and calculates the significance of a CRM using a large collection of background sequences in a second step.

For benchmarking we used the synthetic data constructed by Xie et al. (36). This data set contains intergenic sequences in which 'true motif sites' are inserted. The performance of CRM detection was assessed by comparing the best scoring solution of each algorithm with the known 'true' solution. The quality of the solutions was evaluated both at the motif and nucleotide level using a motif correlation coefficient (mCC) and a nucleotide correlation coefficient (nCC), respectively (5) (see 'Materials and Methods' section).

Finding CRMs in multiple long sequences using all 516 TRANSFAC PWMs is a challenging task. Table 1 shows a comparison of the different tools on our synthetic benchmark data. The single-sequence tools Cister (16) and Cluster-Buster (15) perform rather poorly (<0.25 for both mCC and nCC). Because they screen sequences individually, they predict CRMs with a large number of motifs that differ from sequence to sequence, resulting in bad scores. ModuleSearcher (41) could not handle the large number of PWMs and repeatedly ran into memory problems, even with 2GB of RAM allocated. When limiting the maximum number of motifs in a CRM to 10 or less, Compo (14) found the solution reported in Table 1. It scores very well at the nucleotide level (0.68), meaning that it finds the binding region on the sequences quite accurately. However, at the motif level it performs rather poorly (0.27); it wrongly identifies the motifs that bind in the identified regions. CPModule scores worse (0.55 versus 0.68) at the nucleotide level than Compo, but scores considerably better (0.57 versus 0.27) at the motif level.

Table 1. Comparison of CRM prediction algorithms

       Cister   Cluster-Buster   ModuleSearcher   Compo   CPModule
mCC    0.16     0.05             /                0.27    0.57
nCC    0.23     0.23             /                0.68    0.55

The tools were run on the synthetic data set of Xie et al. (36) using a stringent screening with 516 TRANSFAC PWMs. Slash indicates termination by lack of memory.

To allow for a more thorough comparison of the different tools, we reanalyzed the dataset, each time using a subset of the 516 motif models. Starting from the PWMs of the three inserted TFs, we gradually increase the number of total PWMs by sampling them from the set of 513 remaining ones. This results in an increasingly larger input in terms of candidate motifs, which contains increasingly more false motif models. This makes the problem increasingly harder. Figure 2 shows the motif and nucleotide-level correlation coefficients (CC) of the different tools on the data. The number of motif models used is shown on the x-axis (for each sample size, 10 different samples are created and all tools are run using the same sample sets). The single-sequence-based tools Cister (16) and Cluster-Buster (15) perform best in the presence of few sampled PWMs and deteriorate as more PWMs are added. With few PWMs their predictions are accurate, especially regarding the binding region of the CRMs (nucleotide level). However, because they treat sequences independently, different false positive motifs are predicted on each sequence separately. This becomes worse as the number of sampled PWMs increases, leading to decreasing scores.

Assessing the added value of ChIP-based information on detecting CRMs involved in mouse embryonic stem cell

Figure 2. Performance comparison of CRM detection tools. All CRM detection tools were run on the synthetic data set of Xie et al. (36). Screening was performed with the PWMs used to generate the synthetic data in combination with an additional set of PWMs sampled from TRANSFAC (the number of PWMs added to the true PWMs is indicated on the x-axis). (A) mCC, (B) nCC.
For the multiple sequence tools Description of the experimental set up Compo (14), ModuleSearcher (41) and CPModule this To show the effect of using ChIP-Seq information on problem is less pronounced. The behavior of Compo improving the performance of CRM detection, we changes as the number of motifs increases: at the motif relied on publicly available ChIP-Seq experiments con- level, the score has a decreasing trend, while at the nucleo- ducted by Chen et al. (25). The data consist of tide level the score increases. Looking at the CRMs found, ChIP-Seq experiments for ﬁve key TFs involved in we observe that Compo ﬁnds CRMs with only one true self-renewal of mouse embryonic stem cells, namely motif and increasingly more false motifs as the number of KLF4, NANOG, OCT4, SOX2 and STAT3 for which PWMs increases, explaining the motif-level behavior. it is known that combinatorial interactions exist Unexpectedly, adding these false motifs seems to contrib- amongst at least some of these ﬁve TFs. These previously ute rather than to deteriorate the precision at the nucleo- known interactions, corresponding to nine different tide level. The shorter predicted binding regions obtained CRMs were used as a benchmark (Figure 3). Starting by adding more motifs seem to coincidentally better ap- from the data of a ChIP-Seq experiment of a single proximate the regions in which also the true CRMs are assayed TF, we used CRM detection to discover, located. The scores of CPModule and ModuleSearcher in silico, the other TFs with which the assayed TF con- score well on both the motif- and nucleotide-level, while stitutes aCRM.Wethentestedtowhatextentwecould being less affected by the number of motifs used. recover the previously described benchmark CRMs using However, ModuleSearcher is unable to scale to more either a query-based or non-query-based setting. 
The than 400 motifs because of memory issues, while this is non-query-based setting mimics the traditional way in not a problem for CPModule. which CRM detection is being performed, that is trying These results show that CPModule is competitive with to prioritize a CRM that is enriched in a set of sequences state-of-the-art tools in detecting CRMs in sets of without using any further prior information. In the coregulated sequences. This, in combination with its query-based setting, only CRMs that contain a motif capability of handling a large number of sequences, prioritizing CRMs by means of ranked lists and the for the assayed TF are searched for. Prior to the CRM ability to be used in a query-based mode, make it ideally detection, we ﬁrst optimized screening thresholds to suited for this study and for the analysis of ChIP-Seq data reduce the effect of the screening on the success of the in general. CRM detection. Figure 3. Known combinatorial regulation of the ﬁve assayed TFs. Network representing combinatorial interactions between the ﬁve transcription factors (KLF4, SOX2, OCT4, NANOG and STAT3) involved in embryonic stem cell development. Edges indicate that a combinatorial interaction between the indicated TFs exists as reported in literature (with a combinatorial interaction referring to the fact that at least subsets of genes contain binding sites for both TFs in each other’s neighborhood). Dashed lines correspond to the interactions in the benchmark that were missed by CPModule. Solid lines correspond to the interactions in the benchmark that were recovered by CPModule. The thin line indicates that the interaction was detected using CPModule on the ChIP-Seq data set of one TF while the thick line indicates that the interactions was detected by using either ChIP-Seq dataset of the TFs involved in the interaction. PAGE 9 OF 16 Nucleic Acids Research, 2012, Vol. 40, No. 
Screening and filtering

We started from the top 100 binding peaks identified for each ChIP-Seq-assayed TF (assuming that those represent the most reliable binding sites). It was recently shown that the sites of the assayed TF do not exactly coincide with their binding peaks, but can be located as far as 250 bp from the actual peak (42,43). Therefore, we used a sequence region of 500 bp centered around each of the top 100 peaks as input sequences.

Ideally, the result of the screening should have a high sensitivity, while containing only few false positive motif sites. Using a stringent threshold would bias towards only finding the most ‘conserved motif sites’. However, true sites do not necessarily correspond to the most conserved ones (44,45). Conversely, a low stringency threshold might result in the inclusion of too many false positives, possibly deteriorating the CRM detection. Therefore, in addition to a screening with either a high or a low stringency threshold, we also used a low stringency screening combined with a filtering based on nucleosome positioning, as nucleosome positioning plays a role in determining the accessibility of a site (39). As the information on condition- and tissue-dependent nucleosome occupancy is not readily available, we relied on the NuOS, which has also previously been used in the context of motif detection (30) (see ‘Materials and Methods’ section). Motif sites located in regions that show a high NuOS are considered to be transcriptionally inactive (46) and were therefore filtered out.

One advantage of starting from ChIP-Seq based information is that it allows us to approximate, at least for the assayed TF, the effect of the screening/filtering on the recovery rate of its binding sites. This effect on recovering the binding sites of the assayed TF can be seen as representative for the effect of the screening/filtering on recovering sites of any other TF. In Figure 4, we display for each input set (sequences corresponding to the 100 binding regions of each assayed TF) the sensitivity in recovering binding sites of the assayed TF after applying different screening/filtering procedures. The sensitivity is expressed as the percentage of the input set in which a motif site of the assayed TF could be detected. A sensitivity of 100% thus corresponds to retrieving at least one motif site for the assayed TF in each of the 100 high scoring binding regions. As can be expected, a stringent screening results in a rather low sensitivity for most of the binding regions of the assayed TFs (sensitivity <50%). Lowering the screening stringency largely increases this sensitivity. At least 80% of the binding peaks for respectively KLF4 (84%), NANOG (80%), OCT4 (98%), SOX2 (95%) and STAT3 (100%) were found to contain a motif site for their respective TFs. However, this increased sensitivity comes at the expense of also predicting many more potentially false positive sites. The number of false positives that result from a screening/filtering procedure is harder to estimate as we have no clue about the identity or location of the true sites. Therefore, we estimated the false discovery rate by the average number of motif sites per TF and per sequence region retained after applying different screening procedures (Figure 4B). The average number of sites per screened TF should be sufficiently low. Note that this average number does not exclude that some TFs can have multiple motif sites in the same sequence while others might have none. Applying a low stringency screening resulted, in each of the analyzed data sets, in the detection of on average four motif sites per TF and per sequence (Figure 4B). This was much lower in case stringent screening was used.

Combining the non-stringent screening with a filtering based on nucleosome occupancy (that is, removing sites predicted to be located in nucleosome occupied regions) seems to offer a good trade-off between the sensitivity and the number of false positive sites, as is shown in Figure 4. Compared to stringent screening, applying the filtering after a non-stringent screening largely increases the sensitivity for most of the assayed TFs, while still maintaining the number of predicted sites per TF within a reasonable range. In the remainder of the analysis, filtering was applied on the low stringency predicted sites of all TFs except for those of the assayed one. For the assayed TF, filtering was omitted as the ChIP-derived evidence experimentally supports that each ChIP-bound region contains binding sites of that TF.

Figure 4. Effect of different screening/filtering combinations on motif prediction results. (A) Effect of using different screening/filtering combinations on the sensitivity of recovering true sites of the assayed TF. Sensitivity is assessed by the percentage of binding peak regions in which a motif site of the assayed TF could be detected. (B) Effect of using different screening/filtering combinations on the average number of remaining motif sites per sequence and per TF for each of the ChIP-assayed data sets. In each panel, we used a stringent screening, a non-stringent screening without filtering and a non-stringent screening with a filtering based on NuOS (different categories are indicated in the order as mentioned above by bars with increasing gray scales), respectively.

CRM detection

CPModule was run using, for each assayed TF, the sequences of the top 100 peak regions, as well as the motif sites identified by the screening and filtering. For the proximity threshold, we started for each data set from 150 bp and extended this value stepwise (50 bp at a time) to maximally 400 bp. The value of the frequency threshold was set to 60%. Predicted CRMs were ranked according to their P-values. The higher the rank of a benchmark CRM (that is a previously known CRM, see Figure 3), the better the algorithm was able to prioritize the CRM amongst the total number of predicted CRMs.

The results obtained by running CPModule after screening with a stringent threshold (for all TFs except the assayed one) (Supplementary Table S2) show that a stringent screening threshold lowers the sensitivity of retrieving true binding sites to such an extent that none of the benchmark CRMs can be retrieved at the predefined frequency threshold of 60%. Only subsequently lowering the frequency threshold to 50% allowed recovery of a benchmark CRM (1 out of the 9). To calculate benchmark recovery we considered all solutions obtained with all possible proximity thresholds irrespective of their rank. Without filtering, the number of binding sites per motif and sequence became so high that the search for CRMs was computationally prohibitive.

The most informative results and highest coverage of benchmark CRMs were obtained using CPModule with non-stringent screening and filtering based on the NuOS. Seven of the nine previously described CRMs involving KLF4, NANOG, OCT4, SOX2 or STAT3 were recovered. Table 2 shows for each of the assayed TFs and for each proximity threshold, the highest ranking recovered benchmark CRM together with its rank amongst the total number of solutions. In addition to their rank we show for each CRM its support as an additional quality criterion, indicating how many of the 100 given peak regions contained the predicted CRMs (the higher this value, the more specific the detected CRM is for the given dataset).

For all data sets, at the second-lowest proximity threshold (200 bp) most of the benchmark CRMs were retrieved (6 of the 9 benchmark CRMs). Further increasing the proximity threshold results in the same benchmark CRMs also obtained with a lower threshold, albeit most of the time at a lower rank and/or in combination with other motifs. Increasing the proximity threshold will increase the number of valid CRMs. For NANOG, one additional benchmark CRM was retrieved at a higher proximity threshold than the one for which the first CRM containing the assayed TF was found. Most of the TFs involved in the benchmark interactions, therefore, have binding sites in a rather close proximity on the genome.

Predicted CRMs were also validated using the available ChIP-Seq experimental information: if a previously described interaction between the analyzed TF and any of the other four benchmark TFs was predicted by CPModule, the ChIP-Seq data of the other TFs were used to experimentally verify whether their predicted sites in the retrieved CRMs coincided with their binding peaks. Predicted sites of TFs for which experimental data was available overlapped for at least 10% (and often more) with their corresponding binding peaks (Column ‘Validation’ in Table 2). This indicates that most of the benchmark CRMs predicted by CPModule reflect true CRM signals present in the ChIP-Seq data.

The added value of using ChIP-Seq derived information becomes obvious when comparing the rank obtained for each benchmark CRM in the ‘non-query-based’ setting with the one obtained using the ‘query-based’ setting. The first setting mimics a classical CRM detection set up, i.e. when searching for CRMs that are statistically overrepresented in a set of coregulated sequences. The query-based setting differs from the classical setting by only listing CRMs that contain a site for the ChIP-assayed TF. Table 2 shows that in the query-based setting, the rank of the benchmark CRMs is in many cases much better than the rank in the non-query-based setting. This is especially true for KLF4, NANOG, SOX2 and STAT3. For OCT4, the ranks are more similar for both settings, but in the query-based setting far fewer candidate CRMs are returned, and hence the computation is much more efficient. Our results show that exploiting ChIP-Seq information, by constraining the candidate CRMs to only those that contain the assayed TF, not only leads to a compact result set with in many cases a better ranking of the true CRMs, but more importantly they also show that in the absence of such information, CRM detection becomes almost infeasible.

Table 2. Benchmark CRMs obtained with CPModule in combination with filtering (non-stringent screening with filtering for all TFs except the assayed one)

ChIP-Seq-assayed TF | CRM | Support (%) | Proximity threshold (bp) | Query-based Rank/Total | Non-query-based Rank/Total | Validation (%)
KLF4 | KLF4, STAT1 | 66 | 150 | 19/23 | 845/849 | 22.73
KLF4 | KLF4, OCT1 | 60 | 200 | 19/46 | 5562/5694 | 33.33
KLF4 | KLF4, STAT1, [CEBP] | 61 | 200 | 22/46 | 5635/5694 | 23.33
KLF4 | KLF4, STAT4, [SMAD] | 61 | 250 | 16/98 | 6160/7029 | 22.95
KLF4 | KLF4, OCT1 | 60 | 250 | 65/98 | 6994/7029 | 33.33
KLF4 | KLF4, STAT4, [T3R] | 63 | 300 | 23/183 | 6903/7704 | 22.22
KLF4 | KLF4, OCT1 | 61 | 300 | 131/183 | 7651/7704 | 32.79
KLF4 | KLF4, STAT4, STAT1, [CDXA, LEF1] | 60 | 350 | 29/284 | 25 056/26 843 | 23.33
KLF4 | KLF4, OCT1 | 61 | 350 | 212/284 | 26 771/26 843 | 32.79
KLF4 | KLF4, STAT4, [SMAD, T3R] | 60 | 400 | 5/468 | 24 930/31 549 | 21.67
KLF4 | KLF4, OCT1, [CDXA] | 60 | 400 | 207/468 | 31 220/31 549 | 33.33
NANOG | NANOG, STAT5A_03, STAT5A_04 | 62 | 150 | 1/11 | 5930/5941 | 19.35
NANOG | NANOG, STAT5A_04, [PU1] | 60 | 200 | 1/39 | 40 171/43 093 | 23.33
NANOG | NANOG, STAT5A_03, [PU1] | 61 | 250 | 1/71 | 62 475/64 059 | 21.31
NANOG | NANOG, OCT1 | 60 | 250 | 21/71 | 64 006/64 059 | 68.33
NANOG | NANOG, OCT1, [FAC1] | 60 | 300 | 1/145 | 66 186/80 859 | 70.49
NANOG | NANOG, STAT3, [FAC1] | 62 | 300 | 3/145 | 77 724/80 859 | 26.67
NANOG | NANOG, STAT5A_04, [PU1, FAC1] | 60 | 350 | 1/406 | 159 818/217 328 | 25.00
NANOG | NANOG, OCT1, STAT5A_04, [FAC1] | 60 | 350 | 2/406 | 167 806/217 328 | 66.67 (OCT1); 26.67 (STAT5A_04)
NANOG | NANOG, STAT5A_04, [PU1, HNF3, AR] | 60 | 400 | 1/883 | 204 024/299 409 | 23.33
NANOG | NANOG, OCT1, STAT6, STAT5A_04, [FAC1] | 60 | 400 | 2/883 | 224 495/299 409 | 70.00 (OCT1); 26.67 (STAT6)
OCT4 | OCT4, STAT6, [XFD2, FOXJ2, FOXP3] | 63 | 150 | 6/1322 | 30/11 966 | 14.75
OCT4 | OCT4, SOX2 | 60 | 150 | 1272/1322 | 10 348/11 966 | 78.33
OCT4 | OCT4, STAT4, STAT6, [PAX2, PAX4, TITF1] | 62 | 200 | 1/13 141 | 6/111 817 | 16.39
OCT4 | OCT4, SOX2, [PAX2] | 62 | 200 | 11 740/13 141 | 83 797/111 817 | 79.03
OCT4 | OCT4, STAT4, STAT6, [PAX4, PAX2, ELF1] | 66 | 250 | 1/29 767 | 23/182 697 | 16.67
OCT4 | OCT4, SOX2, [CDXA] | 60 | 250 | 28 080/29 767 | 167 142/182 697 | 81.67
OCT4 | OCT4, STAT3 | 61 | 300 | 1/73 091 | 7/235 252 | 14.75
OCT4 | OCT4, SOX2, [CDXA, PAX2] | 60 | 300 | 68 944/73 091 | 217 331/235 252 | 75.00
OCT4 | OCT4, STAT3, [CDXA] | 60 | 350 | 1/290 997 | 1/859 377 | 12.90
OCT4 | OCT4, SOX2, [PAX2, FOXP3] | 60 | 350 | 106 443/290 997 | 296 722/859 377 | 73.33
OCT4 | OCT4, STAT3, STAT5A_03 | 60 | 400 | 1/383 001 | 11/1 080 139 | 14.75
OCT4 | OCT4, SOX2, [PAX2, FOXP3] | 60 | 400 | 150 936/383 001 | 449 140/1 080 139 | 73.33
SOX2 | SOX2, OCT4 | 68 | 150 | 1/6318 | 322/46 471 | 88.24
SOX2 | SOX2, STAT5A_04, [NKX62, AR, HELIOSA] | 60 | 150 | 3/6318 | 840/46 471 | 23.33
SOX2 | SOX2, STAT5A_04, [GEN_INI2_B, FOXJ2, HNF3ALPHA, CEBP, AR] | 60 | 200 | 2/90 416 | 55/512 702 | 25.00
SOX2 | SOX2, OCT4, [CDXA, TST1] | 61 | 200 | 4/90 416 | 106/512 702 | 27.87
SOX2 | SOX2, STAT1, STAT5A_04, [SRY, CAP, NFAT, TEF, AR, CDX, HMGIY, BRCA] | 60 | 250 | 1/168 760 | 55/790 791 | 25.00
SOX2 | SOX2, OCT4, [CDXA, CDX2] | 62 | 250 | 4/168 760 | 183/790 791 | 87.10
SOX2 | SOX2, OCT4, [CDXA, CDX, CEBP] | 60 | 300 | 1/303 533 | 94/1 256 190 | 86.89
SOX2 | SOX2, STAT, STAT5A_03, [CAP, NFAT, GEN_INI2_B, FOXJ2, CEBP] | 60 | 300 | 2/303 533 | 238/1 256 190 | 21.67
STAT3 | STAT3, OCT4, [CAP] | 61 | 150 | 36/5649 | 43/6426 | 36.07
STAT3 | STAT3, SOX2, [IRF1] | 61 | 150 | 2651/5649 | 3018/6426 | 31.15
STAT3 | STAT3, OCT4, [CEBP] | 61 | 200 | 301/32 257 | 312/33 640 | 29.51
STAT3 | STAT3, SOX2, STAT6, [YY1] | 60 | 200 | 4532/32 257 | 4675/33 640 | 30.00
STAT3 | STAT3, OCT4, [CAP, TEF1, YY1, PR] | 60 | 250 | 57/54 549 | 61/56 473 | 31.67
STAT3 | STAT3, SOX2, STAT6, [SRY, IRF1] | 60 | 250 | 6363/54 549 | 6666/56 473 | 28.33
STAT3 | STAT3, OCT1, [CAP, FOXM1, YY1, PR] | 60 | 300 | 186/73 378 | 188/74 106 | 33.33
STAT3 | STAT3, SOX2, STAT6, [XPF1] | 60 | 300 | 8442/73 378 | 8517/74 106 | 28.33
STAT3 | STAT3, OCT, STAT5A_03, STAT6, [HOXA3, AP2REP, PU1] | 60 | 350 | 6/243 758 | 6/243 979 | 35.00
STAT3 | STAT3, SOX2, STAT1, STAT4, STAT5A_04, STAT6, [HNF3, YY1] | 60 | 350 | 21 046/243 758 | 21 066/243 979 | 26.67
STAT3 | STAT3, OCT, STAT5A_03, [AP2REP, PU1, XPF1] | 61 | 400 | 12/308 757 | 12/308 993 | 36.07
STAT3 | STAT3, SOX2, [HOXA3, AP2REP] | 62 | 400 | 21 375/308 757 | 21 385/308 993 | 29.03

In this table, only benchmark CRMs recovered by CPModule are displayed. For reasons of conciseness, we only display for each parameter setting the best ranked versions of each of the benchmark CRMs. For instance, whereas Oct4-Sox2 was found to be the best ranked CRM at a proximity threshold of 150, more combinations of Oct4 and Sox2 with other TFs were also detected at this setting of the proximity parameter, albeit at lower ranks. These alternative versions with lower rank are not displayed in the table. If PWMs for TFs belonging to the same family are very similar, we also considered as true those CRMs that contained another member of the same family rather than the TF reported in literature (48) (this was the case for TFs of the STAT and OCT families). The set of sequences corresponding to the top 100 scoring peak regions of the assayed TF was screened with a set of 517 TRANSFAC motifs using a non-stringent screening threshold. Filtering was applied on all motif sites except on the ones of the assayed TF.
ChIP-Seq-assayed TF: TF from which the top 100 binding peaks were used to perform the analysis.
CRM: obtained CRMs that correspond to previously well described CRMs for the assayed TF; [between brackets are indicated other TFs that were predicted to belong to the same CRM, but that have not previously been described to interact with the assayed TF].
Support: the percentage of sequences from the input set in which this CRM occurs (always higher than the frequency threshold).
Proximity threshold (bp): the proximity threshold with which the displayed CRM was found.
Query-based Rank/Total: the rank this CRM received in the query-based setting/the number of solutions containing the motif for the ChIP-Seq-assayed TF.
Non-query-based Rank/Total: the rank this CRM received in all of the solutions/the total number of valid CRMs.
Validation: we started from the ChIP-Seq data of one TF and predicted using CRM detection with which other TFs the assayed TF interacts. We verified whether the motif sites contributing to the predicted CRMs fell within the binding peaks of the other ChIP-Seq-assayed TF.
Table 2 reads as follows: for instance, when starting from the ChIP-Seq data of SOX2, we predicted a previously described CRM containing SOX2-OCT4. This retrieved CRM was ranked first amongst the 6318 potential CRMs that contained SOX2 (rank in the query-based mode) and ranked 322 out of the total number of 46 471 possible CRMs in the non-query-based mode. SOX2 and OCT4 co-occurred in 68% of the SOX2 ChIP-Seq identified regions (Support) within a distance of 150 bp and the identified sites for OCT4 in the predicted CRM fell within the identified OCT4 ChIP-Seq regions in 88.24% of the cases. As an example of how the same CRM can be detected at different proximity thresholds: the CRM containing KLF4-OCT1 was recovered at a proximity constraint of 200, 250, 300 and 350, but with an increasingly lower absolute rank in the non-query-based setting. With the current screening/filtering all runs could be performed except those for SOX2 with proximity thresholds of 350 bp and 400 bp, respectively. These did not finish after 7 days.

Novel predicted CRMs involved in embryonic stem cell regulation

Besides the results on the benchmark set, we also displayed for all assayed TFs their top three ranking CRMs predicted by CPModule, for the different proximity thresholds (Supplementary Table S3). Note that those CRMs score in most cases much better than the benchmark CRMs.

Functional analysis (Ingenuity Pathway analysis and literature-based analysis) of the 56 TFs that were involved in the predicted CRMs of respectively KLF4, NANOG, OCT4, SOX2 or STAT3, showed that most of the TFs have functions related to development (embryonic development, cellular development, tissue development, organ development, organismal development), cellular growth and proliferation, cancer and tissue morphology, all functionalities that could be related to ESC cell growth, death and differentiation (see Supplementary File S3).

For a handful of the predicted CRMs (KLF4-STAT4; OCT4-CDXA; OCT4-PAX2; OCT4-STAT; OCT4-SRY; SOX2-OCT4), it was previously proven that their contributing TFs were involved in combinatorial regulation (see Supplementary Table S3 for a list of references). For most of the other CRMs we could find indirect literature-based support, suggesting the plausibility of the predicted interactions. For instance, for NANOG-FAC1, the putative transcriptional regulator FAC1 has been shown to be expressed in embryonic and extra-embryonic tissues of the early mouse conceptus and was shown to be essential for trophoblast differentiation during early mouse development (47), making an interaction between NANOG and FAC1 plausible. Other examples are described in Table 3.

Table 3. Indirect evidences for the suggested CRMs in Supplementary Table S3

Suggested CRM | Indirect evidence
NANOG-FAC1 | The putative transcriptional regulator FAC1 is expressed in embryonic and extraembryonic tissues of the early mouse conceptus. Study (47) showed that FAC1 is essential for trophoblast differentiation during early mouse development. Thus, there might be an interaction between NANOG and FAC1.
OCT4-FOXM1 | Foxm1 has been hypothesized to be one of the candidates to help reprogramming somatic cells into iPSCs (Induced pluripotent stem cells) (49). FOXM1 is a major stimulator of cell proliferation (50), so it might interact with KLF4 in the self-renewal process.
SOX2-CDXA | Binding of homeobox domain from CDX1 protein and SOX2 protein was shown to occur in a system of purified components (51). Note that we identified a CRM with CDXA and not with CDX1. However CDXA and CDX1 belong to the same protein family and have very similar motif models.
SOX2-BRCA | Roles of BRCA in both homologous recombination and non-homologous end joining DNA repair have been shown (52,53). Such function of BRCA might also play a role during the self-renewal process to repair DNA damage.
STAT3-HOXA3 | As HOXA3 is involved in wound repair (54), interaction with STAT3 in the self-renewal process is plausible.
STAT3-GATA1 | GATA1 was known to be one of the major transcription factors that stimulated cardiogenesis during development (55–57) even though it is frequently used as a marker for endodermal derivatives during differentiation of pluripotent stem cells (58). Interaction with STAT3 in the self-renewal process is therefore plausible.
STAT1-STAT3-STAT6 | Binding of human STAT3 protein and human STAT6 protein has been shown in a 2-hybrid assay (59). STAT1 and STAT3 can form heterodimers (60,61). Note however that with STAT motif models it is difficult to make the distinction between the different STAT members.

Indirect evidences derived from literature searches which give indications on possible interactions between the indicated TFs found in the predicted CRMs.

DISCUSSION

In this work, we developed CPModule, a novel approach for CRM detection with a performance that is competitive to that of other state-of-art tools, while being able to handle larger data sets (such as 100 sequences in combination with a library of 517 PWMs). The advantage of CPModule is that it builds upon a constraint programming for itemset mining framework (24). This approach provides fast search techniques similar to those in itemset mining, while allowing additional constraints to be freely imposed on the CRMs. These constraints, such as the proximity and redundancy constraint, can help prioritize likely CRMs. At this point our system focuses on finding loosely structured CRMs that satisfy multiple constraints, such as a minimum frequency constraint or a constraint on the maximum distance between binding sites. It can easily be extended to find unstructured CRMs under additional constraints. The discovery of structured CRMs, similar to the approaches proposed by Noto and Craven (2006) and by Cartharius et al. (8) in FrameWorker, is outside the scope of the current approach.

The flexible framework also allows us to use CPModule in a query-based setting when dealing with ChIP-Seq derived data, that is, we can search only for CRMs that contain the assayed TF. Because of its enumerative approach, CPModule outputs all valid CRMs as an ordered list rather than returning one CRM as being the most significant CRM. Having an idea of the rank of a true CRM amongst the total number of valid CRMs gives an intuition on the difficulty of computationally retrieving a true CRM in a particular data set. We used this property to assess the contribution of ChIP-Seq data in its ability to prioritize true CRMs. Our results on real datasets showed that in the absence of ChIP-Seq based information, biologically relevant CRM detection is almost infeasible.

The success of CRM detection also depends on the quality of the input data, which is the set of motif sites predicted by screening using motif models such as PWMs. A too dense collection of motif sites (hits) obtained by a non-stringent screening threshold usually results in too many motif combinations, which either make the problem intractable or lower the quality of the outcome; more false positive hits in the screening will also result in the detection of more spurious CRMs. Just increasing the stringency of the screening seems not to be the best option as many true sites, and thus also true CRMs, appear to be missing. Using a lower screening threshold in combination with a filtering procedure based on nucleosome occupancy provided in our case a good trade-off between keeping the number of false positives in a reasonable range and recovering true sites. We applied this filtering to sites of all TFs other than those of the assayed TF. By increasing the recovery rate of sites of the assayed TF, we maximize the chance of finding CRMs and instances of CRMs that contain the assayed TF, as those are the ones we are primarily interested in. Using a more stringent filtering for all other TFs will help reduce the spurious CRMs, but might come at the expense of not being able to detect some of the true interactions between the assayed TF and other TFs in TRANSFAC (which might explain why we couldn’t recover all previously described benchmark CRMs). With more experimental data on condition-dependent nucleosome occupancy and other cell specific features becoming available, filtering will become more reliable and will surely further improve the success of combinatorial CRM detection.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables S1–S3, Supplementary Figures S1–S2 and Supplementary References [62–112].

ACKNOWLEDGEMENTS

The authors thank Luc De Raedt for the invaluable discussions and his vision on using constraint programming for itemset mining, which enabled the development of the CPModule system. The authors also thank the anonymous reviewers for many useful comments.

FUNDING

KULeuven Research Council (GOA/08/011, - SymBioSys -CoE EF/05/007, NATAR C1895-PF/10/10); the agency for Innovation by Science and Technology (SBO-BioFrame); Interuniversity Attraction Poles (P6/25-BioMaGNet); Research Foundation - Flanders (IOK-B9725-G.0329.09); the Human Frontier Science Program (RGY0079/2007C); Ghent University (Multidisciplinary Research Partnership ‘M2N’); Institute for the Promotion and Innovation through Science and Technology in Flanders (IWT-Vlaanderen); A Postdoc grant from the Research Foundation-Flanders. Funding for open access charge: KU Leuven Research Council (SymBioSys, CoE EF/05/007).

Conflict of interest statement. None declared.

REFERENCES

1. Davidson,E.H. (2001) Genomic Regulatory Systems: Development and Evolution. Academic Press, San Diego.
2. Zhou,Q. and Wong,W.H. (2004) CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl Acad. Sci. USA, 101, 12114–12119.
3. Gupta,M. and Liu,J.S. (2005) De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl Acad. Sci. USA, 102, 7079–7084.
4. Van Loo,P. and Marynen,P.
15. Frith,M.C., Li,M.C. and Weng,Z. (2003) Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res., 31, 3666–3668.
16. Frith,M.C., Hansen,U. and Weng,Z. (2001) Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17, 878–889.
17. Buck,M.J. and Lieb,J.D. (2004) ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83, 349–360.
18. Jothi,R., Cuddapah,S., Barski,A., Cui,K. and Zhao,K. (2008) Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res., 36, 5221–5231.
19. Liu,E.T., Pott,S. and Huss,M. (2010) Q&A: ChIP-seq technologies and the study of gene regulation. BMC Biol., 8, 56.
20. Pepke,S., Wold,B. and Mortazavi,A. (2009) Computation for ChIP-seq and RNA-seq studies. Nat. Methods, 6, S22–S32.
21. Li,X., MacArthur,S., Bourgon,R., Nix,D., Pollard,D., Iyer,V., Hechmer,A., Simirenko,L., Stapleton,M., Luengo Hendriks,C. et al. (2008) Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol., 6, e27.
22. Visel,A., Blow,M., Li,Z., Zhang,T., Akiyama,J., Holt,A., Plajzer-Frick,I., Shoukry,M., Wright,C., Chen,F. et al. (2009) ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457, 854–858.
23. van der Meer,D.L., Degenhardt,T., Vaisanen,S., de Groot,P.J., Heinaniemi,M., de Vries,S.C., Muller,M., Carlberg,C. and Kersten,S. (2010) Profiling of promoter occupancy by PPARalpha in human hepatoma cells via ChIP-chip analysis. Nucleic Acids Res., 38, 2839–2850.
24. De Raedt,L., Guns,T. and Nijssen,S. (2008) Constraint programming for itemset mining. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, pp. 204–212.
25. Chen,X., Xu,H., Yuan,P., Fang,F., Huss,M., Vega,V.B., Wong,E., Orlov,Y.L., Zhang,W., Jiang,J. et al. (2008) Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133, 1106–1117.
(2009) Computational methods for 26. Matys,V., Kel-Margoulis,O.V., Fricke,E., Liebich,I., Land,S., the detection of cis-regulatory modules. Brief Bioinform., 10, Barre-Dirrie,A., Reuter,I., Chekmenev,D., Krull,M., 509–524. Hornischer,K. et al. (2006) TRANSFAC and its module 5. Klepper,K., Sandve,G.K., Abul,O., Johansen,J. and Drablos,F. TRANSCompel: transcriptional gene regulation in eukaryotes. (2008) Assessment of composite motif discovery methods. BMC Nucleic Acids Res., 34, D108–D110. Bioinformatics, 9, 123. 27. Coessens,B., Thijs,G., Aerts,S., Marchal,K., De Smet,F., 6. Noto,K. and Craven,M. (2006) A specialized learner for inferring Engelen,K., Glenisson,P., Moreau,Y., Mathys,J. and De Moor,B. structured cis-regulatory modules. BMC Bioinformatics, 7, 528. (2003) INCLUSive: a web portal and service registry for 7. Whitington,T., Frith,M.C., Johnson,J. and Bailey,T.L. (2011) microarray and regulatory sequence analysis. Nucleic Acids Res., Inferring transcription factor complexes from ChIP-seq data. 31, 3468–3470. Nucleic Acids Res., 39, e98. 28. Frith,M.C., Fu,Y., Yu,L., Chen,J.F., Hansen,U. and Weng,Z. 8. Cartharius,K., Frech,K., Grote,K., Klocke,B., Haltmeier,M., (2004) Detection of functional DNA motifs via statistical Klingenhoff,A., Frisch,M., Bayerlein,M. and Werner,T. (2005) over-representation. Nucleic Acids Res., 32, 1372–1381. MatInspector and beyond: promoter analysis based on 29. Xi,L., Fondufe-Mittendorf,Y., Xia,L., Flatow,J., Widom,J. and transcription factor binding sites. Bioinformatics, 21, 2933–2942. Wang,J.-P. (2010) Predicting nucleosome positioning 9. Dohr,S., Klingenhoff,A., Maier,H., Hrabe de Angelis,M., using a duration Hidden Markov Model. BMC Bioinformatics, Werner,T. and Schneider,R. (2005) Linking disease-associated 11, 346. genes to regulatory networks via promoter organization. Nucleic 30. Ramsey,S.A., Knijnenburg,T.A., Kennedy,K.A., Zak,D.E., Acids Res., 33, 864–872. 
Gilchrist,M., Gold,E.S., Johnson,C.D., Lampano,A.E., Litvak,V., 10. Calva,D., Dahdaleh,F.S., Woodﬁeld,G., Weigel,R.J., Carr,J.C., Navarro,G. et al. (2010) Genome-wide histone acetylation data Chinnathambi,S. and Howe,J.R. (2011) Discovery of SMAD4 improve prediction of mammalian transcription factor binding promoters, transcription factor binding sites and deletions in sites. Bioinformatics, 26, 2071–2075. 31. Schulte,C. and Stuckey,P.J. (2008) Efﬁcient constraint juvenile polyposis patients. Nucleic Acids Res., 39, 5369–5378. propagation engines. ACM T Progr. Lang. Sys., 31, 43. 11. Kwon,A.T., Chou,A.Y., Arenillas,D.J. and Wasserman,W.W. 32. Guns,T., Sun,H., Nijssen,S., Marchal,K. and De Raedt,L. (2010) (2011) Validation of skeletal muscle cis-regulatory module Cis-regulatory module detection using constraint programming. predictions reveals nucleotide composition bias in functional Proceedings of IEEE International Conference on Bioinformatics enhancers. PLoS Comp. Biol., 7, e1002256. 12. Su,J., Teichmann,S.A. and Down,T.A. (2010) Assessing and Biomedicine. IEEE Computer Society, Washington, DC, computational methods of cis-regulatory module prediction. PLoS USA, pp. 363–368. Comp. Biol., 6, e1001020. 33. Gallo,A., De Bie,T. and Cristianini,N. (2007) MINI: mining 13. Van Loo,P., Aerts,S., Thienpont,B., De Moor,B., Moreau,Y. and informative nonredundant itemset. Proceedings of the 11th Marynen,P. (2008) ModuleMiner - improved computational Conference on Principles and Practice of Knowledge Discovery in detection of cis-regulatory modules: are there different modes of Databases. Springer, Berlin, pp. 438–445. gene regulation in embryonic development and adult tissues? 34. Fujita,P.A., Rhead,B., Zweig,A.S., Hinrichs,A.S., Karolchik,D., Genome Biol., 9, R66. Cline,M.S., Goldman,M., Barber,G.P., Clawson,H., Coelho,A. 14. Sandve,G.K., Abul,O. and Drablos,F. (2008) Compo: composite et al. (2011) The UCSC Genome Browser database: update 2011. 
motif discovery using discrete models. BMC Bioinformatics, 9, 527. Nucleic Acids Res., 39, D876–D882. PAGE 15 OF 16 Nucleic Acids Research, 2012, Vol. 40, No. 12 e90 35. Celniker,S.E., Dillon,L.A., Gerstein,M.B., Gunsalus,K.C., injury-induced mobilization and recruitment of bone Henikoff,S., Karpen,G.H., Kellis,M., Lai,E.C., Lieb,J.D., marrow-derived cells. Stem Cells, 27, 1654–1665. MacAlpine,D.M. et al. (2009) Unlocking the secrets of the 55. Grepin,C., Robitaille,L., Antakly,T. and Nemer,M. (1995) Inhibition of transcription factor GATA-4 expression blocks genome. Nature, 459, 927–930. 36. Xie,D., Cai,J., Chia,N.Y., Ng,H.H. and Zhong,S. (2008) in vitro cardiac muscle differentiation. Mol. Cell Biol., 15, Cross-species de novo identiﬁcation of cis-regulatory modules 4095–4102. with GibbsModule: application to gene regulation in embryonic 56. Lien,C.L., Wu,C., Mercer,B., Webb,R., Richardson,J.A. and stem cells. Genome Res., 18, 1325–1335. Olson,E.N. (1999) Control of early cardiac-speciﬁc transcription 37. Aerts,S., Van Loo,P., Moreau,Y. and De Moor,B. (2004) A of Nkx2-5 by a GATA-dependent enhancer. Development, 126, genetic algorithm for the detection of new cis-regulatory modules 75–84. in sets of coregulated genes. Bioinformatics, 20, 1974–1976. 57. Pikkarainen,S., Tokola,H., Kerkela,R. and Ruskoaho,H. (2004) 38. Barrett,T., Troup,D.B., Wilhite,S.E., Ledoux,P., Rudnev,D., GATA transcription factors in the developing and adult heart. Evangelista,C., Kim,I.F., Soboleva,A., Tomashevsky,M., Cardiovasc. Res., 63, 196–207. Marshall,K.A. et al. (2009) NCBI GEO: archive for 58. Holtzinger,A., Rosenfeld,G.E. and Evans,T. (2010) Gata4 directs high-throughput functional genomic data. Nucleic Acids Res., 37, development of cardiac-inducing endoderm from ES cells. Dev. D885–890. Biol., 337, 63–73. 39. Whitington,T., Perkins,A.C. and Bailey,T.L. (2009) 59. 
Ravasi,T., Suzuki,H., Cannistraci,C.V., Katayama,S., Bajic,V.B., High-throughput chromatin information enables accurate Tan,K., Akalin,A., Schmeier,S., Kanamori-Katayama,M., tissue-speciﬁc prediction of transcription factor binding sites. Bertin,N. et al. (2010) An atlas of combinatorial transcriptional Nucleic Acids Res., 37, 14–25. regulation in mouse and man. Cell, 140, 744–752. 40. Jiang,J., Chan,Y.S., Loh,Y.H., Cai,J., Tong,G.Q., Lim,C.A., 60. Levy,D.E. and Darnell,J.E. Jr (2002) Stats: transcriptional Robson,P., Zhong,S. and Ng,H.H. (2008) A core Klf circuitry control and biological impact. Nat. Rev. Mol. Cell Biol., 3, regulates self-renewal of embryonic stem cells. Nat. Cell Biol., 10, 651–662. 353–360. 61. John,S., Reeves,R.B., Lin,J.X., Child,R., Leiden,J.M., 41. Aerts,S., Van Loo,P., Thijs,G., Moreau,Y. and De Moor,B. Thompson,C.B. and Leonard,W.J. (1995) Regulation of (2003) Computational detection of cis -regulatory modules. cell-type-speciﬁc interleukin-2 receptor alpha-chain gene Bioinformatics, 19(Suppl. 2), ii5–ii14. expression: potential role of physical interactions between Elf-1, 42. Wilbanks,E.G. and Facciotti,M.T. (2010) Evaluation of algorithm HMG-I(Y), and NF-kappa B family proteins. Mol. Cell Biol., 15, performance in ChIP-seq peak detection. PLoS One, 5, e11471. 1786–1796. 43. Zhang,Y., Liu,T., Meyer,C.A., Eeckhoute,J., Johnson,D.S., 62. Farrar,J.D., Smith,J.D., Murphy,T.L. and Murphy,K.M. (2000) Bernstein,B.E., Nusbaum,C., Myers,R.M., Brown,M., Li,W. et al. Recruitment of Stat4 to the human interferon-alpha/beta receptor (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol., requires activated Stat2. J. Biol. Chem., 275, 2693–2697. 9, R137. 63. Wang,C., Zhou,G.L., Vedantam,S., Li,P. and Field,J. (2008) 44. Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for Mitochondrial shuttling of CAP1 promotes actin- and ﬁnding protein-DNA binding sites with applications to coﬁlin-dependent apoptosis. J. Cell Sci., 121, 2913–2920. 
chromatin-immunoprecipitation microarray experiments. Nat. 64. Kelley,C.M., Ikeda,T., Koipally,J., Avitahl,N., Wu,L., Biotechnol., 20, 835–839. Georgopoulos,K. and Morgan,B.A. (1998) Helios, a novel 45. Hu,J., Li,B. and Kihara,D. (2005) Limitations and potentials of dimerization partner of Ikaros expressed in the earliest current motif discovery algorithms. Nucleic Acids Res., 33, hematopoietic progenitors. Curr. Biol., 8, 508–515. 4899–4913. 65. Battista,S., Pentimalli,F., Baldassarre,G., Fedele,M., Fidanza,V., 46. Lee,C.K., Shibata,Y., Rao,B., Strahl,B.D. and Lieb,J.D. (2004) Croce,C.M. and Fusco,A. (2003) Loss of Hmga1 gene function Evidence for nucleosome depletion at active regulatory regions affects embryonic stem cell lympho-hematopoietic differentiation. genome-wide. Nat. Genet., 36, 900–905. FASEB J., 17, 1496–1498. 47. Goller,T., Vauti,F., Ramasamy,S. and Arnold,H.H. (2008) 66. Choi,H.J., Geng,Y., Cho,H., Li,S., Giri,P.K., Felio,K. and Transcriptional regulator BPTF/FAC1 is essential for trophoblast Wang,C.R. (2010) Differential requirements for the Ets differentiation during early mouse development. Mol. Cell Biol., transcription factor Elf-1 in the development of NKT cells and 28, 6819–6827. NK cells. Blood, 117, 1880–1887. 48. Macintyre,G., Bailey,J., Haviv,I. and Kowalczyk,A. (2010) 67. Tang,Y., Katuri,V., Dillner,A., Mishra,B., Deng,C.X. and is-rSNP: a novel technique for in silico regulatory SNP detection. Mishra,L. (2003) Disruption of transforming growth factor-beta Bioinformatics, 26, i524–i530. signaling in ELF beta-spectrin-deﬁcient mice. Science, 299, 49. Xie,Z., Tan,G., Ding,M., Dong,D., Chen,T., Meng,X., Huang,X. 574–577. and Tan,Y. (2010) Foxm1 transcription factor is required for 68. Beck,F. and Stringer,E.J. (2010) The role of Cdx genes in the gut maintenance of pluripotency of P19 embryonal carcinoma cells. and in axial development. Biochem. Soc. Trans., 38, 353–357. Nucleic Acids Res., 38, 8027–8038. 69. Park,M.J., Kim,H.Y., Kim,K. 
and Cheong,J. (2009) 50. Wang,I.C., Chen,Y.J., Hughes,D., Petrovic,V., Major,M.L., Homeodomain transcription factor CDX1 is required for the Park,H.J., Tan,Y., Ackerson,T. and Costa,R.H. (2005) Forkhead transcriptional induction of PPARgamma in intestinal cell box M1 regulates the transcriptional network of genes essential differentiation. FEBS Lett., 583, 29–35. for mitotic progression and genes encoding the SCF (Skp2-Cks1) 70. Holdcraft,R.W. and Braun,R.E. (2004) Androgen receptor ubiquitin ligase. Mol. Cell Biol., 25, 10875–10894. function is required in Sertoli cells for the terminal differentiation 51. Beland,M., Pilon,N., Houle,M., Oh,K., Sylvestre,J.R., Prinos,P. of haploid spermatids. Development, 131, 459–467. and Lohnes,D. (2004) Cdx1 autoregulation is governed by a 71. Merrill,B.J., Gat,U., DasGupta,R. and Fuchs,E. (2001) Tcf3 and novel Cdx1-LEF1 transcription complex. Mol. Cell Biol., 24, Lef1 regulate lineage differentiation of multipotent stem cells in 5028–5038. skin. Genes Dev., 15, 1688–1705. 52. Shafee,N., Smith,C.R., Wei,S., Kim,Y., Mills,G.B., 72. Galceran,J., Farinas,I., Depew,M.J., Clevers,H. and Grosschedl,R. Hortobagyi,G.N., Stanbridge,E.J. and Lee,E.Y. (2008) Cancer (1999) Wnt3a-/–like phenotype and limb deﬁciency in Lef1(-/-)Tcf1(-/-) stem cells contribute to cisplatin resistance in Brca1/p53-mediated mice. Genes Dev., 13, 709–717. mouse mammary tumors. Cancer Res., 68, 3243–3250. 73. Bouchard,M., Souabni,A., Mandler,M., Neubuser,A. and 53. Farmer,H., McCabe,N., Lord,C.J., Tutt,A.N., Johnson,D.A., Busslinger,M. (2002) Nephric lineage speciﬁcation by Pax2 and Richardson,T.B., Santarosa,M., Dillon,K.J., Hickson,I., Pax8. Genes Dev., 16, 2958–2970. Knights,C. et al. (2005) Targeting the DNA repair defect in 74. Torres,M., Gomez-Pardo,E., Dressler,G.R. and Gruss,P. (1995) BRCA mutant cells as a therapeutic strategy. Nature, 434, Pax-2 controls multiple steps of urogenital development. 917–921. Development, 121, 4057–4065. 54. 
Mace,K.A., Restivo,T.E., Rinn,J.L., Paquet,A.C., Chang,H.Y., 75. Kashimada,K. and Koopman,P. (2010) Sry: the master switch in Young,D.M. and Boudreau,N.J. (2009) HOXA3 modulates mammalian sex determination. Development, 137, 3921–3930. e90 Nucleic Acids Research, 2012, Vol. 40, No. 12 PAGE 16 OF 16 76. Sun,L., Ma,K., Wang,H., Xiao,F., Gao,Y., Zhang,W., Wang,K., 96. Chang,C.P., Neilson,J.R., Bayle,J.H., Gestwicki,J.E., Kuo,A., Gao,X., Ip,N. and Wu,Z. (2007) JAK1-STAT1-STAT3, a key Stankunas,K., Graef,I.A. and Crabtree,G.R. (2004) A ﬁeld of pathway promoting proliferation and preventing premature myocardial-endocardial NFAT signaling underlies heart valve differentiation of myoblasts. J. Cell Biol., 179, 129–138. morphogenesis. Cell, 118, 649–663. 77. Kang,J., DiBenedetto,B., Narayan,K., Zhao,H., Der,S.D. and 97. Kimura,S., Hara,Y., Pineau,T., Fernandez-Salguero,P., Fox,C.H., Chambers,C.A. (2004) STAT5 is required for thymopoiesis Ward,J.M. and Gonzalez,F.J. (1996) The T/ebp null mouse: in a development stage-speciﬁc manner. J. Immunol., 173, thyroid-speciﬁc enhancer-binding protein is essential for the 2307–2314. organogenesis of the thyroid, lung, ventral forebrain, and 78. Snow,J.W., Abraham,N., Ma,M.C., Abbey,N.W., Herndier,B. and pituitary. Genes Dev., 10, 60–69. Goldsmith,M.A. (2002) STAT5 promotes multilineage 98. Henseleit,K.D., Nelson,S.B., Kuhlbrodt,K., Hennings,J.C., hematolymphoid development in vivo through effects on early Ericson,J. and Sander,M. (2005) NKX6 transcription factor hematopoietic progenitor cells. Blood, 99, 95–101. activity is required for alpha- and beta-cell development in the 79. Wurster,A.L., Tanaka,T. and Grusby,M.J. (2000) The biology of pancreas. Development, 132, 3139–3149. Stat4 and Stat6. Oncogene, 19, 2577–2584. 99. Wang,J., Elghazi,L., Parker,S.E., Kizilocak,H., Asano,M., 80. Barak,O., Lazzaro,M.A., Lane,W.S., Speicher,D.W., Picketts,D.J. Sussel,L. and Sosa-Pineda,B. (2004) The concerted activities of and Shiekhattar,R. 
(2003) Isolation of human NURF: a regulator Pax4 and Nkx2.2 are essential to initiate pancreatic beta-cell of engrailed gene expression. EMBO J., 22, 6089–6100. differentiation. Dev. Biol., 266, 178–189. 81. Jacks,T. (1996) Tumor suppressor gene mutations in mice. Annu. 100. Mansouri,A., Chowdhury,K. and Gruss,P. (1998) Follicular cells Rev. Genet., 30, 603–636. of the thyroid gland require Pax8 gene function. Nat. Genet., 19, 82. Begay,V., Smink,J. and Leutz,A. (2004) Essential requirement of 87–90. CCAAT/enhancer binding proteins in embryogenesis. Mol. Cell 101. Selleri,L., Depew,M.J., Jacobs,Y., Chanda,S.K., Tsang,K.Y., Biol., 24, 9744–9751. Cheah,K.S., Rubenstein,J.L., O’Gorman,S. and Cleary,M.L. 83. Niedernhofer,L.J., Essers,J., Weeda,G., Beverloo,B., de Wit,J., (2001) Requirement for Pbx1 in skeletal patterning and Muijtjens,M., Odijk,H., Hoeijmakers,J.H. and Kanaar,R. (2001) programming chondrocyte proliferation and differentiation. The structure-speciﬁc endonuclease Ercc1-Xpf is required for Development, 128, 3543–3557. targeted gene replacement in embryonic stem cells. EMBO J., 20, 102. Shyamala,G., Yang,X., Cardiff,R.D. and Dale,E. (2000) 6540–6549. Impact of progesterone receptor on cell-fate decisions during 84. Wan,H., Dingle,S., Xu,Y., Besnard,V., Kaestner,K.H., Ang,S.L., mammary gland development. Proc. Natl Acad. Sci. USA, 97, Wert,S., Stahlman,M.T. and Whitsett,J.A. (2005) Compensatory 3044–3049. roles of Foxa1 and Foxa2 during lung morphogenesis. J. Biol. 103. Sebastiano,V., Dalvai,M., Gentile,L., Schubart,K., Sutter,J., Chem., 280, 13809–13816. Wu,G.M., Tapia,N., Esch,D., Ju,J.Y., Hubner,K. et al. (2010) 85. Tompers,D.M., Foreman,R.K., Wang,Q., Kumanova,M. and Oct1 regulates trophoblast development during early mouse Labosky,P.A. (2005) Foxd3 is required in the trophoblast embryogenesis. Development, 137, 3551–3560. progenitor cell lineage of the mouse embryo. Dev. Biol., 285, 104. Ryu,E.J., Wang,J.Y., Le,N., Baloh,R.H., Gustin,J.A., Schmidt,R.E. 
and Milbrandt,J. (2007) Misexpression of Pou3f1 126–137. 86. Ohyama,T. and Groves,A.K. (2004) Expression of mouse Foxi results in peripheral nerve hypomyelination and axonal loss. class genes in early craniofacial development. Dev. Dyn., 231, J. Neurosci., 27, 11552–11559. 640–646. 105. Aberg,T., Cavender,A., Gaikwad,J.S., Bronckers,A.L., Wang,X., 87. Granadino,B., Arias-de-la-Fuente,C., Perez-Sanchez,C., Waltimo-Siren,J., Thesleff,I. and D’Souza,R.N. (2004) Parraga,M., Lopez-Fernandez,L.A., del Mazo,J. and Rey- Phenotypic changes in dentition of Runx2 homozygote-null Campos,J. (2000) Fhx (Foxj2) expression is activated during mutant mice. J. Histochem. Cytochem., 52, 131–139. spermatogenesis and very early in embryonic development. Mech. 106. Tremblay,K.D., Dunn,N.R. and Robertson,E.J. (2001) Mouse Dev., 97, 157–160. embryos lacking Smad1 signals display defects in 88. Fontenot,J.D., Gavin,M.A. and Rudensky,A.Y. (2003) Foxp3 extra-embryonic tissues and germ cell formation. Development, programs the development and function of CD4+CD25+ 128, 3609–3621. regulatory T cells. Nat. Immunol., 4, 330–336. 107. James,D., Levine,A.J., Besser,D. and Hemmati-Brivanlou,A. 89. Tsai,F.Y., Browne,C.P. and Orkin,S.H. (1998) Knock-in mutation (2005) TGFbeta/activin/nodal signaling is necessary for the of transcription factor GATA-3 into the GATA-1 locus: partial maintenance of pluripotency in human embryonic stem cells. rescue of GATA-1 loss of function in erythroid cells. Dev. Biol., Development, 132, 1273–1282. 196, 218–227. 108. Wontakal,S.N., Guo,X., Will,B., Shi,M., Raha,D., 90. Stecca,B., Nait-Oumesmar,B., Kelley,K.A., Voss,A.K., Thomas,T. Mahajan,M.C., Weissman,S., Snyder,M., Steidl,U., Zheng,D. and Lazzarini,R.A. (2002) Gcm1 expression deﬁnes three stages et al. (2011) A large gene network in immature erythroid cells is of chorio-allantoic interaction during placental development. controlled by the myeloid and B cell transcriptional regulator Mech. Dev., 115, 27–34. PU.1. 
PLoS Genet., 7, e1001392. 91. Fijalkowska,I., Sharma,D., Bult,C.J. and Danoff,S.K. (2010) 109. Korinek,V., Barker,N., Moerer,P., van Donselaar,E., Huls,G., Expression of the transcription factor, TFII-I, during Peters,P.J. and Clevers,H. (1998) Depletion of epithelial stem-cell post-implantation mouse embryonic development. BMC Res. compartments in the small intestine of mice lacking Tcf-4. Nat. Notes, 3, 203. Genet., 19, 379–383. 92. Kameda,Y., Nishimaki,T., Takeichi,M. and Chisaka,O. (2002) 110. Inukai,T., Inaba,T., Dang,J., Kuribara,R., Ozawa,K., Homeobox gene hoxa3 is essential for the formation of Miyajima,A., Wu,W., Look,A.T., Arinobu,Y., Iwasaki,H. et al. the carotid body in the mouse embryos. Dev. Biol., 247, (2005) TEF, an antiapoptotic bZIP transcription factor related 197–209. to the oncogenic E2A-HLF chimera, inhibits cell growth by 93. Fournier,M., Lebert-Ghali,C.E., Krosl,G. and Bijl,J.J. (2011) down-regulating expression of the common beta chain of HOXA4 induces expansion of hematopoietic stem cells in vitro cytokine receptors. Blood, 105, 4437–4444. 111. Wallis,K., Sjogren,M., van Hogerlinden,M., Silberberg,G., and confers enhancement of pro-B-cells in vivo. Stem Cells Dev, 21, 133–142. Fisahn,A., Nordstrom,K., Larsson,L., Westerblad,H., Morreale 94. Kim,J.I., Li,T., Ho,I.C., Grusby,M.J. and Glimcher,L.H. (1999) de Escobar,G., Shupliakov,O. et al. (2008) Locomotor Requirement for the c-Maf transcription factor in crystallin gene deﬁciencies and aberrant development of subtype-speciﬁc regulation and lens development. Proc. Natl Acad. Sci. USA, 96, GABAergic interneurons caused by an unliganded thyroid 3781–3785. hormone receptor alpha1. J. Neurosci., 28, 1904–1915. 95. Han,J., Ishii,M., Bringas,P. Jr, Maas,R.L., Maxson,R.E. Jr and 112. Affar el,B., Gay,F., Shi,Y., Liu,H., Huarte,M., Wu,S., Collins,T. Chai,Y. (2007) Concerted action of Msx1 and Msx2 in regulating and Li,E. 
(2006) Essential dosage-dependent functions of the cranial neural crest cell differentiation during frontal bone transcription factor yin yang 1 in late embryonic development development. Mech. Dev., 124, 729–745. and cell cycle progression. Mol. Cell Biol., 26, 3565–3581. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/unveiling-combinatorial-regulation-through-the-combination-of-chip-5OC3RI3sAV
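The discussion describes a query-based, enumerative search: only CRMs that contain the assayed TF are considered, each must satisfy a minimum frequency and a maximum distance between binding sites, and all valid CRMs are returned as an ordered list in which the rank of a known CRM can be read off. The Python sketch below illustrates that idea on toy data. It is not CPModule: brute-force enumeration stands in for constraint programming, ranking by support is an assumption made only for illustration, and every name in it (`HITS`, `supports`, `enumerate_crms`, `rank_of`, the TF labels and positions) is hypothetical.

```python
from itertools import combinations

# Toy motif hits (all values invented): sequence id -> list of (TF label, position).
HITS = {
    "seq1": [("TF_A", 100), ("TF_B", 130), ("TF_C", 900)],
    "seq2": [("TF_A", 40), ("TF_B", 90), ("TF_C", 110)],
    "seq3": [("TF_A", 500), ("TF_C", 520)],
    "seq4": [("TF_B", 10), ("TF_C", 30)],
}


def supports(hits, motifs, max_dist):
    """Return True if one site of every motif fits in a window of max_dist bp."""
    wanted = set(motifs)
    sites = sorted((pos, tf) for tf, pos in hits if tf in wanted)
    for i, (start, _) in enumerate(sites):
        # Collect the motifs seen within max_dist of the window start.
        seen = {tf for pos, tf in sites[i:] if pos - start <= max_dist}
        if seen == wanted:
            return True
    return False


def enumerate_crms(hits_by_seq, query_tf, max_size, min_freq, max_dist):
    """List every motif set containing query_tf that meets both constraints.

    The result is sorted by decreasing support (an assumed ranking criterion),
    with ties broken alphabetically so the order is deterministic.
    """
    tfs = sorted({tf for hits in hits_by_seq.values() for tf, _ in hits})
    others = [tf for tf in tfs if tf != query_tf]
    valid = []
    for extra in range(1, max_size):
        for combo in combinations(others, extra):
            motifs = (query_tf,) + combo
            freq = sum(supports(h, motifs, max_dist) for h in hits_by_seq.values())
            if freq >= min_freq:
                valid.append((motifs, freq))
    valid.sort(key=lambda item: (-item[1], item[0]))
    return valid


def rank_of(valid, true_crm):
    """1-based rank of a known CRM in the ordered list, or None if absent."""
    target = frozenset(true_crm)
    for rank, (motifs, _) in enumerate(valid, start=1):
        if frozenset(motifs) == target:
            return rank
    return None


if __name__ == "__main__":
    crms = enumerate_crms(HITS, "TF_A", max_size=3, min_freq=2, max_dist=100)
    for motifs, freq in crms:
        print(motifs, freq)
    print("rank of {TF_A, TF_C}:", rank_of(crms, {"TF_A", "TF_C"}))
```

On this toy input the two pairs containing `TF_A` each occur in two sequences, while the three-motif combination occurs in only one and is filtered out by the frequency constraint. Relaxing `min_freq` or `max_dist` lengthens the list, which is the effect the text attributes to non-stringent screening.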

NCBI GEO: archive for high-throughput functional genomic data
Nucleic Acids Research, 37
Jun Han, M. Ishii, P. Bringas, R. Maas, R. Maxson, Y. Chai (2007)
Concerted action of Msx1 and Msx2 in regulating cranial neural crest cell differentiation during frontal bone development
Mechanisms of Development, 124
E. Wilbanks, M. Facciotti (2010)
Evaluation of Algorithm Performance in ChIP-Seq Peak Detection
PLoS ONE, 5
T. Ohyama, A. Groves (2004)
Expression of mouse Foxi class genes in early craniofacial development
Developmental Dynamics, 231
Dennie Tompers, Ruth Foreman, Qiaohong Wang, M. Kumanova, P. Labosky (2005)
Foxd3 is required in the trophoblast progenitor cell lineage of the mouse embryo.
Developmental biology, 285 1
V. Matys, O. Kel-Margoulis, E. Fricke, I. Liebich, S. Land, A. Barre-Dirrie, I. Reuter, D. Chekmenev, M. Krull, K. Hornischer, N. Voss, P. Stegmaier, B. Lewicki-Potapov, H. Saxel, A. Kel, E. Wingender (2005)
TRANSFAC® and its module TRANSCompel®: transcriptional gene regulation in eukaryotes
Nucleic Acids Research, 34
E. Affar, F. Gay, Y. Shi, Huifei Liu, Maite Huarte, Su Wu, T. Collins, E. Li, Yang Shi (2006)
Essential Dosage-Dependent Functions of the Transcription Factor Yin Yang 1 in Late Embryonic Development and Cell Cycle Progression
Molecular and Cellular Biology, 26
R. Holdcraft, R. Braun (2003)
Androgen receptor function is required in Sertoli cells for the terminal differentiation of haploid spermatids
, 131
S. Döhr, A. Klingenhoff, H. Maier, M. Angelis, T. Werner, R. Schneider (2005)
Linking disease-associated genes to regulatory networks via promoter organization
Nucleic Acids Research, 33
V. Kořínek, N. Barker, P. Moerer, E. Donselaar, G. Huls, P. Peters, H. Clevers (1998)
Depletion of epithelial stem-cell compartments in the small intestine of mice lacking Tcf-4
Nature Genetics, 19
Joonsoo Kang, Brian DiBenedetto, Kavitha Narayan, Hang Zhao, S. Der, C. Chambers (2004)
STAT5 Is Required for Thymopoiesis in a Development Stage-Specific Manner1
The Journal of Immunology, 173
Jianming Jiang, Yun-Shen Chan, Y. Loh, Jun Cai, G. Tong, Ching-Aeng Lim, P. Robson, Sheng Zhong, H. Ng (2008)
A core Klf circuitry regulates self-renewal of embryonic stem cells
Nature Cell Biology, 10
E. Davidson (2005)
Genomic Regulatory Systems: Development and Evolution
Juan Galceran, I. Fariñas, M. Depew, H. Clevers, R. Grosschedl (1999)
Wnt3a−/−-like phenotype and limb deficiency in Lef1−/−Tcf1−/− mice
D. Levy, J. Darnell (2002)
Signalling: STATs: transcriptional control and biological impact
Nature Reviews Molecular Cell Biology, 3
A. Mansouri, K. Chowdhury, P. Gruss (1998)
Follicular cells of the thyroid gland require Pax8 gene function
Nature Genetics, 19
Qing Zhou, W. Wong (2004)
CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling.
Proceedings of the National Academy of Sciences of the United States of America, 101 33
A. Kwon, A. Chou, David Arenillas, W. Wasserman (2011)
Validation of Skeletal Muscle cis-Regulatory Module Predictions Reveals Nucleotide Composition Bias in Functional Enhancers
PLoS Computational Biology, 7
J. Snow, N. Abraham, M. Ma, N. Abbey, B. Herndier, M. Goldsmith (2002)
STAT5 promotes multilineage hematolymphoid development in vivo through effects on early hematopoietic progenitor cells.
Blood, 99 1
P. Fujita, B. Rhead, A. Zweig, A. Hinrichs, D. Karolchik, M. Cline, M. Goldman, G. Barber, H. Clawson, Antonio Coelho, M. Diekhans, T. Dreszer, B. Giardine, R. Harte, Jennifer Hillman-Jackson, F. Hsu, V. Kirkup, R. Kuhn, K. Learned, Chin Li, L. Meyer, A. Pohl, B. Raney, K. Rosenbloom, Kayla Smith, D. Haussler, W. Kent (2010)
The UCSC Genome Browser database: update 2011
Nucleic Acids Research, 39
D. James, Ariel Levine, D. Besser, A. Hemmati‐Brivanlou (2005)
TGFbeta/activin/nodal signaling is necessary for the maintenance of pluripotency in human embryonic stem cells.
Development, 132 6
E. Liu, Sebastian Pott, M. Huss (2010)
Q&A: ChIP-seq technologies and the study of gene regulation
BMC Biology, 8
C. Lien, Chuanzhen Wu, Brian Mercer, R. Webb, J. Richardson, E. Olson (1999)
Control of early cardiac-specific transcription of Nkx2-5 by a GATA-dependent enhancer.
Development, 126 1
G. Macintyre, James Bailey, I. Haviv, A. Kowalczyk (2010)
is-rSNP: a novel technique for in silico regulatory SNP detection
Bioinformatics, 26
F. Beck, E. Stringer (2010)
The role of Cdx genes in the gut and in axial development.
Biochemical Society transactions, 38 2
B. Stecca, B. Nait-Oumesmar, K. Kelley, A. Voss, T. Thomas, R. Lazzarini (2002)
Gcm1 expression defines three stages of chorio-allantoic interaction during placental development
Mechanisms of Development, 115
Changhui Wang, Guo-Lei Zhou, Srilakshmi Vedantam, Peng Li, J. Field (2008)
Mitochondrial shuttling of CAP1 promotes actin- and cofilin-dependent apoptosis
Journal of Cell Science, 121
X. Liu, D. Brutlag, J. Liu (2002)
An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments
Nature Biotechnology, 20
L. Raedt, Tias Guns, Siegfried Nijssen (2008)
Constraint programming for itemset mining
G. Shyamala, Xiaodong Yang, R. Cardiff, E. Dale (2000)
Impact of progesterone receptor on cell-fate decisions during mammary gland development.
Proceedings of the National Academy of Sciences of the United States of America, 97 7
K. Cartharius, K. Frech, K. Grote, B. Klocke, M. Haltmeier, A. Klingenhoff, M. Frisch, M. Bayerlein, T. Werner (2005)
MatInspector and beyond: promoter analysis based on transcription factor binding sites
Bioinformatics, 21 13
Thomas Whitington, A. Perkins, T. Bailey (2008)
High-throughput chromatin information enables accurate tissue-specific prediction of transcription factor binding sites
Nucleic Acids Research, 37
S. Ramsey, T. Knijnenburg, Kathleen Kennedy, D. Zak, M. Gilchrist, E. Gold, Carrie Johnson, Aaron Lampano, V. Litvak, Garnet Navarro, Tetyana Stolyar, A. Aderem, I. Shmulevich (2010)
Genome-wide histone acetylation data improve prediction of mammalian transcription factor binding sites
Bioinformatics, 26
V. Sebastiano, M. Dalvai, L. Gentile, Karin Schubart, Julien Sutter, Guang‐Ming Wu, Natalia Tapia, Daniel Esch, J. Ju, K. Hübner, M. Bravo, H. Schöler, F. Cavaleri, P. Matthias (2010)
Oct1 regulates trophoblast development during early mouse embryogenesis
Development, 137
K. Tremblay, N. Dunn, E. Robertson (2001)
Mouse embryos lacking Smad1 signals display defects in extra-embryonic tissues and germ cell formation.
Development, 128 18
Korinna Henseleit, S. Nelson, K. Kuhlbrodt, J. Hennings, J. Ericson, M. Sander (2005)
NKX6 transcription factor activity is required for α- and β-cell development in the pancreas
, 132
C. Grépin, L. Robitaille, T. Antakly, M. Nemer (1995)
Inhibition of transcription factor GATA-4 expression blocks in vitro cardiac muscle differentiation
Molecular and Cellular Biology, 15
A. Gallo, T. Bie, N. Cristianini (2007)
MINI: Mining Informative Non-redundant Itemsets
A. Visel, M. Blow, Zirong Li, Tao Zhang, J. Akiyama, Amy Holt, I. Plajzer-Frick, Malak Shoukry, Crystal Wright, Feng Chen, Veena Afzal, B. Ren, E. Rubin, L. Pennacchio (2009)
ChIP-seq accurately predicts tissue-specific activity of enhancers
Nature, 457
Yi Tang, V. Katuri, A. Dillner, B. Mishra, C. Deng, L. Mishra (2003)
Disruption of transforming growth factor-beta signaling in ELF beta-spectrin-deficient mice.
Science, 299 5606
L. Selleri, M. Depew, Y. Jacobs, S. Chanda, Kwok Tsang, K. Cheah, J. Rubenstein, S. O’Gorman, Michael Cleary (2001)
Requirement for Pbx1 in skeletal patterning and programming chondrocyte proliferation and differentiation.
Development, 128 18
S. Aerts, P. Loo, G. Thijs, Y. Moreau, B. Moor (2003)
Computational detection of cis-regulatory modules
Bioinformatics, 19 Suppl 2
K. Kashimada, P. Koopman (2010)
Sry: the master switch in mammalian sex determination
Development, 137
Dan Xie, Jun Cai, Na-Yu Chia, H. Ng, Sheng Zhong (2008)
Cross-species de novo identification of cis-regulatory modules with GibbsModule: application to gene regulation in embryonic stem cells.
Genome research, 18 8
T. Inukai, T. Inaba, J. Dang, R. Kuribara, K. Ozawa, A. Miyajima, Wen‐shu Wu, A. Look, Y. Arinobu, H. Iwasaki, K. Akashi, K. Kagami, K. Goi, K. Sugita, S. Nakazawa (2005)
TEF, an antiapoptotic bZIP transcription factor related to the oncogenic E2A-HLF chimera, inhibits cell growth by down-regulating expression of the common beta chain of cytokine receptors.
Blood, 105 11
N. Shafee, Christopher Smith, Shuanzeng Wei, Y. Kim, G. Mills, G. Hortobagyi, E. Stanbridge, E. Lee (2008)
Cancer stem cells contribute to cisplatin resistance in Brca1/p53-mediated mouse mammary tumors.
Cancer research, 68 9
H. Farmer, N. Mccabe, C. Lord, A. Tutt, Damian Johnson, T. Richardson, M. Santarosa, Krystyna Dillon, I. Hickson, C. Knights, N. Martin, S. Jackson, G. Smith, A. Ashworth (2005)
Targeting the DNA repair defect in BRCA mutant cells as a therapeutic strategy
Nature, 434
P. Loo, S. Aerts, B. Thienpont, B. Moor, Y. Moreau, P. Marynen (2008)
ModuleMiner - improved computational detection of cis-regulatory modules: are there different modes of gene regulation in embryonic development and adult tissues?
Genome Biology, 9
C. Kelley, Tohru Ikeda, J. Koipally, N. Avitahl, Li Wu, K. Georgopoulos, B. Morgan (1998)
Helios, a novel dimerization partner of Ikaros expressed in the earliest hematopoietic progenitors
Current Biology, 8
M. Buck, J. Lieb (2004)
ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments.
Genomics, 83 3
I. Fijalkowska, D. Sharma, C. Bult, S. Danoff (2010)
Expression of the transcription factor, TFII-I, during post-implantation mouse embryonic development
BMC Research Notes, 3
K. Mace, T. Restivo, J. Rinn, A. Paquet, Howard Chang, D. Young, N. Boudreau (2009)
HOXA3 Modulates Injury-Induced Mobilization and Recruitment of Bone Marrow-Derived Cells
Stem Cells (Dayton, Ohio), 27
M. Frith, Yutao Fu, Liqun Yu, Jiang-fan Chen, U. Hansen, Z. Weng (2004)
Detection of functional DNA motifs via statistical over-representation.
Nucleic acids research, 32 4
M. Frith, U. Hansen, Z. Weng (2001)
Detection of cis-element clusters in higher eukaryotic DNA
Bioinformatics, 17 10
Hak-Jong Choi, Y. Geng, Hoonsik Cho, Sha Li, Pramod Giri, Kyrie Felio, Chyung-Ru Wang (2011)
Differential requirements for the Ets transcription factor Elf-1 in the development of NKT cells and NK cells.
Blood, 117 6
Yong Zhang, Tao Liu, Clifford Meyer, J. Eeckhoute, David Johnson, B. Bernstein, C. Nusbaum, R. Myers, Myles Brown, Wei Li, X. Liu (2008)
Model-based Analysis of ChIP-Seq (MACS)
Genome Biology, 9
S. Celniker, Laura Dillon, M. Gerstein, K. Gunsalus, S. Henikoff, G. Karpen, Manolis Kellis, E. Lai, J. Lieb, D. MacAlpine, G. Micklem, F. Piano, M. Snyder, L. Stein, K. White, R. Waterston (2009)
Unlocking the secrets of the genome
Nature, 459

Publisher: Oxford University Press
Copyright: The Author(s) 2012. Published by Oxford University Press.
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/gks237
pmid: 22422841

Abstract

Published online 15 March 2012. Nucleic Acids Research, 2012, Vol. 40, No. 12, e90. doi:10.1093/nar/gks237

Unveiling combinatorial regulation through the combination of ChIP information and in silico cis-regulatory module detection

Hong Sun^1,2,3, Tias Guns^4, Ana Carolina Fierro^1, Lieven Thorrez^2,5, Siegfried Nijssen^4 and Kathleen Marchal^1,6,*

^1 Department of Microbial and Molecular Systems, ^2 Department of Electrical Engineering, Katholieke Universiteit Leuven, ^3 IBBT-K.U.Leuven Future Health Department, Kasteelpark Arenberg 10, box 2446, ^4 Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, 3001 Leuven, ^5 Interdepartmental Stem Cell Institute, Katholieke Universiteit Leuven, O&N I Herestraat 49, 3000 Leuven and ^6 Department of Plant Biotechnology and Bioinformatics, Ghent University, VIB, Technologiepark 927, 9052 Gent, Belgium

Received April 19, 2011; Revised February 28, 2012; Accepted February 29, 2012

*To whom correspondence should be addressed. Tel: +32 486909943; Fax: +32 16 321963; Email: [email protected]

The Author(s) 2012. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

Computationally retrieving biologically relevant cis-regulatory modules (CRMs) is not straightforward. Because of the large number of candidates and the imperfection of the screening methods, many spurious CRMs are detected that are as high scoring as the biologically true ones. Using ChIP information not only reduces the regions in which the binding sites of the assayed transcription factor (TF) should be located, but also restricts the valid CRMs to those that contain the assayed TF (here referred to as applying CRM detection in a query-based mode). In this study, we show that exploiting ChIP information in a query-based way makes in silico CRM detection a much more feasible endeavor. To be able to handle the large datasets, the query-based setting and other specificities proper to CRM detection on ChIP-Seq based data, we developed a novel powerful CRM detection method 'CPModule'. By applying it on a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells, we demonstrate how our tool can recover combinatorial regulation of five known TFs that are key in the self-renewal of mouse embryonic stem cells. Additionally, we make a number of new predictions on combinatorial regulation of these five key TFs with other TFs documented in TRANSFAC.

INTRODUCTION

In eukaryotes, transcriptional regulation is mediated by the concerted action of different transcription factors (TFs) (1). Searching for cis-acting regulatory modules (CRMs), or combinations of motifs that often co-occur in a set of coregulated sequences, helps in unraveling the mode of combinatorial regulation. CRM detection is customarily applied to a set of intergenic regions located upstream of coexpressed genes; such genes are for example identified by microarray experiments. Except for de novo methods (2,3), most CRM detection methods rely on a motif screening step. In this step, all sites that match the given motifs of TFs are located in the selected sequences (4,5). Subsequently, a combinatorial search is performed to identify a set of motifs (called a CRM) that occurs frequently in the given set of intergenic sequences. Usually, a score is assigned to each of the obtained CRMs that assesses its statistical significance against a set of background sequences (4,5). Some methods apply a highly structured definition in which a CRM consists of combinations of motifs that need to occur in a specific order, with specific orientations, and within certain distances (6–8). Although the overrepresentation of a structured CRM in a gene set is likely to be biologically relevant (9,10), the degree to which biologically relevant CRMs are structured is still largely unknown (11). Therefore, most methods rely on less constrained CRM models, hereafter referred to as unstructured CRM detection methods (4,5,12–16).

The combinatorial search underlying CRM detection is highly complex. Adding to this complexity is the fact that often the regions containing the binding sites are not well delineated [e.g. when starting from a coexpressed gene set, the intergenic sequences typically range several thousands of base pairs upstream of the transcription start site (TSS)]. Such long intergenic sequences reduce the signal/noise ratio to such an extent that in silico CRM detection becomes ineffective. Hence the search for combinatorial regulation is often limited to the proximal promoter region, whereas at least some of the sites responsible for the observed expression behavior might also be located in regions distant from the TSS of the given genes (such as for instance in enhancers). Nowadays chromatin-immunoprecipitation (ChIP)-based techniques are becoming increasingly popular for the genome-wide identification of TF binding sites (17,18). Such techniques make it feasible to locate, at least for the assayed TF, the approximate binding regions. Using ChIP-bound sequences thus largely reduces the regions in which the binding sites of the assayed TF should be located (typically 500 bp instead of thousands of bp) (19,20) and does not restrict the search for CRMs to the proximal promoter region (7,21–23).

In addition to these obvious advantages, using ChIP information also allows for a query-based search strategy. Instead of searching for all possible CRMs in the input set as is traditionally done, we can limit our search to those CRMs that contain the ChIP-assayed TF (7,8). Incorporating knowledge of the assayed TF during CRM detection in such a query-based way allows predicting complex CRMs in which the assayed TF is involved. In this work, we studied the extent to which using ChIP-derived information can help in increasing the performance of CRM detection, compared to the application of CRM detection in a traditional non-query-based setting. To this end, we developed a novel powerful CRM detection method 'CPModule', an unstructured CRM detection method based on a constraint programming for itemset mining framework (24). Besides handling the specific challenges of CRM detection on ChIP-Seq based data, CPModule can be used in both a query and non-query-based mode. Its exhaustive search strategy allows making an assessment of the total number of valid CRMs that are present in the input set and of the degree to which a CRM of interest gets prioritized among the total number of candidates.

Applying CPModule on a well-studied ChIP-Seq data set involved in self-renewal of mouse embryonic stem cells (25) showed that using a query-based setting is in most cases the only sensible way to perform CRM detection. Besides recovering well described benchmark CRMs, we also make several novel predictions on the combinatorial regulation of the five key regulators involved in the process of self-renewal with other TFs documented in TRANSFAC (26).

MATERIALS AND METHODS

Motif screening and filtering (Figure 1A)

Motif screening

The motif models used for screening are position weight matrices (PWMs) from TRANSFAC (26). To remove redundancy among the PWMs, each of them was compared to all the other PWMs using the program MotifComparison (27). MotifComparison calculates the Kullback–Leibler distance between matrices; PWMs showing a mutual distance <0.1 were grouped. All PWMs in a group were removed except for one representative. This resulted in a final list of 516 PWMs. Motif screening was performed with the tool Clover (28). Given a PWM of W nucleotides and a sequence, it calculates a score for each subsequence of length W. A threshold on the score determines whether the subsequence is considered a potential binding site of the motif, also called a hit or a motif site. The default threshold of 6.0 was used to define a stringent screening, resulting in few hits per motif and sequence on average, while a threshold of 3.0 corresponds to a non-stringent screening resulting in many more hits.

Filtering based on nucleosome occupancy

We filtered potential motif sites by removing sites that exceed a given value for the nucleosome occupancy score (NuOS). To calculate the nucleosome occupancy score of a potential motif site, we first assigned a nucleosome occupancy probability to each base pair position, using the prediction model 'NuPoP' of Xi et al. 2010 (29). The nucleosome occupancy score was then calculated as the geometric mean of the nucleosome occupancy probabilities at all positions of the potential motif site (30).

To determine the optimal threshold values for the NuOS, we tested the effect of different filtering thresholds on their ability to: (i) reduce the number of motif site predictions per region and per TF, while (ii) not compromising too much the sensitivity of recovering true binding sites (see Supplementary File S1). Based on these tests, predicted motif sites located in regions with a low probability of nucleosome occupancy (<10%) (when using a filtering based on the NuOS) were retained.

CRM detection using constraint programming for itemset mining (Figure 1B)

CPModule uses as input the motif sites located in the input sequences by motif screening. The result of the screening and filtering step is, for each motif $M$ and sequence $S$, a set of motif hits $MH(M,S) = \{(l_1,r_1), \ldots, (l_n,r_n)\}$. Motif $M$ has a hit at $(l_i,r_i)$ with $1 \le l_i \le r_i \le |S|$; here $(l_i,r_i)$ is an interval between start position $l_i$ and stop position $r_i$ on the sequence $S$, and $|S|$ corresponds to the length of sequence $S$.

The combinatorial search problem of finding a motif set is solved by using the constraint programming (CP) for itemset mining framework (24). The core of CPModule enumerates all possible motif sets, where a motif set $\mathcal{M} = \{M_1, \ldots, M_n\}$ is defined as a subset of all screened motifs. A CRM is a motif set $\mathcal{M}$ that is valid given a set of domain-specific constraints (more details provided below): (i) the frequency constraint, which requires that the motif set occurs in a predefined minimal number of sequences $S$ from the input set, (ii) the proximity constraint, which requires that hits of motifs in a set should occur in each other's proximity. The maximal distance of the region in which hits should co-occur is specified by the user and controls the level of proximity, (iii) the redundancy constraint, which requires that the motif set $\mathcal{M}$ must be non-redundant with respect to its supersets. Optionally, (iv) a query constraint ensures that a given query motif must be part of the motif set found (Figure 1B).

Figure 1. CPModule analysis flow. (A) The input consists of a library of PWMs and a set of sequences. In the first step, prior to the actual CRM detection, a screening with public motif databases is performed. Here, we combine standard PWM screening with filtering based on nucleosome occupancy. Motif sites displaying a high nucleosome occupancy are filtered out (indicated as the transparent shapes in A). (B) The second step consists of the actual combinatorial search. Here, we use a constraint programming for itemset mining approach to enumerate all valid motif sets, i.e. combinations of motifs (i) that occur frequently in the input set (frequency constraint).
Only valid motif sets will be considered (indicated in regular boxes), while invalid ones will not (indicated in dashed boxes); (ii) of which the motif sites contributing to the motif set occur in each other’s proximity (proximity constraint). Only valid motif sets will be considered (indicated in regular boxes), while invalid ones will not (indicated in dashed boxes); (iii) that are non-redundant (redundancy constraint). The motif sets in the dashed box are redundant with the motif set in the regular box and will not be considered; (iv) that contain a query-motif (query-based constraint), which corresponds in this work to the motif of the ChIP-assayed TF. Valid motif sets are indicated in regular boxes. (C) Valid motif sets or CRMs are ﬁnally assigned a P-value that expresses their speciﬁcity for the input set. More formally, a motif set M ¼fg M , ... M will only highly efﬁcient algorithm. Consequently, we need to 1 n be considered as a potential CRM in a sequence S if and encode our problem as a constraint satisfaction problem. only if its set of hit regions (HR) on that sequence S is not We do so as follows. Motif sets are represented by empty, where the set of hit regions (HR) consists of those Boolean variables: there is a Boolean variable M for rangesðÞ l,l+ of base pairs in which all selected motifs M every possible motif, indicating whether this motif is are present, e.g. they have at least one hit with interval part of the motif set M. If a certain M ¼ 1, then we say 0 0 l ,r in that range: that the motif is in the motif set; otherwise the motif is not in the set. Furthermore, we have a Boolean variable S for HRðÞ M,S¼ðÞ l,l+ : 1 l jj S ,8M 2 M : every genomic sequence, indicating whether the motif set ð1Þ 0 0 0 0 will be considered as a potential CRM in a sequence, i.e. 9 l ,r 2 MHðÞ M,S : l l < r l+ whether S 2 ’ðM,SÞ. Lastly, we deﬁne a Boolean variable Given a set of sequences S, the subset of sequences in seqM f for every motif i and every sequence j. 
The variable ij which the set of hit regions is not empty is denoted by seqM f indicates whether motif M is in the proximity of all ij i ’ðM,SÞ. motifs in motif set M on sequence j. To ﬁnd the motif sets that fulﬁll the above require- The ‘constraints’ are imposed on these variables as ments, we use a general and principled approach based follows: on ‘constraint programming’. In constraint programming, problems are modeled as constraint satisfaction problem Proximity constraint in terms of constraints on variables. The constraint The proximity constraint couples the seqM f variables to ij speciﬁcation is then solved by a general purpose, yet the variables representing the motifs. Formally, we deﬁne e90 Nucleic Acids Research, 2012, Vol. 40, No. 12 PAGE 4 OF 16 the seqM f variables as follows for every motif on every Genome-wide enrichment score calculation and ranking ij genomic sequence: (Figure 1C) To assess the signiﬁcance of the found motif sets (CRMs), 8 : seqM f ¼ 1 $ð9ðÞ l,r 2 HR M,S ij ij j ð2Þ we calculate for each an enrichment score (P-value) 0 0 0 0 9 l ,r 2 MH M ,S : l l r rÞ i j adapting the strategy proposed in Gallo et al. (33). The main modiﬁcation is that we use a set of background In other words, if in a particular genomic sequence a hit sequences in the calculation of this score. These back- of a particular motif is within a hit region (HR)ofthe ground sequences are used to estimate the proportion motif set, this motif’s variable for that sequence must be p of sequences in the whole genome that contain the 1. Observe that seqM f =1 will hold for all motifs in the ij motif set. We compare the number of observed input motif set M, for all sequences that are in ’ðM,SÞ; however, sequences that contain a particular motif set ’ðM,SÞ there may be additional motifs that have hits in the prox- with the number of sequences that is expected to contain imity of regions in HRðM,S Þ. this motif set. 
The latter is estimated by counting the number of background sequences containing the motif Frequency constraint set ’ðM,S Þ. This set is obtained after applying background The constraint that imposes a minimum size on ’ðM,SÞ is exactly the same screening and ﬁltering strategy on the ~ ~ formalized as: S min frequency. Here, the S j j background sequences as was applied on the input sequences. Based on this estimate of the probability to variables are deﬁned in terms of the seqM f variables, ij observe a valid motif set in a random set, we calculate such to ensure that sequences are only counted (S ¼ 1) a genome-wide enrichment score (P-value) by means of if all selected motifs (8 : M ¼ 1) occur within each i i a cumulative binomial distribution: other’s proximity in that sequence (SeqM ¼ 1): ij jj S ~ g ~ jj S 8 : S ¼ 1 $ð8 : M ¼ 1 ! SeqM ¼ 1Þð3Þ i jj Si j j i i ij P valueðÞ M ¼ p ð1 pÞ : ð5Þ i¼ ’ðÞ M,S jj Redundancy constraint Where P ¼j’ðM,S Þj=jS j; S is the set of background background The redundancy constraint requires that we cannot add a input sequences; jSj is the number of input sequences; motif to a motif set without losing one sequence in its S is the set of background sequences. background corresponding sequence set ’ðM ,SÞ. This can be The background sequences were derived by sampling enforced as follows on the Boolean variables: from the mouse genome [Version mm9, NCBI Build 37, ~ g ~ UCSC database (34)], a large number of intergenic 8 : ð8 : S ¼ 1 ! SeqM ¼ 1Þ! M ¼ 1, ð4Þ i j j ij i sequences (2000 background sequences for the synthetic stating that a motif must be part of the set (M ¼ 1) if i data, 5000 background sequences for each of the on all selected sequences (8 : S ¼ 1) the motif is within j j ChIP-Seq assays). To exclude the possibility that the com- position of the background set would inﬂuence the the proximity of the others (SeqM ¼ 1). 
ij estimated background occurrences of the CRMs, we Query-based constraint compiled background sets consisting of either putative The query-based constraint requires that each motif set promoter sequences, that is sequences located upstream contains at least a given motif. This can be enforced by of a gene’s transcription start site, or sets made from back- requiring that the corresponding Boolean variable M ground sequences located in putative enhancer regions, satisﬁes the constraint that M ¼ 1. Note that the proxim- that is sequences corresponding to regions bound by the ity constraint will ensure that only motif sets will be enhancer binding proteins factor CTCF [downloaded considered with hits close enough to this given motif. from ENCODE (35)]. In our experiments, the compos- This combination of constraints is solved by a con- ition of the background sets did not inﬂuence our ﬁnal straint programming system by means of a depth ﬁrst ranking. All results presented in the article use a back- backtracking search. The search strategy alternates ground set based on proximal promoter regions. between branching, in which a variable is assigned a When dealing with ChIP-Seq data, where each sequence value from its domain (Boolean value), and propagation, is a region around an assayed transcription factor site, we the process of using a constraint to remove values from selected the background sequences such that each the domain of variables. The search strategy is similar to sequence contains at least one motif site of the assayed strategies that have been used in itemset mining (24). The TF (which does not overlap with a ChIP-bound region). main difference with traditional itemset mining is the In this setting, the number of background sequences that inclusion of proximity constraints and the inclusion of a qualiﬁes is variable for each data set. To have an equal redundancy constraint for this type of data. 
The advantage of using an existing CP system (31) is that additional constraints can be added in a modular and straightforward way, preventing the reimplementation of the itemset mining strategy from scratch. For more details on the implementation of the different constraints we refer to ref. (32).

number of sequences for each background set, we randomly sampled for each data set 5000 sequences from the set of qualifying background sequences (5310 was the maximal number of sequences that could be obtained for the data set with the smallest cognate background set). Note that with this strategy we approximate the P-value calculation in a conservative way as we cannot exclude that the background contains sequences with true sites of the assessed TF that remained unbound under the assessed conditions.

The motif sets are ranked according to their enrichment score [P-value(M), Equation (5)]. Note that with this ranking, for two redundant motif sets that occur in the same sequences, the smaller motif set will never score better than the larger one, i.e. if $M_1 \subset M_2$ with $\varphi(M_1,S) = \varphi(M_2,S)$, motif set $M_2$ will never score worse than $M_1$. This motivates the use of the redundancy constraint during the combinatorial search as it removes from a set of redundant motif sets only the least interesting sets, which would get a very low rank in any case.

Benchmarking on synthetic data

We first use a synthetic data set to compare CPModule with different CRM detection tools. The synthetic data is retrieved from Xie et al. (36). The data consists of 22 genomic sequences, each 1000 base pairs in length. In 20 sequences, sites sampled from the TRANSFAC PWMs of respectively OCT4, SOX2 and FOXD3 were inserted in a region of at most 164 bp (so the CRM encompasses maximally 164 bp). Each PWM was sampled three times per sequence. The last two sequences had no sites inserted.

CPModule was compared with related tools for unstructured CRM detection such as ModuleSearcher, obtained from the author (37) on 6 July 2008; Compo, obtained from the author (14) on 14 August 2010; Cister and Cluster-Buster, downloaded from the authors' website (15,16) on 22 June 2010 and 21 June 2010, respectively; all methods were given the non-redundant motif list described in section 'Motif screening'.

CPModule was run with a frequency threshold of 60% and a proximity threshold of 165 bp (as this was the maximum distance used when generating the data). Nevertheless, CPModule was shown not to be very sensitive to the exact value of the proximity threshold (see Supplementary File S2). For the other CRM detection tools, we similarly used the best parameter values according to the characteristics of the synthetic data (length of the sequences, the distance between two insertion sites, and the maximum size of CRM) and default values otherwise. Supplementary Table S1 lists the non-default parameters for the used tools.

We evaluated the performance of the different CRM tools using the motif (mCC) and nucleotide correlation coefficients (nCC) (5). At the motif level (mCC), a predicted motif for a sequence is a true positive (TP) if that motif was indeed part of the CRM on that sequence, otherwise it is a false positive (FP). If a motif was not predicted to belong to a CRM on that sequence, although it should have been according to the benchmark, it is counted as a false negative (FN), otherwise as a true negative (TN). As the motif level evaluation does not take the predicted sites into account, a solution is also evaluated at the nucleotide level (nCC): for every nucleotide we verify whether it was predicted to be part of the CRM and whether it should have been predicted or not, again resulting in TP, FP, FN and TN counts. These counts are aggregated over all sequences to obtain the total counts of the predicted CRM. Based on these counts, a correlation coefficient (CC) is defined as follows:

$$CC = \frac{TP \cdot TN - FN \cdot FP}{\sqrt{(TP+FN)(TN+FP)(TP+FP)(TN+FN)}} \qquad (6)$$

The value of this coefficient ranges from $-1$ to $1$. A score of $+1$ indicates that a prediction corresponds to the correct answer. Random predictions will generally result in CC values close to zero. Ideally, a CRM should score well at both the motif and nucleotide level.

CRM detection on real data

The real data set was derived from genome-wide ChIP data obtained with DNA sequencing (ChIP-Seq) for the KLF4, NANOG, OCT4, SOX2 and STAT3 transcription factors, as described by Chen et al. (25). For each transcription factor, the input set consists of 100 sequences, each corresponding to 500 bp centered around one of the top 100 ChIP-binding peaks of the assayed TF. Binding peaks were taken from the GEO file GSE11431 (38).

Screening is performed using the 516 TRANSFAC PWMs described above (section 'Motif screening'). However, as a KLF4 PWM was missing in TRANSFAC, we added to our list the PWM described by Whitington et al. (39). Whitington et al. (39) constructed the KLF4 PWM using de novo motif detection on a set of sequences involved in the development of mouse embryonic stem cells, derived from a ChIP-chip experiment by Jiang et al. (40) independent from the one used in this study.

We applied our method using three different screening results: (i) high-stringency screening; (ii) low-stringency screening; and (iii) low-stringency screening in combination with NuOS filtering. Proximity thresholds for CPModule were varied stepwise as mentioned below (see 'Results' section). The frequency threshold was set at 60% unless mentioned otherwise.

RESULTS

CPModule: CRM detection based on constraint programming for itemset mining

Algorithmic design

In contrast to what is often assumed for CRM detection on coexpressed genes, we do not expect that all sequences derived from a ChIP-Seq experiment contain the same CRM. Indeed, in the same list of ChIP-bound sequences, different CRMs might be present depending on which other TFs are needed next to the ChIP-assayed one to mediate coregulation of certain subsets of the genes in the list.

Since the same CRM need not be present in all sequences, one cannot simply take the intersection of motifs that appear in the sequences. Instead, all possible combinations of motifs have to be considered as candidate CRMs, and validated against the data. This makes CRM detection a combinatorially and computationally hard problem. Furthermore, it is not known in advance which TFs are part of the CRM, hence one would like to consider all TFs having a known motif model. Using a large number of candidate motifs makes the problem more difficult, as 100 candidate motifs means there are $2^{100}$ candidate motif sets; the problem is even more severe when using the entire TRANSFAC database, which contains more than 500 motif models. Additionally, ChIP-Seq derived datasets are large in the number of sequences (usually hundreds of peaks are detected for the assayed TF). For each candidate motif set, all these sequences will have to be processed. The fact that each motif can have multiple hits on a single sequence complicates things even further. Despite the fact that several methods for CRM detection have been developed in the past (4), the aforementioned computational issues are still challenging for CRM tools that find complex CRMs consisting of an arbitrary number of TFs.

To deal with these computational issues, we developed CPModule, a CRM detection method based on constraint programming (CP) for itemset mining (24) (for a full description of the method see 'Materials and Methods' section). CPModule searches for combinations of motifs (cis-acting regulatory modules) that are sufficiently specific for a given input set. Using a library of PWMs, a set of coregulated sequences is screened and filtered to obtain a list of hits per motif and sequence (Figure 1A). CPModule then uses this input list to enumerate all possible combinations of motifs (motif sets) that meet a set of predefined constraints (Figure 1B). These constraints define what we consider biologically relevant CRMs: first, a CRM should occur in a minimal number of input sequences (frequency constraint, Figure 1B) to be sufficiently specific for the input set, but it does not necessarily have to cover all sequences. Second, we assume that motif sites of a CRM are more likely to reflect true combinatorial regulation when they occur in each other's proximity on a single sequence than when they are scattered over long genomic distances. Therefore, the motif sites composing a CRM should occur within a maximal genomic distance from each other (proximity constraint, Figure 1B). This is not guaranteed to always be the case, which is compensated by the frequency constraint that does not require all sequences to contain the CRM.

Because motifs can have multiple binding sites on the same sequence, we observed that several of the CRMs found were redundant. We consider a found CRM to be redundant to another one if it contains a subset of the motifs of the other CRM and occurs in exactly the same sequences. In this case, the smaller CRM will have a lower statistical score anyway; hence we can discard such CRMs from consideration. To make the search more effective, we will avoid redundant CRMs during search (redundancy constraint, Figure 1B). Finally, in ChIP-Seq assayed data, one can use the fact that at least the assayed TF should be part of the CRM. For this purpose, we add a query-based option in which a 'query motif' can be provided that has to be part of the CRM. Using this knowledge before and during search (query-based constraint, Figure 1B) can make the problem computationally much more feasible.

When using the constraint programming for itemset mining framework with the above constraints, the result is a large collection of CRMs found to fulfill those constraints. We explicitly chose not to statistically evaluate the CRMs during search as this can quickly become computationally intensive, especially when evaluating the specificity of a CRM using a large collection of background sequences. Instead, we calculate an enrichment score (P-value) in a post-processing step and rank the CRMs according to this score (see Figure 1C and 'Materials and Methods' section). The added benefit is that our system does not return one CRM as being the most significant CRM, rather it returns an ordered list which a domain expert can choose from.

Benchmarking CPModule

Before applying our method on a real ChIP-Seq data set we compared its performance with that of a number of well performing CRM tools, namely ModuleSearcher (37), Compo (14), Cister (16) and Cluster-Buster (15).

Cister (16) and Cluster-Buster (15) are representative single-sequence tools. They scan each sequence individually, searching for potential CRMs that best match a predefined structure as imposed by model parameters (here a hidden Markov model). In contrast to multiple sequence tools, such as CPModule, they do not explicitly test whether the detected CRMs are specific for the input set as a whole. Nevertheless, these tools are very good at detecting CRMs in individual sequences. Because they treat sequences individually, they are computationally more efficient than multiple sequence tools. However, they cannot take advantage of the fact that sequences are coregulated.

Like CPModule, ModuleSearcher (37) and Compo (14) are multiple sequence tools. Both of them come with their own motif screening tool. ModuleSearcher searches for motif sets by using a genetic algorithm, a heuristic search method that maintains a pool of solutions which are modified to find ever better solutions. This type of search is more ad hoc and gives no guarantees that the best CRMs are found. Compo on the other hand uses techniques from itemset mining, as does CPModule. However, they differ in a number of ways: Compo is a specialized algorithm while CPModule uses a generic constraint-based methodology, allowing extra constraints such as the redundancy constraint and the query-based constraint to be incorporated in a principled way. Additionally, while Compo has a strong focus on multi-parameter search and optimization, CPModule does exhaustive search and calculates the significance of a CRM using a large collection of background sequences in a second step.

For benchmarking we used the synthetic data constructed by Xie et al. (36). This data set contains intergenic sequences in which 'true motif sites' are inserted. The performance of CRM detection was assessed by comparing the best scoring solution of each algorithm with the known 'true' solution. The quality of the solutions was evaluated both at the motif and nucleotide level using respectively a motif correlation coefficient (mCC) and a nCC (5) (see 'Materials and Methods' section).

Finding CRMs in multiple long sequences using all 516 TRANSFAC PWMs is a challenging task. Table 1 shows a comparison of the different tools on our synthetic benchmark data. The single sequence tools Cister (16) and Cluster-Buster (15) perform rather poorly (<0.25 for both mCC and nCC). Because they screen sequences individually, they predict CRMs with a large number of motifs that differ from sequence to sequence, resulting in bad scores. ModuleSearcher (41) could not handle the large number of PWMs and repeatedly ran into memory problems, even with 2GB of RAM allocated. When limiting the maximum number of motifs in a CRM to 10 or less, Compo (14) found the solution reported in Table 1. It scores very well at the nucleotide level (0.68), meaning that it finds the binding region on the sequences quite accurately. However, at the motif level it performs rather poorly (0.27); it wrongly identifies the motifs that bind in the identified regions. CPModule scores worse (0.55 versus 0.68) at the nucleotide level than Compo, but scores considerably better (0.57 versus 0.27) at the motif level.

To allow for a more thorough comparison of the different tools, we reanalyzed the dataset using each time a subset of the 516 motif models. Starting from the PWMs of the three inserted TFs, we gradually increase the number of total PWMs by sampling them from the set of 513 remaining ones. This results in an increasingly larger input in terms of candidate motifs, which contain increasingly more false motif models. This makes the problem increasingly harder. Figure 2 shows the motif and nucleotide-level correlation coefficients (CC) of the different tools on the data. The number of motif models used is shown on the x-axis (for each sample size, 10 different samples are created and all tools are run using the same sample sets). The single-sequence-based tools Cister (16) and Cluster-Buster (15) perform best in the presence of few sampled PWMs and deteriorate as more PWMs are added. With few PWMs their predictions are accurate, especially regarding the binding region of the CRMs (nucleotide level). However, because they treat sequences independently, different false positive motifs

Table 1. Comparison of CRM prediction algorithms

Score | Cister | Cluster-Buster | ModuleSearcher | Compo | CPModule
mCC | 0.16 | 0.05 | / | 0.27 | 0.57
nCC | 0.23 | 0.23 | / | 0.68 | 0.55

The tools were run on the synthetic data set of Xie et al. (36) using a stringent screening with 516 TRANSFAC PWMs. Slash indicates termination by lack of memory.
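For readers who want to reproduce scores such as those in Table 1, a minimal Python sketch of the motif-level correlation coefficient of Equation (6) might look as follows. The data layout (dictionaries mapping sequence identifiers to motif sets), the function names, the toy motif sets and the convention of returning 0 when the denominator vanishes are assumptions made for illustration; they are not taken from the original benchmark code.

```python
from math import sqrt


def motif_level_counts(predicted, truth, all_motifs):
    """Aggregate TP, TN, FP and FN over all sequences at the motif level."""
    tp = tn = fp = fn = 0
    for seq_id, true_motifs in truth.items():
        pred = predicted.get(seq_id, set())
        for motif in all_motifs:
            in_pred = motif in pred
            in_true = motif in true_motifs
            if in_pred and in_true:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_true:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def correlation_coefficient(tp, tn, fp, fn):
    """Equation (6); returns 0.0 for a zero denominator (assumed convention)."""
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0


# Hypothetical toy benchmark: two sequences, four candidate motifs.
all_motifs = {"OCT4", "SOX2", "FOXD3", "CEBP"}
truth = {"s1": {"OCT4", "SOX2", "FOXD3"}, "s2": {"OCT4", "SOX2", "FOXD3"}}
predicted = {"s1": {"OCT4", "SOX2", "FOXD3"}, "s2": {"OCT4", "CEBP"}}

counts = motif_level_counts(predicted, truth, all_motifs)
mcc = correlation_coefficient(*counts)
```

The nucleotide-level coefficient can be obtained with the same `correlation_coefficient` function once the four counts are collected per nucleotide instead of per motif.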
sequences independently, different false positive motifs Figure 2. Performance comparison of CRM detection tools. All CRM detection tools were run on the synthetic data set of Xie et al. (36). Screening was performed with the PWMs used to generate the synthetic data in combination with an additional set of PWMs sampled from TRANSFAC (the number of PWMs added to the true PWMs is indicated on the x-axis). (A) mCC, (B) nCC. e90 Nucleic Acids Research, 2012, Vol. 40, No. 12 PAGE 8 OF 16 are predicted on each sequence separately. This becomes Assessing the added value of ChIP-based information on worse as the number of sampled PWMs increases, leading detecting CRMs involved in mouse embryonic stem cell to decreasing scores. For the multiple sequence tools Description of the experimental set up Compo (14), ModuleSearcher (41) and CPModule this To show the effect of using ChIP-Seq information on problem is less pronounced. The behavior of Compo improving the performance of CRM detection, we changes as the number of motifs increases: at the motif relied on publicly available ChIP-Seq experiments con- level, the score has a decreasing trend, while at the nucleo- ducted by Chen et al. (25). The data consist of tide level the score increases. Looking at the CRMs found, ChIP-Seq experiments for ﬁve key TFs involved in we observe that Compo ﬁnds CRMs with only one true self-renewal of mouse embryonic stem cells, namely motif and increasingly more false motifs as the number of KLF4, NANOG, OCT4, SOX2 and STAT3 for which PWMs increases, explaining the motif-level behavior. it is known that combinatorial interactions exist Unexpectedly, adding these false motifs seems to contrib- amongst at least some of these ﬁve TFs. These previously ute rather than to deteriorate the precision at the nucleo- known interactions, corresponding to nine different tide level. The shorter predicted binding regions obtained CRMs were used as a benchmark (Figure 3). 
Starting by adding more motifs seem to coincidentally better ap- from the data of a ChIP-Seq experiment of a single proximate the regions in which also the true CRMs are assayed TF, we used CRM detection to discover, located. The scores of CPModule and ModuleSearcher in silico, the other TFs with which the assayed TF con- score well on both the motif- and nucleotide-level, while stitutes aCRM.Wethentestedtowhatextentwecould being less affected by the number of motifs used. recover the previously described benchmark CRMs using However, ModuleSearcher is unable to scale to more either a query-based or non-query-based setting. The than 400 motifs because of memory issues, while this is non-query-based setting mimics the traditional way in not a problem for CPModule. which CRM detection is being performed, that is trying These results show that CPModule is competitive with to prioritize a CRM that is enriched in a set of sequences state-of-the-art tools in detecting CRMs in sets of without using any further prior information. In the coregulated sequences. This, in combination with its query-based setting, only CRMs that contain a motif capability of handling a large number of sequences, prioritizing CRMs by means of ranked lists and the for the assayed TF are searched for. Prior to the CRM ability to be used in a query-based mode, make it ideally detection, we ﬁrst optimized screening thresholds to suited for this study and for the analysis of ChIP-Seq data reduce the effect of the screening on the success of the in general. CRM detection. Figure 3. Known combinatorial regulation of the ﬁve assayed TFs. Network representing combinatorial interactions between the ﬁve transcription factors (KLF4, SOX2, OCT4, NANOG and STAT3) involved in embryonic stem cell development. 
Edges indicate that a combinatorial interaction between the indicated TFs exists as reported in literature (with a combinatorial interaction referring to the fact that at least subsets of genes contain binding sites for both TFs in each other's neighborhood). Dashed lines correspond to the interactions in the benchmark that were missed by CPModule. Solid lines correspond to the interactions in the benchmark that were recovered by CPModule. The thin line indicates that the interaction was detected using CPModule on the ChIP-Seq data set of one TF while the thick line indicates that the interaction was detected by using either ChIP-Seq dataset of the TFs involved in the interaction.

Screening and filtering

We started from the top 100 binding peaks identified for each ChIP-Seq-assayed TF (assuming that those represent the most reliable binding sites). It was recently shown that the sites of the assayed TF do not exactly coincide with their binding peaks, but can be located as far as 250 bp from the actual peak (42,43). Therefore, we used a sequence region of 500 bp centered around each of the top 100 peaks as input sequences.

Ideally, the result of the screening should have a high sensitivity, while containing only few false positive motif sites. Using a stringent threshold would bias towards only finding the most 'conserved motif sites'. However, true sites do not necessarily correspond to the most conserved ones (44,45). Using a low stringency filtering might contrarily result in the inclusion of too many false positives, possibly deteriorating the CRM detection. Therefore, in addition to a screening with either a high or a low stringency threshold, we also used a low stringency screening combined with a filtering based on nucleosome positioning, as nucleosome positioning plays a role in determining the accessibility of a site (39). As the information on condition- and tissue-dependent nucleosome occupancy is not readily available, we relied on the NuOS which has also previously been used in the context of motif detection (30) (see 'Materials and Methods' section). Motif sites located in regions that show a high NuOS are considered to be transcriptionally inactive (46) and were therefore filtered out.

One advantage of starting from ChIP-Seq based information is that it allows approximating, at least for the assayed TF, the effect of the screening/filtering on the recovery rate of its binding sites. This effect on recovering the binding sites of the assayed TF can be seen as representative for the effect of the screening/filtering on recovering sites of any other TF. In Figure 4, we display for each input set (sequences corresponding to the 100 binding regions of each assayed TF) the sensitivity in recovering binding sites of the assayed TF after applying different screening/filtering procedures. The sensitivity is expressed as the percentage of the input set in which a motif site of the assayed TF could be detected. A sensitivity of 100% thus corresponds to retrieving at least one motif site for the assayed TF in each of the 100 high scoring binding regions. As can be expected, a stringent screening results in a rather low sensitivity for most of the binding regions of the assayed TFs (sensitivity <50%). Lowering the screening stringency largely increases this sensitivity. At least 80% of the binding peaks for respectively KLF4 (84%), NANOG (80%), OCT4 (98%), SOX2 (95%) and STAT3 (100%) were found to contain a motif site for their respective TFs. However, this increased sensitivity comes at the expense of also predicting many more potentially false positive sites. The number of false positives that result from a screening/filtering procedure is harder to estimate as we have no clue about the identity or location of the true sites. Therefore, we estimated the false discovery rate by the average number of motif sites per TF and per sequence region retained after applying different screening procedures (Figure 4B). The average number of sites per screened TF should be sufficiently low. Note that this average number does not exclude that some TFs can have multiple motif sites in the same sequence while others might have none. Applying a low stringency screening resulted in each of the analyzed data sets in the detection of, on average, four motif sites per TF and per sequence (Figure 4B). This was much lower in case stringent screening was used.

Combining the non-stringent screening with a filtering based on nucleosome occupancy (that is, removing sites predicted to be located in nucleosome occupied regions) seems to offer a good trade-off between the sensitivity and the number of false positive sites as is shown in Figure 4. Compared to stringent screening, applying the filtering after a non-stringent screening largely increases the sensitivity for most of the assayed TFs, while still maintaining the number of predicted sites per TF within a reasonable range. In the remainder of the analysis, filtering was applied on the low stringency predicted sites of all TFs except for those of the assayed one. For the assayed TF, filtering was omitted as the ChIP-derived evidence experimentally supports that each ChIP-bound region contains binding sites of that TF.

Figure 4. Effect of different screening/filtering combinations on motif prediction results. (A) Effect of using different screening/filtering combinations on the sensitivity of recovering true sites of the assayed TF. Sensitivity is assessed by the percentage of binding peak regions in which a motif site of the assayed TF could be detected. (B) Effect of using different screening/filtering combinations on the average number of remaining motif sites per sequence and per TF for each of the ChIP-assayed data sets. In each panel, we used a stringent screening, a non-stringent screening without filtering and a non-stringent screening with a filtering based on NuOS, respectively (the categories are indicated in this order by bars with increasing gray scales).

CRM detection

CPModule was run using for each assayed TF the sequences of the top 100 peak regions, as well as the motif sites identified by the screening and filtering. For the proximity threshold, we started for each data set from 150 bp and extended this value stepwise (50 bp at a time) to maximally 400 bp. The value of the frequency threshold was set to 60%. Predicted CRMs were ranked according to their P-values. The higher the rank of a benchmark CRM (that is, a previously known CRM, see Figure 3), the better the algorithm was able to prioritize the CRM amongst the total number of predicted CRMs.

The results obtained by running CPModule after screening with a stringent threshold (for all TFs except the assayed one) (Supplementary Table S2) show that a stringent screening threshold lowers the sensitivity of retrieving true binding sites to such an extent that none of the benchmark CRMs can be retrieved at the predefined frequency threshold of 60%. Only subsequently lowering the frequency threshold to 50% allowed a benchmark CRM to be recovered (1 out of the 9). To calculate benchmark recovery we considered all solutions obtained with all possible proximity thresholds irrespective of their rank. Without filtering, the number of binding sites per motif and sequence became so high that the search for CRMs was computationally prohibitive.

The most informative results and highest coverage of benchmark CRMs were obtained using CPModule with non-stringent screening and filtering based on the NuOS. Seven of the nine previously described CRMs involving KLF4, NANOG, OCT4, SOX2 or STAT3 were recovered.

Table 2 shows, for each of the assayed TFs and for each proximity threshold, the highest ranking recovered benchmark CRM together with its rank amongst the total number of solutions. In addition to their rank we show for each CRM its support as an additional quality criterion, indicating how many of the 100 given peak regions contained the predicted CRMs (the higher this value, the more specific the detected CRM is for the given dataset). For all data sets, at the one but lowest proximity threshold (200 bp) most of the benchmark CRMs were retrieved (6 of the 9 benchmark CRMs). Further increasing the proximity threshold results in the same benchmark CRMs also obtained with a lower threshold, albeit most of the time at a lower rank and/or in combination with other motifs. Increasing the proximity threshold will increase the number of valid CRMs. For NANOG, one additional benchmark CRM was retrieved at a higher proximity threshold than the one for which the first CRM containing the assayed TF was found. Most of the TFs involved in the benchmark interactions, therefore, have binding sites in a rather close proximity on the genome.

Predicted CRMs were also validated using the available ChIP-Seq experimental information: if a previously described interaction between the analyzed TF and any of the other four benchmark TFs was predicted by CPModule, the ChIP-Seq data of the other TFs were used to experimentally verify whether their predicted sites in the retrieved CRMs coincided with their binding peaks. Predicted sites of TFs for which experimental data was available overlapped for at least 10% (and often more) with their corresponding binding peaks (Column 'Validation' in Table 2). This indicates that most of the benchmark CRMs predicted by CPModule reflect true CRM signals present in the ChIP-Seq data.

The added value of using ChIP-Seq derived information becomes obvious when comparing the rank obtained for each benchmark CRM in the 'non-query-based' setting with the one obtained using the 'query-based' setting. The first setting mimics a classical CRM detection set up, i.e. when searching for CRMs that are statistically overrepresented in a set of coregulated sequences. The query-based setting differs from the classical setting by only listing CRMs that contain a site for the ChIP-assayed TF. Table 2 shows that in the query-based setting, the rank of the benchmark CRMs is in many cases much better than the rank in the non-query-based setting. This is especially true for KLF4, NANOG, SOX2 and STAT3. For OCT4, the ranks are more similar for both settings, but in the query-based setting far fewer candidate CRMs are returned, and hence the computation is much more efficient. Our results show that exploiting ChIP-Seq information, by constraining the candidate CRMs to only those that contain the assayed TF, not only leads to a compact result set with in many cases a better ranking of the true CRMs, but more importantly

Table 2.
Benchmark CRMs obtained with CPModule in combination with filtering (non-stringent screening with filtering for all TFs except the assayed one)

| ChIP-Seq-assayed TF | CRM | Support (%) | Proximity threshold (bp) | Query-based Rank/Total | Non-query-based Rank/Total | Validation (%) |
|---|---|---|---|---|---|---|
| KLF4 | KLF4, STAT1 | 66 | 150 | 19/23 | 845/849 | 22.73 |
| KLF4 | KLF4, OCT1 | 60 | 200 | 19/46 | 5562/5694 | 33.33 |
| KLF4 | KLF4, STAT1, [CEBP] | 61 | 200 | 22/46 | 5635/5694 | 23.33 |
| KLF4 | KLF4, STAT4, [SMAD] | 61 | 250 | 16/98 | 6160/7029 | 22.95 |
| KLF4 | KLF4, OCT1 | 60 | 250 | 65/98 | 6994/7029 | 33.33 |
| KLF4 | KLF4, STAT4, [T3R] | 63 | 300 | 23/183 | 6903/7704 | 22.22 |
| KLF4 | KLF4, OCT1 | 61 | 300 | 131/183 | 7651/7704 | 32.79 |
| KLF4 | KLF4, STAT4, STAT1, [CDXA, LEF1] | 60 | 350 | 29/284 | 25 056/26 843 | 23.33 |
| KLF4 | KLF4, OCT1 | 61 | 350 | 212/284 | 26 771/26 843 | 32.79 |
| KLF4 | KLF4, STAT4, [SMAD, T3R] | 60 | 400 | 5/468 | 24 930/31 549 | 21.67 |
| KLF4 | KLF4, OCT1, [CDXA] | 60 | 400 | 207/468 | 31 220/31 549 | 33.33 |
| NANOG | NANOG, STAT5A_03, STAT5A_04 | 62 | 150 | 1/11 | 5930/5941 | 19.35 |
| NANOG | NANOG, STAT5A_04, [PU1] | 60 | 200 | 1/39 | 40 171/43 093 | 23.33 |
| NANOG | NANOG, STAT5A_03, [PU1] | 61 | 250 | 1/71 | 62 475/64 059 | 21.31 |
| NANOG | NANOG, OCT1 | 60 | 250 | 21/71 | 64 006/64 059 | 68.33 |
| NANOG | NANOG, OCT1, [FAC1] | 60 | 300 | 1/145 | 66 186/80 859 | 70.49 |
| NANOG | NANOG, STAT3, [FAC1] | 62 | 300 | 3/145 | 77 724/80 859 | 26.67 |
| NANOG | NANOG, STAT5A_04, [PU1, FAC1] | 60 | 350 | 1/406 | 159 818/217 328 | 25.00 |
| NANOG | NANOG, OCT1, STAT5A_04, [FAC1] | 60 | 350 | 2/406 | 167 806/217 328 | 66.67 (OCT1); 26.67 (STAT5A_04) |
| NANOG | NANOG, STAT5A_04, [PU1, HNF3, AR] | 60 | 400 | 1/883 | 204 024/299 409 | 23.33 |
| NANOG | NANOG, OCT1, STAT6, STAT5A_04, [FAC1] | 60 | 400 | 2/883 | 224 495/299 409 | 70.00 (OCT1); 26.67 (STAT6) |
| OCT4 | OCT4, STAT6, [XFD2, FOXJ2, FOXP3] | 63 | 150 | 6/1322 | 30/11 966 | 14.75 |
| OCT4 | OCT4, SOX2 | 60 | 150 | 1272/1322 | 10 348/11 966 | 78.33 |
| OCT4 | OCT4, STAT4, STAT6, [PAX2, PAX4, TITF1] | 62 | 200 | 1/13 141 | 6/111 817 | 16.39 |
| OCT4 | OCT4, SOX2, [PAX2] | 62 | 200 | 11 740/13 141 | 83 797/111 817 | 79.03 |
| OCT4 | OCT4, STAT4, STAT6, [PAX4, PAX2, ELF1] | 66 | 250 | 1/29 767 | 23/182 697 | 16.67 |
| OCT4 | OCT4, SOX2, [CDXA] | 60 | 250 | 28 080/29 767 | 167 142/182 697 | 81.67 |
| OCT4 | OCT4, STAT3 | 61 | 300 | 1/73 091 | 7/235 252 | 14.75 |
| OCT4 | OCT4, SOX2, [CDXA, PAX2] | 60 | 300 | 68 944/73 091 | 217 331/235 252 | 75.00 |
| OCT4 | OCT4, STAT3, [CDXA] | 60 | 350 | 1/290 997 | 1/859 377 | 12.90 |
| OCT4 | OCT4, SOX2, [PAX2, FOXP3] | 60 | 350 | 106 443/290 997 | 296 722/859 377 | 73.33 |
| OCT4 | OCT4, STAT3, STAT5A_03 | 60 | 400 | 1/383 001 | 11/1 080 139 | 14.75 |
| OCT4 | OCT4, SOX2, [PAX2, FOXP3] | 60 | 400 | 150 936/383 001 | 449 140/1 080 139 | 73.33 |
| SOX2 | SOX2, OCT4 | 68 | 150 | 1/6318 | 322/46 471 | 88.24 |
| SOX2 | SOX2, STAT5A_04, [NKX62, AR, HELIOSA] | 60 | 150 | 3/6318 | 840/46 471 | 23.33 |
| SOX2 | SOX2, STAT5A_04, [GEN_INI2_B, FOXJ2, HNF3ALPHA, CEBP, AR] | 60 | 200 | 2/90 416 | 55/512 702 | 25.00 |
| SOX2 | SOX2, OCT4, [CDXA, TST1] | 61 | 200 | 4/90 416 | 106/512 702 | 27.87 |
| SOX2 | SOX2, STAT1, STAT5A_04, [SRY, CAP, NFAT, TEF, AR, CDX, HMGIY, BRCA] | 60 | 250 | 1/168 760 | 55/790 791 | 25.00 |
| SOX2 | SOX2, OCT4, [CDXA, CDX2] | 62 | 250 | 4/168 760 | 183/790 791 | 87.10 |
| SOX2 | SOX2, OCT4, [CDXA, CDX, CEBP] | 60 | 300 | 1/303 533 | 94/1 256 190 | 86.89 |
| SOX2 | SOX2, STAT, STAT5A_03, [CAP, NFAT, GEN_INI2_B, FOXJ2, CEBP] | 60 | 300 | 2/303 533 | 238/1 256 190 | 21.67 |
| STAT3 | STAT3, OCT4, [CAP] | 61 | 150 | 36/5649 | 43/6426 | 36.07 |
| STAT3 | STAT3, SOX2, [IRF1] | 61 | 150 | 2651/5649 | 3018/6426 | 31.15 |
| STAT3 | STAT3, OCT4, [CEBP] | 61 | 200 | 301/32 257 | 312/33 640 | 29.51 |
| STAT3 | STAT3, SOX2, STAT6, [YY1] | 60 | 200 | 4532/32 257 | 4675/33 640 | 30.00 |
| STAT3 | STAT3, OCT4, [CAP, TEF1, YY1, PR] | 60 | 250 | 57/54 549 | 61/56 473 | 31.67 |
| STAT3 | STAT3, SOX2, STAT6, [SRY, IRF1] | 60 | 250 | 6363/54 549 | 6666/56 473 | 28.33 |
| STAT3 | STAT3, OCT1, [CAP, FOXM1, YY1, PR] | 60 | 300 | 186/73 378 | 188/74 106 | 33.33 |
| STAT3 | STAT3, SOX2, STAT6, [XPF1] | 60 | 300 | 8442/73 378 | 8517/74 106 | 28.33 |
| STAT3 | STAT3, OCT, STAT5A_03, STAT6, [HOXA3, AP2REP, PU1] | 60 | 350 | 6/243 758 | 6/243 979 | 35.00 |
| STAT3 | STAT3, SOX2, STAT1, STAT4, STAT5A_04, STAT6, [HNF3, YY1] | 60 | 350 | 21 046/243 758 | 21 066/243 979 | 26.67 |
| STAT3 | STAT3, OCT, STAT5A_03, [AP2REP, PU1, XPF1] | 61 | 400 | 12/308 757 | 12/308 993 | 36.07 |
| STAT3 | STAT3, SOX2, [HOXA3, AP2REP] | 62 | 400 | 21 375/308 757 | 21 385/308 993 | 29.03 |

In this table, only benchmark CRMs recovered by CPModule are displayed. For reasons of conciseness, we only display for each parameter setting the
best-ranked versions of each of the benchmark CRMs. For instance, whereas Oct4-Sox2 was found to be the best-ranked CRM at a proximity threshold of 150 bp, more combinations of Oct4 and Sox2 with other TFs were also detected at this setting of the proximity parameter, albeit at lower ranks. These alternative versions with lower rank are not displayed in the table. If PWMs for TFs belonging to the same family are very similar, we also considered as true those CRMs that contained another member of the same family rather than the TF reported in the literature (48) (this was the case for TFs of the STAT and OCT families). The set of sequences corresponding to the top 100 scoring peak regions of the assayed TF was screened with a set of 517 TRANSFAC motifs using a non-stringent screening threshold. Filtering was applied to all motif sites except those of the assayed TF.

ChIP-Seq-assayed TF: the TF whose top 100 binding peaks were used to perform the analysis.
CRM: obtained CRMs that correspond to previously well-described CRMs for the assayed TF; other TFs that were predicted to belong to the same CRM, but that have not previously been described to interact with the assayed TF, are indicated between square brackets.
Support: the percentage of sequences from the input set in which this CRM occurs (always higher than the frequency threshold).
Proximity threshold (bp): the proximity threshold with which the displayed CRM was found.
Query-based Rank/Total: the rank this CRM received in the query-based setting/the number of solutions containing the motif for the ChIP-Seq-assayed TF.
Non-query-based Rank/Total: the rank this CRM received among all solutions/the total number of valid CRMs.
Validation: we started from the ChIP-Seq data of one TF and used CRM detection to predict with which other TFs the assayed TF interacts. We then verified whether the motif sites contributing to the predicted CRMs fell within the binding peaks of the other ChIP-Seq-assayed TF.
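The two Rank/Total columns and the Validation column defined above can be illustrated with a minimal Python sketch. It is not CPModule's implementation: the `RankedCRM` class, the `score` field (assumed here to be higher for better CRMs), the helper names and the toy data are all hypothetical, and validation is counted per motif site as an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RankedCRM:
    """Hypothetical record: the motif (TF) names in a CRM and its score."""
    motifs: frozenset
    score: float  # assumption: a higher score means a better CRM


def rank_of(target: frozenset, crms: List[RankedCRM],
            query_tf: Optional[str] = None) -> Optional[Tuple[int, int]]:
    """Return (rank, total) of `target` in the score-ordered list.

    With `query_tf` set, only CRMs containing that TF are kept, which
    mimics the query-based setting; otherwise all valid CRMs are ranked.
    """
    pool = [c for c in crms if query_tf is None or query_tf in c.motifs]
    pool.sort(key=lambda c: c.score, reverse=True)
    for rank, crm in enumerate(pool, start=1):
        if crm.motifs == target:
            return rank, len(pool)
    return None  # target CRM was not among the solutions


def validation_percent(sites: List[Tuple[str, int]],
                       peaks: List[Tuple[str, int, int]]) -> float:
    """Percentage of predicted partner-TF sites (chrom, position) that lie
    inside a ChIP-Seq peak (chrom, start, end; inclusive) of that TF."""
    if not sites:
        return 0.0
    hits = sum(
        any(chrom == p_chrom and start <= pos <= end
            for p_chrom, start, end in peaks)
        for chrom, pos in sites
    )
    return 100.0 * hits / len(sites)


# Toy data, invented purely for illustration.
crms = [
    RankedCRM(frozenset({"KLF4", "CEBP"}), 9.0),
    RankedCRM(frozenset({"SOX2", "OCT4"}), 7.5),
    RankedCRM(frozenset({"STAT3", "YY1"}), 7.0),
    RankedCRM(frozenset({"SOX2", "AR"}), 5.0),
]
target = frozenset({"SOX2", "OCT4"})
sites = [("chr1", 120), ("chr1", 900), ("chr2", 50), ("chr2", 400)]
peaks = [("chr1", 100, 200), ("chr2", 0, 100)]
```

On this toy input, `rank_of(target, crms)` gives `(2, 4)` while `rank_of(target, crms, query_tf="SOX2")` gives `(1, 2)`: restricting the list to CRMs that contain the assayed TF shrinks the total and can only improve the rank of the target. `validation_percent(sites, peaks)` gives `50.0`.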
Table 2 reads as follows. For instance, when starting from the ChIP-Seq data of SOX2, we predicted a previously described CRM containing SOX2-OCT4. This retrieved CRM was ranked first amongst the 6318 potential CRMs that contained SOX2 (rank in the query-based mode) and ranked 322 out of the total number of 46 471 possible CRMs in the non-query-based mode. SOX2 and OCT4 co-occurred in 68% of the SOX2 ChIP-Seq identified regions (Support) within a distance of 150 bp, and the identified sites for OCT4 in the predicted CRM fell within the identified OCT4 ChIP-Seq regions in 88.24% of the cases. As an example of how the same CRM can be detected at different proximity thresholds: the CRM containing KLF4-OCT1 was recovered at a proximity constraint of 200, 250, 300 and 350 bp, but with an increasingly lower absolute rank in the non-query-based setting. With the current screening/filtering, all runs could be performed except those for SOX2 with proximity thresholds of 350 bp and 400 bp. These did not finish after 7 days.

they also show that in the absence of such information, CRM detection becomes almost infeasible.

Novel predicted CRMs involved in embryonic stem cell regulation

Besides the results on the benchmark set, we also displayed for all assayed TFs their top three ranking CRMs predicted by CPModule, for the different proximity thresholds (Supplementary Table S3). Note that those CRMs score in most cases much better than the benchmark CRMs.

Functional analysis (Ingenuity Pathway analysis and literature-based analysis) of the 56 TFs that were involved in the predicted CRMs of KLF4, NANOG, OCT4, SOX2 or STAT3 showed that most of the TFs have functions related to development (embryonic development, cellular development, tissue development, organ development, organismal development), cellular growth and proliferation, cancer and tissue morphology, all functionalities that could be related to ESC cell growth, death and differentiation (see Supplementary File S3).

For a handful of the predicted CRMs (KLF4-STAT4; OCT4-CDXA; OCT4-PAX2; OCT4-STAT; OCT4-SRY; SOX2-OCT4), it was previously proven that their contributing TFs were involved in combinatorial regulation (see Supplementary Table S3 for a list of references). For most of the other CRMs we could find indirect literature-based support, suggesting the plausibility of the predicted interactions. For instance, for NANOG-FAC1, the putative transcriptional regulator FAC1 has been shown to be expressed in embryonic and extra-embryonic tissues of the early mouse conceptus and was shown to be essential for trophoblast differentiation during early mouse development (47), making an interaction between NANOG and FAC1 plausible. Other examples are described in Table 3.

Table 3. Indirect evidence for the suggested CRMs in Supplementary Table S3

| Suggested CRM | Indirect evidence |
|---|---|
| NANOG-FAC1 | The putative transcriptional regulator FAC1 is expressed in embryonic and extraembryonic tissues of the early mouse conceptus. Study (47) showed that FAC1 is essential for trophoblast differentiation during early mouse development. Thus, there might be an interaction between NANOG and FAC1. |
| OCT4-FOXM1 | Foxm1 has been hypothesized to be one of the candidates to help reprogramming somatic cells into iPSCs (induced pluripotent stem cells) (49). FOXM1 is a major stimulator of cell proliferation (50), so it might interact with KLF4 in the self-renewal process. |
| SOX2-CDXA | Binding of the homeobox domain from CDX1 protein and SOX2 protein was shown to occur in a system of purified components (51). Note that we identified a CRM with CDXA and not with CDX1. However, CDXA and CDX1 belong to the same protein family and have very similar motif models. |
| SOX2-BRCA | Roles of BRCA in both homologous recombination and non-homologous end joining DNA repair have been shown (52,53). Such a function of BRCA might also play a role during the self-renewal process to repair DNA damage. |
| STAT3-HOXA3 | As HOXA3 is involved in wound repair (54), interaction with STAT3 in the self-renewal process is plausible. |
| STAT3-GATA1 | GATA1 was known to be one of the major transcription factors that stimulated cardiogenesis during development (55–57) even though it is frequently used as a marker for endodermal derivatives during differentiation of pluripotent stem cells (58). Interaction with STAT3 in the self-renewal process is therefore plausible. |
| STAT1-STAT3-STAT6 | Binding of human STAT3 protein and human STAT6 protein has been shown in a 2-hybrid assay (59). STAT1 and STAT3 can form heterodimers (60,61). Note, however, that with STAT motif models it is difficult to make the distinction between the different STAT members. |

Indirect evidence derived from literature searches, which gives indications of possible interactions between the indicated TFs found in the predicted CRMs.

DISCUSSION

In this work, we developed CPModule, a novel approach for CRM detection with a performance that is competitive with that of other state-of-the-art tools, while being able to handle larger data sets (such as 100 sequences in combination with a library of 517 PWMs). The advantage of CPModule is that it builds upon a constraint programming for itemset mining framework (24). This approach provides fast search techniques similar to those in itemset mining, while allowing additional constraints to be imposed freely on the CRMs. These constraints, such as the proximity and redundancy constraints, can help prioritize likely CRMs. At this point our system focuses on finding loosely structured CRMs that satisfy multiple constraints, such as a minimum frequency constraint or a constraint on the maximum distance between binding sites. It can easily be extended to find unstructured CRMs under additional constraints. The discovery of structured CRMs, similar to the approaches proposed by Noto and Craven (6) and by Cartharius et al. (8) in FrameWorker, is outside the scope of the current approach.

The flexible framework also allows us to use CPModule in a query-based setting when dealing with ChIP-Seq-derived data, that is, we can search only for CRMs that contain the assayed TF. Because of its enumerative approach, CPModule outputs all valid CRMs as an ordered list rather than returning one CRM as being the most significant CRM. Having an idea of the rank of a true CRM amongst the total number of valid CRMs gives an intuition of the difficulty of computationally retrieving a true CRM in a particular data set. We used this property to assess the contribution of ChIP-Seq data to prioritizing true CRMs. Our results on real datasets showed that in the absence of ChIP-Seq-based information, biologically relevant CRM detection is almost infeasible.

The success of CRM detection also depends on the quality of the input data, which is the set of motif sites predicted by screening using motif models such as PWMs. An overly dense collection of motif sites (hits) obtained by a non-stringent screening threshold usually results in too many motif combinations, which either makes the problem intractable or lowers the quality of the outcome; more false positive hits in the screening will also result in the detection of more spurious CRMs. Just increasing the stringency of the screening seems not to be the best option, as many true sites, and thus also true CRMs, appear to be missing. Using a lower screening threshold in combination with a filtering procedure based on nucleosome occupancy provided in our case a good trade-off between keeping the number of false positives in a reasonable range and recovering true sites. We applied this filtering to sites of all TFs other than those of the assayed TF. By increasing the recovery rate of sites of the assayed TF, we maximize the chance of finding CRMs and instances of CRMs that contain the assayed TF, as those are the ones we are primarily interested in. Using a more stringent filtering for all other TFs will help reduce the spurious CRMs, but might come at the expense of not being able to detect some of the true interactions between the assayed TF and other TFs in TRANSFAC (which might explain why we could not recover all previously described benchmark CRMs). With more experimental data on condition-dependent nucleosome occupancy and other cell-specific features becoming available, filtering will become more reliable and will surely further improve the success of combinatorial CRM detection.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables S1–S3, Supplementary Figures S1–S2 and Supplementary References [62–112].

ACKNOWLEDGEMENTS

The authors thank Luc De Raedt for the invaluable discussions and his vision on using constraint programming for itemset mining, which enabled the development of the CPModule system. The authors also thank the anonymous reviewers for many useful comments.

FUNDING

KU Leuven Research Council (GOA/08/011, SymBioSys-CoE EF/05/007, NATAR C1895-PF/10/10); the agency for Innovation by Science and Technology (SBO-BioFrame); Interuniversity Attraction Poles (P6/25-BioMaGNet); Research Foundation - Flanders (IOK-B9725-G.0329.09); the Human Frontier Science Program (RGY0079/2007C); Ghent University (Multidisciplinary Research Partnership 'M2N'); Institute for the Promotion and Innovation through Science and Technology in Flanders (IWT-Vlaanderen); a postdoctoral grant from the Research Foundation - Flanders. Funding for open access charge: KU Leuven Research Council (SymBioSys, CoE EF/05/007).

Conflict of interest statement. None declared.

REFERENCES

1. Davidson,E.H. (2001) Genomic Regulatory Systems: Development and Evolution. Academic Press, San Diego.
2. Zhou,Q. and Wong,W.H. (2004) CisModule: de novo discovery of cis-regulatory modules by hierarchical mixture modeling. Proc. Natl Acad. Sci. USA, 101, 12114–12119.
3. Gupta,M. and Liu,J.S. (2005) De novo cis-regulatory module elicitation for eukaryotic genomes. Proc. Natl Acad. Sci. USA, 102, 7079–7084.
4. Van Loo,P. and Marynen,P. (2009) Computational methods for the detection of cis-regulatory modules. Brief Bioinform., 10, 509–524.
5. Klepper,K., Sandve,G.K., Abul,O., Johansen,J. and Drablos,F. (2008) Assessment of composite motif discovery methods. BMC Bioinformatics, 9, 123.
6. Noto,K. and Craven,M. (2006) A specialized learner for inferring structured cis-regulatory modules. BMC Bioinformatics, 7, 528.
7. Whitington,T., Frith,M.C., Johnson,J. and Bailey,T.L. (2011) Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res., 39, e98.
8. Cartharius,K., Frech,K., Grote,K., Klocke,B., Haltmeier,M., Klingenhoff,A., Frisch,M., Bayerlein,M. and Werner,T. (2005) MatInspector and beyond: promoter analysis based on transcription factor binding sites. Bioinformatics, 21, 2933–2942.
9. Dohr,S., Klingenhoff,A., Maier,H., Hrabe de Angelis,M., Werner,T. and Schneider,R. (2005) Linking disease-associated genes to regulatory networks via promoter organization. Nucleic Acids Res., 33, 864–872.
10. Calva,D., Dahdaleh,F.S., Woodfield,G., Weigel,R.J., Carr,J.C., Chinnathambi,S. and Howe,J.R. (2011) Discovery of SMAD4 promoters, transcription factor binding sites and deletions in juvenile polyposis patients. Nucleic Acids Res., 39, 5369–5378.
11. Kwon,A.T., Chou,A.Y., Arenillas,D.J. and Wasserman,W.W. (2011) Validation of skeletal muscle cis-regulatory module predictions reveals nucleotide composition bias in functional enhancers. PLoS Comp. Biol., 7, e1002256.
12. Su,J., Teichmann,S.A. and Down,T.A. (2010) Assessing computational methods of cis-regulatory module prediction. PLoS Comp. Biol., 6, e1001020.
13. Van Loo,P., Aerts,S., Thienpont,B., De Moor,B., Moreau,Y. and Marynen,P. (2008) ModuleMiner - improved computational detection of cis-regulatory modules: are there different modes of gene regulation in embryonic development and adult tissues? Genome Biol., 9, R66.
14. Sandve,G.K., Abul,O. and Drablos,F. (2008) Compo: composite motif discovery using discrete models. BMC Bioinformatics, 9, 527.
15. Frith,M.C., Li,M.C. and Weng,Z. (2003) Cluster-Buster: finding dense clusters of motifs in DNA sequences. Nucleic Acids Res., 31, 3666–3668.
16. Frith,M.C., Hansen,U. and Weng,Z. (2001) Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics, 17, 878–889.
17. Buck,M.J. and Lieb,J.D. (2004) ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83, 349–360.
18. Jothi,R., Cuddapah,S., Barski,A., Cui,K. and Zhao,K. (2008) Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res., 36, 5221–5231.
19. Liu,E.T., Pott,S. and Huss,M. (2010) Q&A: ChIP-seq technologies and the study of gene regulation. BMC Biol., 8, 56.
20. Pepke,S., Wold,B. and Mortazavi,A. (2009) Computation for ChIP-seq and RNA-seq studies. Nat. Methods, 6, S22–S32.
21. Li,X., MacArthur,S., Bourgon,R., Nix,D., Pollard,D., Iyer,V., Hechmer,A., Simirenko,L., Stapleton,M., Luengo Hendriks,C. et al. (2008) Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol., 6, e27.
22. Visel,A., Blow,M., Li,Z., Zhang,T., Akiyama,J., Holt,A., Plajzer-Frick,I., Shoukry,M., Wright,C., Chen,F. et al. (2009) ChIP-seq accurately predicts tissue-specific activity of enhancers. Nature, 457, 854–858.
23. van der Meer,D.L., Degenhardt,T., Vaisanen,S., de Groot,P.J., Heinaniemi,M., de Vries,S.C., Muller,M., Carlberg,C. and Kersten,S. (2010) Profiling of promoter occupancy by PPARalpha in human hepatoma cells via ChIP-chip analysis. Nucleic Acids Res., 38, 2839–2850.
24. De Raedt,L., Guns,T. and Nijssen,S. (2008) Constraint programming for itemset mining. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, pp. 204–212.
25. Chen,X., Xu,H., Yuan,P., Fang,F., Huss,M., Vega,V.B., Wong,E., Orlov,Y.L., Zhang,W., Jiang,J. et al. (2008) Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell, 133, 1106–1117.
26. Matys,V., Kel-Margoulis,O.V., Fricke,E., Liebich,I., Land,S., Barre-Dirrie,A., Reuter,I., Chekmenev,D., Krull,M., Hornischer,K. et al. (2006) TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res., 34, D108–D110.
27. Coessens,B., Thijs,G., Aerts,S., Marchal,K., De Smet,F., Engelen,K., Glenisson,P., Moreau,Y., Mathys,J. and De Moor,B. (2003) INCLUSive: a web portal and service registry for microarray and regulatory sequence analysis. Nucleic Acids Res., 31, 3468–3470.
28. Frith,M.C., Fu,Y., Yu,L., Chen,J.F., Hansen,U. and Weng,Z. (2004) Detection of functional DNA motifs via statistical over-representation. Nucleic Acids Res., 32, 1372–1381.
29. Xi,L., Fondufe-Mittendorf,Y., Xia,L., Flatow,J., Widom,J. and Wang,J.-P. (2010) Predicting nucleosome positioning using a duration Hidden Markov Model. BMC Bioinformatics, 11, 346.
30. Ramsey,S.A., Knijnenburg,T.A., Kennedy,K.A., Zak,D.E., Gilchrist,M., Gold,E.S., Johnson,C.D., Lampano,A.E., Litvak,V., Navarro,G. et al. (2010) Genome-wide histone acetylation data improve prediction of mammalian transcription factor binding sites. Bioinformatics, 26, 2071–2075.
31. Schulte,C. and Stuckey,P.J. (2008) Efficient constraint propagation engines. ACM T Progr. Lang. Sys., 31, 43.
32. Guns,T., Sun,H., Nijssen,S., Marchal,K. and De Raedt,L. (2010) Cis-regulatory module detection using constraint programming. Proceedings of IEEE International Conference on Bioinformatics and Biomedicine. IEEE Computer Society, Washington, DC, USA, pp. 363–368.
33. Gallo,A., De Bie,T. and Cristianini,N. (2007) MINI: mining informative nonredundant itemset. Proceedings of the 11th Conference on Principles and Practice of Knowledge Discovery in Databases. Springer, Berlin, pp. 438–445.
34. Fujita,P.A., Rhead,B., Zweig,A.S., Hinrichs,A.S., Karolchik,D., Cline,M.S., Goldman,M., Barber,G.P., Clawson,H., Coelho,A. et al. (2011) The UCSC Genome Browser database: update 2011. Nucleic Acids Res., 39, D876–D882.
35. Celniker,S.E., Dillon,L.A., Gerstein,M.B., Gunsalus,K.C., Henikoff,S., Karpen,G.H., Kellis,M., Lai,E.C., Lieb,J.D., MacAlpine,D.M. et al. (2009) Unlocking the secrets of the genome. Nature, 459, 927–930.
36. Xie,D., Cai,J., Chia,N.Y., Ng,H.H. and Zhong,S. (2008) Cross-species de novo identification of cis-regulatory modules with GibbsModule: application to gene regulation in embryonic stem cells. Genome Res., 18, 1325–1335.
37. Aerts,S., Van Loo,P., Moreau,Y. and De Moor,B. (2004) A genetic algorithm for the detection of new cis-regulatory modules in sets of coregulated genes. Bioinformatics, 20, 1974–1976.
38. Barrett,T., Troup,D.B., Wilhite,S.E., Ledoux,P., Rudnev,D., Evangelista,C., Kim,I.F., Soboleva,A., Tomashevsky,M., Marshall,K.A. et al. (2009) NCBI GEO: archive for high-throughput functional genomic data. Nucleic Acids Res., 37, D885–890.
39. Whitington,T., Perkins,A.C. and Bailey,T.L. (2009) High-throughput chromatin information enables accurate tissue-specific prediction of transcription factor binding sites. Nucleic Acids Res., 37, 14–25.
40. Jiang,J., Chan,Y.S., Loh,Y.H., Cai,J., Tong,G.Q., Lim,C.A., Robson,P., Zhong,S. and Ng,H.H. (2008) A core Klf circuitry regulates self-renewal of embryonic stem cells. Nat. Cell Biol., 10, 353–360.
41. Aerts,S., Van Loo,P., Thijs,G., Moreau,Y. and De Moor,B. (2003) Computational detection of cis-regulatory modules. Bioinformatics, 19(Suppl. 2), ii5–ii14.
42. Wilbanks,E.G. and Facciotti,M.T. (2010) Evaluation of algorithm performance in ChIP-seq peak detection. PLoS One, 5, e11471.
43. Zhang,Y., Liu,T., Meyer,C.A., Eeckhoute,J., Johnson,D.S., Bernstein,B.E., Nusbaum,C., Myers,R.M., Brown,M., Li,W. et al. (2008) Model-based analysis of ChIP-Seq (MACS). Genome Biol., 9, R137.
44. Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839.
45. Hu,J., Li,B. and Kihara,D. (2005) Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res., 33, 4899–4913.
46. Lee,C.K., Shibata,Y., Rao,B., Strahl,B.D. and Lieb,J.D. (2004) Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat. Genet., 36, 900–905.
47. Goller,T., Vauti,F., Ramasamy,S. and Arnold,H.H. (2008) Transcriptional regulator BPTF/FAC1 is essential for trophoblast differentiation during early mouse development. Mol. Cell Biol., 28, 6819–6827.
48. Macintyre,G., Bailey,J., Haviv,I. and Kowalczyk,A. (2010) is-rSNP: a novel technique for in silico regulatory SNP detection. Bioinformatics, 26, i524–i530.
49. Xie,Z., Tan,G., Ding,M., Dong,D., Chen,T., Meng,X., Huang,X. and Tan,Y. (2010) Foxm1 transcription factor is required for maintenance of pluripotency of P19 embryonal carcinoma cells. Nucleic Acids Res., 38, 8027–8038.
50. Wang,I.C., Chen,Y.J., Hughes,D., Petrovic,V., Major,M.L., Park,H.J., Tan,Y., Ackerson,T. and Costa,R.H. (2005) Forkhead box M1 regulates the transcriptional network of genes essential for mitotic progression and genes encoding the SCF (Skp2-Cks1) ubiquitin ligase. Mol. Cell Biol., 25, 10875–10894.
51. Beland,M., Pilon,N., Houle,M., Oh,K., Sylvestre,J.R., Prinos,P. and Lohnes,D. (2004) Cdx1 autoregulation is governed by a novel Cdx1-LEF1 transcription complex. Mol. Cell Biol., 24, 5028–5038.
52. Shafee,N., Smith,C.R., Wei,S., Kim,Y., Mills,G.B., Hortobagyi,G.N., Stanbridge,E.J. and Lee,E.Y. (2008) Cancer stem cells contribute to cisplatin resistance in Brca1/p53-mediated mouse mammary tumors. Cancer Res., 68, 3243–3250.
53. Farmer,H., McCabe,N., Lord,C.J., Tutt,A.N., Johnson,D.A., Richardson,T.B., Santarosa,M., Dillon,K.J., Hickson,I., Knights,C. et al. (2005) Targeting the DNA repair defect in BRCA mutant cells as a therapeutic strategy. Nature, 434, 917–921.
54. Mace,K.A., Restivo,T.E., Rinn,J.L., Paquet,A.C., Chang,H.Y., Young,D.M. and Boudreau,N.J. (2009) HOXA3 modulates injury-induced mobilization and recruitment of bone marrow-derived cells. Stem Cells, 27, 1654–1665.
55. Grepin,C., Robitaille,L., Antakly,T. and Nemer,M. (1995) Inhibition of transcription factor GATA-4 expression blocks in vitro cardiac muscle differentiation. Mol. Cell Biol., 15, 4095–4102.
56. Lien,C.L., Wu,C., Mercer,B., Webb,R., Richardson,J.A. and Olson,E.N. (1999) Control of early cardiac-specific transcription of Nkx2-5 by a GATA-dependent enhancer. Development, 126, 75–84.
57. Pikkarainen,S., Tokola,H., Kerkela,R. and Ruskoaho,H. (2004) GATA transcription factors in the developing and adult heart. Cardiovasc. Res., 63, 196–207.
58. Holtzinger,A., Rosenfeld,G.E. and Evans,T. (2010) Gata4 directs development of cardiac-inducing endoderm from ES cells. Dev. Biol., 337, 63–73.
59. Ravasi,T., Suzuki,H., Cannistraci,C.V., Katayama,S., Bajic,V.B., Tan,K., Akalin,A., Schmeier,S., Kanamori-Katayama,M., Bertin,N. et al. (2010) An atlas of combinatorial transcriptional regulation in mouse and man. Cell, 140, 744–752.
60. Levy,D.E. and Darnell,J.E. Jr (2002) Stats: transcriptional control and biological impact. Nat. Rev. Mol. Cell Biol., 3, 651–662.
61. John,S., Reeves,R.B., Lin,J.X., Child,R., Leiden,J.M., Thompson,C.B. and Leonard,W.J. (1995) Regulation of cell-type-specific interleukin-2 receptor alpha-chain gene expression: potential role of physical interactions between Elf-1, HMG-I(Y), and NF-kappa B family proteins. Mol. Cell Biol., 15, 1786–1796.
62. Farrar,J.D., Smith,J.D., Murphy,T.L. and Murphy,K.M. (2000) Recruitment of Stat4 to the human interferon-alpha/beta receptor requires activated Stat2. J. Biol. Chem., 275, 2693–2697.
63. Wang,C., Zhou,G.L., Vedantam,S., Li,P. and Field,J. (2008) Mitochondrial shuttling of CAP1 promotes actin- and cofilin-dependent apoptosis. J. Cell Sci., 121, 2913–2920.
64. Kelley,C.M., Ikeda,T., Koipally,J., Avitahl,N., Wu,L., Georgopoulos,K. and Morgan,B.A. (1998) Helios, a novel dimerization partner of Ikaros expressed in the earliest hematopoietic progenitors. Curr. Biol., 8, 508–515.
65. Battista,S., Pentimalli,F., Baldassarre,G., Fedele,M., Fidanza,V., Croce,C.M. and Fusco,A. (2003) Loss of Hmga1 gene function affects embryonic stem cell lympho-hematopoietic differentiation. FASEB J., 17, 1496–1498.
66. Choi,H.J., Geng,Y., Cho,H., Li,S., Giri,P.K., Felio,K. and Wang,C.R. (2010) Differential requirements for the Ets transcription factor Elf-1 in the development of NKT cells and NK cells. Blood, 117, 1880–1887.
67. Tang,Y., Katuri,V., Dillner,A., Mishra,B., Deng,C.X. and Mishra,L. (2003) Disruption of transforming growth factor-beta signaling in ELF beta-spectrin-deficient mice. Science, 299, 574–577.
68. Beck,F. and Stringer,E.J. (2010) The role of Cdx genes in the gut and in axial development. Biochem. Soc. Trans., 38, 353–357.
69. Park,M.J., Kim,H.Y., Kim,K. and Cheong,J. (2009) Homeodomain transcription factor CDX1 is required for the transcriptional induction of PPARgamma in intestinal cell differentiation. FEBS Lett., 583, 29–35.
70. Holdcraft,R.W. and Braun,R.E. (2004) Androgen receptor function is required in Sertoli cells for the terminal differentiation of haploid spermatids. Development, 131, 459–467.
71. Merrill,B.J., Gat,U., DasGupta,R. and Fuchs,E. (2001) Tcf3 and Lef1 regulate lineage differentiation of multipotent stem cells in skin. Genes Dev., 15, 1688–1705.
72. Galceran,J., Farinas,I., Depew,M.J., Clevers,H. and Grosschedl,R. (1999) Wnt3a-/--like phenotype and limb deficiency in Lef1(-/-)Tcf1(-/-) mice. Genes Dev., 13, 709–717.
73. Bouchard,M., Souabni,A., Mandler,M., Neubuser,A. and Busslinger,M. (2002) Nephric lineage specification by Pax2 and Pax8. Genes Dev., 16, 2958–2970.
74. Torres,M., Gomez-Pardo,E., Dressler,G.R. and Gruss,P. (1995) Pax-2 controls multiple steps of urogenital development. Development, 121, 4057–4065.
75. Kashimada,K. and Koopman,P. (2010) Sry: the master switch in mammalian sex determination. Development, 137, 3921–3930.
76. Sun,L., Ma,K., Wang,H., Xiao,F., Gao,Y., Zhang,W., Wang,K., Gao,X., Ip,N. and Wu,Z. (2007) JAK1-STAT1-STAT3, a key pathway promoting proliferation and preventing premature differentiation of myoblasts. J. Cell Biol., 179, 129–138.
77. Kang,J., DiBenedetto,B., Narayan,K., Zhao,H., Der,S.D. and Chambers,C.A. (2004) STAT5 is required for thymopoiesis in a development stage-specific manner. J. Immunol., 173, 2307–2314.
78. Snow,J.W., Abraham,N., Ma,M.C., Abbey,N.W., Herndier,B. and Goldsmith,M.A. (2002) STAT5 promotes multilineage hematolymphoid development in vivo through effects on early hematopoietic progenitor cells. Blood, 99, 95–101.
79. Wurster,A.L., Tanaka,T. and Grusby,M.J. (2000) The biology of Stat4 and Stat6. Oncogene, 19, 2577–2584.
80. Barak,O., Lazzaro,M.A., Lane,W.S., Speicher,D.W., Picketts,D.J. and Shiekhattar,R. (2003) Isolation of human NURF: a regulator of engrailed gene expression. EMBO J., 22, 6089–6100.
81. Jacks,T. (1996) Tumor suppressor gene mutations in mice. Annu. Rev. Genet., 30, 603–636.
82. Begay,V., Smink,J. and Leutz,A. (2004) Essential requirement of CCAAT/enhancer binding proteins in embryogenesis. Mol. Cell Biol., 24, 9744–9751.
83. Niedernhofer,L.J., Essers,J., Weeda,G., Beverloo,B., de Wit,J., Muijtjens,M., Odijk,H., Hoeijmakers,J.H. and Kanaar,R. (2001) The structure-specific endonuclease Ercc1-Xpf is required for targeted gene replacement in embryonic stem cells. EMBO J., 20, 6540–6549.
84. Wan,H., Dingle,S., Xu,Y., Besnard,V., Kaestner,K.H., Ang,S.L., Wert,S., Stahlman,M.T. and Whitsett,J.A. (2005) Compensatory roles of Foxa1 and Foxa2 during lung morphogenesis. J. Biol. Chem., 280, 13809–13816.
85. Tompers,D.M., Foreman,R.K., Wang,Q., Kumanova,M. and Labosky,P.A. (2005) Foxd3 is required in the trophoblast progenitor cell lineage of the mouse embryo. Dev. Biol., 285, 126–137.
86. Ohyama,T. and Groves,A.K. (2004) Expression of mouse Foxi class genes in early craniofacial development. Dev. Dyn., 231, 640–646.
87. Granadino,B., Arias-de-la-Fuente,C., Perez-Sanchez,C., Parraga,M., Lopez-Fernandez,L.A., del Mazo,J. and Rey-Campos,J. (2000) Fhx (Foxj2) expression is activated during spermatogenesis and very early in embryonic development. Mech. Dev., 97, 157–160.
88. Fontenot,J.D., Gavin,M.A. and Rudensky,A.Y. (2003) Foxp3 programs the development and function of CD4+CD25+ regulatory T cells. Nat. Immunol., 4, 330–336.
89. Tsai,F.Y., Browne,C.P. and Orkin,S.H. (1998) Knock-in mutation of transcription factor GATA-3 into the GATA-1 locus: partial rescue of GATA-1 loss of function in erythroid cells. Dev. Biol., 196, 218–227.
90. Stecca,B., Nait-Oumesmar,B., Kelley,K.A., Voss,A.K., Thomas,T. and Lazzarini,R.A. (2002) Gcm1 expression defines three stages of chorio-allantoic interaction during placental development. Mech. Dev., 115, 27–34.
91. Fijalkowska,I., Sharma,D., Bult,C.J. and Danoff,S.K. (2010) Expression of the transcription factor, TFII-I, during post-implantation mouse embryonic development. BMC Res. Notes, 3, 203.
92. Kameda,Y., Nishimaki,T., Takeichi,M. and Chisaka,O. (2002) Homeobox gene hoxa3 is essential for the formation of the carotid body in the mouse embryos. Dev. Biol., 247, 197–209.
93. Fournier,M., Lebert-Ghali,C.E., Krosl,G. and Bijl,J.J. (2011) HOXA4 induces expansion of hematopoietic stem cells in vitro and confers enhancement of pro-B-cells in vivo. Stem Cells Dev, 21, 133–142.
94. Kim,J.I., Li,T., Ho,I.C., Grusby,M.J. and Glimcher,L.H. (1999) Requirement for the c-Maf transcription factor in crystallin gene regulation and lens development. Proc. Natl Acad. Sci. USA, 96, 3781–3785.
95. Han,J., Ishii,M., Bringas,P. Jr, Maas,R.L., Maxson,R.E. Jr and Chai,Y. (2007) Concerted action of Msx1 and Msx2 in regulating cranial neural crest cell differentiation during frontal bone development. Mech. Dev., 124, 729–745.
96. Chang,C.P., Neilson,J.R., Bayle,J.H., Gestwicki,J.E., Kuo,A., Stankunas,K., Graef,I.A. and Crabtree,G.R. (2004) A field of myocardial-endocardial NFAT signaling underlies heart valve morphogenesis. Cell, 118, 649–663.
97. Kimura,S., Hara,Y., Pineau,T., Fernandez-Salguero,P., Fox,C.H., Ward,J.M. and Gonzalez,F.J. (1996) The T/ebp null mouse: thyroid-specific enhancer-binding protein is essential for the organogenesis of the thyroid, lung, ventral forebrain, and pituitary. Genes Dev., 10, 60–69.
98. Henseleit,K.D., Nelson,S.B., Kuhlbrodt,K., Hennings,J.C., Ericson,J. and Sander,M. (2005) NKX6 transcription factor activity is required for alpha- and beta-cell development in the pancreas. Development, 132, 3139–3149.
99. Wang,J., Elghazi,L., Parker,S.E., Kizilocak,H., Asano,M., Sussel,L. and Sosa-Pineda,B. (2004) The concerted activities of Pax4 and Nkx2.2 are essential to initiate pancreatic beta-cell differentiation. Dev. Biol., 266, 178–189.
100. Mansouri,A., Chowdhury,K. and Gruss,P. (1998) Follicular cells of the thyroid gland require Pax8 gene function. Nat. Genet., 19, 87–90.
101. Selleri,L., Depew,M.J., Jacobs,Y., Chanda,S.K., Tsang,K.Y., Cheah,K.S., Rubenstein,J.L., O'Gorman,S. and Cleary,M.L. (2001) Requirement for Pbx1 in skeletal patterning and programming chondrocyte proliferation and differentiation. Development, 128, 3543–3557.
102. Shyamala,G., Yang,X., Cardiff,R.D. and Dale,E. (2000) Impact of progesterone receptor on cell-fate decisions during mammary gland development. Proc. Natl Acad. Sci. USA, 97, 3044–3049.
103. Sebastiano,V., Dalvai,M., Gentile,L., Schubart,K., Sutter,J., Wu,G.M., Tapia,N., Esch,D., Ju,J.Y., Hubner,K. et al. (2010) Oct1 regulates trophoblast development during early mouse embryogenesis. Development, 137, 3551–3560.
104. Ryu,E.J., Wang,J.Y., Le,N., Baloh,R.H., Gustin,J.A., Schmidt,R.E. and Milbrandt,J. (2007) Misexpression of Pou3f1 results in peripheral nerve hypomyelination and axonal loss. J. Neurosci., 27, 11552–11559.
105. Aberg,T., Cavender,A., Gaikwad,J.S., Bronckers,A.L., Wang,X., Waltimo-Siren,J., Thesleff,I. and D'Souza,R.N. (2004) Phenotypic changes in dentition of Runx2 homozygote-null mutant mice. J. Histochem. Cytochem., 52, 131–139.
106. Tremblay,K.D., Dunn,N.R. and Robertson,E.J. (2001) Mouse embryos lacking Smad1 signals display defects in extra-embryonic tissues and germ cell formation. Development, 128, 3609–3621.
107. James,D., Levine,A.J., Besser,D. and Hemmati-Brivanlou,A. (2005) TGFbeta/activin/nodal signaling is necessary for the maintenance of pluripotency in human embryonic stem cells. Development, 132, 1273–1282.
108. Wontakal,S.N., Guo,X., Will,B., Shi,M., Raha,D., Mahajan,M.C., Weissman,S., Snyder,M., Steidl,U., Zheng,D. et al. (2011) A large gene network in immature erythroid cells is controlled by the myeloid and B cell transcriptional regulator PU.1. PLoS Genet., 7, e1001392.
109. Korinek,V., Barker,N., Moerer,P., van Donselaar,E., Huls,G., Peters,P.J. and Clevers,H. (1998) Depletion of epithelial stem-cell compartments in the small intestine of mice lacking Tcf-4. Nat. Genet., 19, 379–383.
110. Inukai,T., Inaba,T., Dang,J., Kuribara,R., Ozawa,K., Miyajima,A., Wu,W., Look,A.T., Arinobu,Y., Iwasaki,H. et al. (2005) TEF, an antiapoptotic bZIP transcription factor related to the oncogenic E2A-HLF chimera, inhibits cell growth by down-regulating expression of the common beta chain of cytokine receptors. Blood, 105, 4437–4444.
111. Wallis,K., Sjogren,M., van Hogerlinden,M., Silberberg,G., Fisahn,A., Nordstrom,K., Larsson,L., Westerblad,H., Morreale de Escobar,G., Shupliakov,O. et al. (2008) Locomotor deficiencies and aberrant development of subtype-specific GABAergic interneurons caused by an unliganded thyroid hormone receptor alpha1. J. Neurosci., 28, 1904–1915.
112. Affar el,B., Gay,F., Shi,Y., Liu,H., Huarte,M., Wu,S., Collins,T. and Li,E. (2006) Essential dosage-dependent functions of the transcription factor yin yang 1 in late embryonic development and cell cycle progression. Mol. Cell Biol., 26, 3565–3581.

Journal

Nucleic Acids Research – Oxford University Press

Published: Jul 15, 2012
