A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences

Wei Li; Clifford A. Meyer; X. Shirley Liu

doi:10.1093/bioinformatics/bti1046

BIOINFORMATICS, Vol. 21 Suppl. 1 2005, pages i274–i282 (2005-06-01)

Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, Boston, MA 02115, USA

Received on January 15, 2005; accepted on March 27, 2005

To whom correspondence should be addressed.

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

ABSTRACT

Motivation: Transcription factors (TFs) regulate gene expression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can occur far away from the regulated genes. Recently, Affymetrix developed the high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35 bp resolution. This new array platform allows for the unbiased mapping of in vivo TF binding sequences (TFBSs) using Chromatin ImmunoPrecipitation followed by microarray experiments (ChIP-chip). The massive datasets generated from these experiments pose great challenges for data analysis.

Results: We developed a fast, scalable and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a hidden Markov model (HMM). When applied to the data of p53 ChIP-chip experiments from an earlier study, our method discovered many new high confidence p53 targets including all the regions verified by quantitative PCR. Using a de novo motif finding algorithm MDscan, we also recovered the p53 motif from our HMM identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions compared with both genomic background and the TFBSs identified earlier. Several of the newly identified p53 TFBSs are in the promoter region of known genes or associated with previously characterized p53-responsive genes.

Contact: xsliu@jimmy.harvard.edu

Supplementary information: Available at the following URL http://genome.dfci.harvard.edu/~xsliu/HMMTiling/index.html

1 INTRODUCTION

One of the most important characteristics of gene regulation is the interaction between transcription factors (TFs) and cis-regulatory elements. Although microarrays have been widely used to understand gene-expression regulation, they do not provide information on which TFs directly regulate which genes and their interaction mechanism. In recent years, ChIP-chip has become a popular technique for studying the genome-wide location of in vivo TF–DNA interactions. ChIP-chip was first successfully adopted in yeast to identify the regulatory targets of individual TFs (Ren et al., 2000; Lieb et al., 2001) and to study the entire transcriptional regulatory networks (Harbison et al., 2004). Promoter arrays were used in the yeast ChIP-chip experiments, with a cDNA probe for each intergenic sequence. Since higher eukaryotes, especially human, have long intergenic sequences, the promoter arrays for higher eukaryotes usually contain probes just for the proximal promoters of annotated genes (Li et al., 2003; Odom et al., 2004). Unfortunately, TFBSs in higher eukaryotes can occur upstream, downstream, close to or far away from the regulated genes, or even in the introns of the genes. Thus, proximal promoter arrays may not accurately capture all the ChIP-enriched DNA. Although CpG island (Wells et al., 2003) and continuous genomic PCR fragment (Euskirchen et al., 2004) arrays have been used to address this problem, their probe densities and genome coverage are still not satisfactory.

Recently, Affymetrix developed the high-density oligonucleotide arrays that tile all non-repetitive sequences of the human genome. Tiling arrays for chromosomes 21 and 22 (Kapranov et al., 2002) are currently available and the whole genome tiling arrays are under development. These arrays have one probe pair, a perfect match (PM) probe and a mismatch (MM) probe both 25 bases long, for every non-overlapping 35 bp region in the genome. They provide the platform for the unbiased mapping of in vivo TF binding sequences (TFBSs) using ChIP-chip, but the massive amount of data (∼1 million probe pairs for chromosomes 21 and 22) generated from them poses great challenges for data analysis.

Cawley et al. (2004) conducted p53 ChIP-chip experiments on chromosomes 21 and 22 tiling arrays. Six replicate experiments were performed for each of two p53 antibodies: FL (full length) and DO1 (N-terminal epitope), as well as two control experiments: Input (genomic input DNA without ChIP) and GST (antibody to bacterial GST). The GTRANS software provided by Affymetrix (http://www.affymetrix.com/support/developer/downloads/TilingArrayTools/index.affx) was used to predict p53 binding sequences for each antibody against each control (FL–Input, FL–GST, DO1–Input and DO1–GST). GTRANS requires at least three replicate TF ChIPs and three replicate controls to find the ChIP-enriched regions. To check whether a probe x is ChIP-enriched, all probes within 500 bp from x in all ChIP and control experiments are considered.
GTRANS uses the non-parametric Mann–Whitney U-test (equivalent to the Wilcoxon rank sum test) to rank all the probe pairs by their log2(max(PM − MM, 1)) values, and checks whether the sum of ranks of all probe pairs in the ChIPs is significantly higher than that in the controls. The two datasets with the same antibody were merged together to form a non-redundant set (48 FL sites reported in the paper and 103 DO1 sites downloadable from the Supplementary website). A total of 17 and 0% of the identified p53 binding sequences were located within 1 kb of CpG islands and 5′ exons, respectively, indicating that only a small fraction of p53 sites would have been discovered by using CpG island arrays or proximal promoter arrays. When trying to find putative TF binding motifs from the 48 p53-FL sites, Cawley et al. (2004) failed to identify the p53 binding motif.

Although the Mann–Whitney U-test assumes no specific probability distribution of the data, it cannot identify enriched regions with small P-values without a long enough window and enough replicates. We developed a new algorithm and applied it to the same p53-FL dataset. We first normalized the data across all datasets published in Cawley et al. (2004), removed all the repetitive probe measurements and estimated the behavior of each probe. Then a two-hidden-state (ChIP-enriched state and non-enriched state) hidden Markov model (HMM) (Rabiner, 1989) was implemented to estimate the probability of ChIP enrichment at each probe location. We identified many potential p53 binding sequences, including all the quantitative PCR verified targets. From the high confidence regions, a p53 binding motif was successfully recovered using MDscan (Liu et al., 2002). Furthermore, we found substantial p53 motif enrichment in our HMM identified regions compared with both genomic background and TFBSs identified by Cawley et al. (2004). The entire analysis is summarized in Figure 1.

Fig. 1. Strategy diagram for analyzing ChIP-chip experiments on genome tiling arrays.

2 METHODS

2.1 Dataset

Cawley et al. (2004) conducted ChIP-chip experiments for four human TFs in two cell lines (p53-FL and p53-DO1 in HCT1116, cMyc and Sp1 in Jurkat) as well as two control experiments (input and GST) in each cell line. Each experiment contains two biological replicates and three technical replicates, generating a total of 48 datasets (samples). Each sample was hybridized to the Affymetrix chromosomes 21 and 22 tiling arrays (three arrays, A, B and C). All 48 datasets were downloaded from http://transcriptome.affymetrix.com/publication/tfbs/. The (PM−MM) value was recorded for each probe pair as a new probe value. To be able to compare probe values between samples (Workman et al., 2002), we conducted quantile normalization (Bolstad et al., 2003) on each array (A, B and C) separately to make the probe value distribution the same across all samples. We then mapped the probe values from the three arrays to the April 2003 version of the human genome according to the Affymetrix NCBIv33 GTRANS Library.

2.2 Repetitive probe measurements

Owing to the array coverage and genome sequence similarity, occasionally N probes with the same 25mer sequence might be spotted at different locations on the chips. At the same time, a 25mer might map to M different genomic locations. Therefore, a 25mer sequence might have up to N × M repetitive probe measurements (Fig. 2). Since ChIP-enriched regions are normally <1 kb, short-range repetitive probe measurements (<1 kb of each other) with high values will increase the chance of the region being falsely predicted as ChIP-enriched, and thus should be removed. In contrast, long-range repetitive probe measurements (>1 kb of each other) should be kept, for they may represent valid targets (although the cross-hybridization issue should be addressed in future study). We decided to employ the following two-stage procedure to filter out short-range repetitive probe measurements prior to downstream analysis:

(1) Ensure only one measurement per probe within 1 kb of the genome. For each probe mapping to M multiple genomic locations, from the beginning of each chromosome, we deleted the latter occurrence if the genomic distance between two consecutive occurrences is <1 kb. This process was iterated until all the probe occurrences had been considered.

(2) Ensure only one probe measurement for each genomic position. We have three (A, B and C) different kinds of tiling arrays in our current datasets. The N multiple probes on the same genomic position might be from different arrays. Since we quantile normalized each array separately, there could be a huge array bias if we take the average of all the duplicated probe measurements. Therefore, we randomly chose one probe and discarded the others.

Fig. 2. Repetitive probe measurements. Probes P1 and P3 (N = 2) have the same 25mer sequences, and each maps to four genomic locations (M = 4), thus generating eight (N × M) repetitive probe measurements. Repetitive probe measurements are considered short-range if they are within 1 kb (within Regions I and II) or long-range otherwise (between Regions I and II). In the two-stage procedure to filter out short-range duplications, four measurements (2, 3, 5, 6; indicated as *) are filtered out at Stage I, and two measurements (4, 7; indicated as +) are removed at Stage II.

2.3 Probe behavior estimation

ChIP-chip experiments enrich only a small portion of DNA bound by the TF of interest, and control experiments capture only baseline non-specific binding. Therefore, most of the probe measurements across all datasets are not actually hybridized with ChIP-enriched DNA and could be used to estimate the baseline probe behavior. We randomly selected 5000 probes from the ∼1 million probe dataset and conducted the Shapiro–Wilk normality test on the behavior (PM−MM) of each probe across the 48 datasets. The majority (∼80%) of these probes have a Shapiro–Wilk normality P-value >0.01, so the null hypothesis that the distribution of probe behavior across the 48 datasets is normal is not rejected. The remaining 20% of probes that are not normal could be caused by ChIP or non-specific enrichment in some of the 48 datasets. We estimated the baseline probe behavior using 36 of the 48 datasets, including all of the control datasets and 99.5% of probes from the non-p53 ChIP experiments (excluding the top 0.5% percentile, ∼5000 probes). The 12 p53 ChIP datasets (FL and DO1) were excluded from the parameter estimation because their ChIP enrichment signal may skew the baseline estimate. The behavior of each probe i is modeled as a normal distribution $N(\mu_i, \sigma_i^2)$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of probe i values of the 36 tiling array datasets.

2.4 Hidden Markov model

We designed a two-hidden-state (ChIP-enriched state and non-enriched state) HMM to estimate the probability of enrichment at each probe location i. Given J potential binding sites (assumed to be ∼300 along chromosomes 21 and 22 for p53) along chromosomes covered by K total probes, the HMM is characterized by the following:

(1) Initial probabilities: J/K for ChIP-enriched state, 1 − J/K for non-enriched state.

(2) Transition probabilities: J/K for transition to a different state, 1 − J/K for staying in the same state.

(3) Emission probability distribution of probe i in a single dataset: $N(\mu_i + 2\sigma_i, (1.5\sigma_i)^2)$ for ChIP-enriched state, $N(\mu_i, \sigma_i^2)$ for non-enriched state. The parameters are based on the results on the Affymetrix SNP arrays (Lieberfarb et al., 2003).

(4) A probe i, with (PM−MM) value $p_i$, is defined as an outlier if its Z-value is >3 or <−2.5. We reassigned the Z-value of each outlier probe as 3 if Z > 3 and −2.5 if Z < −2.5.

(5) If two adjacent probes are farther apart than 500 bp in the genome (usually due to a long repeat sequence between the two probes), in the forward and backward procedure, the enriched and non-enriched state probabilities of the latter probe are reset to the initial probabilities.

To combine the results from different replicates, in either the ChIP or control group, the emission probabilities of all available replicates for each probe were averaged as the emission probability on this probe. The forward and backward algorithms of HMM (Rabiner, 1989) were used to calculate the probabilities of a probe being in ChIP-enriched versus non-enriched states. A log-odds enrichment value was given to each probe representing the ratio of being in the ChIP-enriched state against the non-enriched state. The natural cutoff for the ChIP-enriched state is 0, which indicates equal ChIP-enriched and non-enriched probabilities.

The ChIP-enriched region is defined as at least two probes with log-odds enrichment value >0 in ChIP and at least one probe with log-odds enrichment value <−15 in the control. The average enrichment value of all the probes in a ChIP-enriched region is used as a summary enrichment score for the entire region.

2.5 Sequence retrieval and repeat masking

Genomic sequences of the HMM identified ChIP-enriched regions were retrieved from the repeat-masked April 2003 version of the human genome at UCSC Genome Browser (http://genome.ucsc.edu/). We further masked out all the tandem repeats identified by Tandem repeats finder (Benson, 1999) to facilitate downstream de novo motif finding.
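
The masking step just described can be illustrated with a minimal sketch. It rests on assumptions that the text does not state: that the retrieved sequence is soft-masked (repeats in lower case), that tandem repeat calls are available as 0-based, half-open (start, end) intervals relative to the sequence, and that masked bases are written as N. The function name mask_repeats is hypothetical and is not part of any tool named above.

```python
def mask_repeats(seq, tandem_intervals, mask_char="N"):
    """Hard-mask a sequence for motif finding (illustrative sketch only).

    seq              -- DNA string; lower-case bases are assumed to be
                        soft-masked interspersed repeats (an assumption)
    tandem_intervals -- iterable of (start, end) pairs, 0-based and
                        half-open, marking tandem repeats (an assumption)
    """
    # Step 1: turn every soft-masked (lower-case) base into the mask character.
    masked = [mask_char if base.islower() else base for base in seq]

    # Step 2: additionally mask each tandem repeat interval,
    # clamping the coordinates to the sequence boundaries.
    for start, end in tandem_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = mask_char

    return "".join(masked)


example = mask_repeats("ACGTacgtACGTACGT", [(10, 14)])
print(example)  # ACGTNNNNACNNNNGT
```

Only the unmasked upper-case bases that remain would be passed on to motif finding; counting them gives the number of non-repeat nucleotides in a region.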
The resulting sequences are defined as fully-repeat-masked sequences.

2.6 Motif finding and enrichment scoring

We ranked all the fully-repeat-masked ChIP-enriched sequences by the summary enrichment score and applied MDscan (Liu et al., 2002) to find the putative p53 binding motif. For a motif of width w, MDscan first enumerates each wmer in the highest ranking sequences, and collects other wmers similar to it in these sequences to construct a candidate motif, represented as a probability matrix. A semi-Bayesian scoring function is used to remove low scoring candidate motifs and refine the rest by checking all wmers in all the ChIP-enriched sequences. If several final motifs share the same consensus, only the motif with the highest score is kept (Conlon et al., 2003).

We determined how well a given sequence segment of width w matched a motif by a function $S = \max(S_{+}, S_{-})$, where $S_{+}$ and $S_{-}$ are the following scoring formula evaluated on the sequence itself and its reverse complement, respectively:

$$S_{\pm} = \sum_{i=1}^{w} \sum_{j \in \{A,C,G,T\}} \delta_{ij} \ln \frac{p_{ij} + p_s}{b_j}$$

where $p_{ij}$ is the frequency of nucleotide j at position i in the motif, $p_s$ is a pseudocount of 0.03, $b_j$ is the background probability of nucleotide j calculated from the intergenic regions of the human genome, $\delta_{ij} = 1$ if nucleotide j is present at position i and $\delta_{ij} = 0$ if nucleotide j is not present at position i.

Since p53 is known to bind two palindromic 10mers separated by a variable spacer of length 0–13 bp (el-Deiry et al., 1992), we computed the p53 matches by summing the scores for the two matching 10mers and allowed the spacer to extend up to 30 bp.

3 RESULTS

3.1 Identification of ChIP-enriched sequences

We applied our analysis strategy (Fig. 1) on the data of p53-FL ChIP-chip experiments on chromosomes 21 and 22 tiling arrays (Cawley et al., 2004). It was observed that 3.8% of the data representing short-range (<1 kb) repetitive probe measurements were filtered out, leaving behind 1 014 067 non-redundant probe measurements. With the HMM algorithm described in Section 2.4, we identified 98 p53 TFBSs (defined as HMM-Full TFBSs and summarized in Table 1 in the Supplementary materials), which include all the 10 p53 TFBSs previously verified by quantitative PCR (S. Bekiranov, personal communication: Cawley et al., 2004 reported 11 quantitative PCR verified regions for p53-FL, but actually only 10 regions were verified). Overall, 24 of 98 HMM-Full TFBSs overlap with the 48 p53-FL TFBSs reported by Cawley et al. (2004). Furthermore, although we did not use the p53-DO1 data in our HMM analysis, six additional TFBSs in our HMM-Full results agree with the p53-DO1 result from Cawley et al. (2004). In a typical ChIP-enriched region (Fig. 3), HMM will fit the data into a smooth, symmetrical, bell-shaped curve. Often, the p53 binding motif appears in the middle region of the curve with the highest HMM value.

Fig. 3. Typical ChIP-enriched region (Blk78). Quantile normalized (PM−MM) probe values for 3 ChIPs (p53-FL) and 3 controls (GST) as well as HMM log-odds enrichment value for ChIP and control on each probe were mapped to the chromosome 22 positions (bottom line). Although the HMM log-odds enrichment values are from six ChIPs and six controls, only three ChIPs and three controls are shown in this figure. The height of the vertical bar is proportional to either probe signal or HMM enrichment value with the horizontal line indicating the value 0. The ChIP-enriched region with default cutoff is indicated by the rectangle in the middle. HMM fits the data into a smooth, symmetrical, bell-shaped curve. The two-10mer p53 binding motif with 0 bp gap (AGACAGGCTC{0}AGGCATGCCA, indicated as asterisk) is right in the middle region of this curve with the highest HMM enrichment value.

There are two major groups of repeats in eukaryotic genomes: interspersed repeats mainly represent degenerate
copies of transposable elements dispersed at various locations, whereas tandem repeats are usually confined to specific genomic regions where a unit is tandemly repeated almost exactly from several to thousands of times. To minimize potential cross-hybridization, current Affymetrix probe design (Kapranov et al., 2002) rejects probes residing in the interspersed repeats and tandem repeats with short period (roughly ≤12) identified by RepeatMasker. Even though we filter out the repetitive probe measurements, each repeat unit can still be represented by probe(s) with different 25mers. Therefore, regions (∼1% of the genome) containing tandem repeats with long period (>12) and high copy number (>10) will have more chance to be falsely predicted as ChIP-enriched. This false-positive prediction includes both a higher probe signal value than the real one and expansion of the real enriched area to the entire tandem repeat region. In one example, Blk55 (Table 1 in the Supplementary materials), containing a tandem repeat with period size of 54 bp and copy number of 139, is ∼7 kb long and has an extremely high enrichment score of 14.7. This region, comprising ∼160 probes, was not repeat-masked during the array design and might be falsely predicted as ChIP-enriched. We identified a total of four TFBSs within the tandem repeats (Table 1 in the Supplementary materials). Among the 21 TFBSs from Cawley et al. (2004) that were not identified by our HMM approach, almost half are within the tandem repeats. They may indicate a higher number of false positive predictions by Cawley et al. (2004), although they could be involved in various regulatory mechanisms (Nakamura et al., 1998). We decided to keep all the HMM-identified TFBSs within the tandem repeats, but only use the fully-repeat-masked sequences for downstream de novo motif finding.

In addition to the known repeats, non-RepeatMasked large segmental duplications (>90% identity, >1 kb in length) cover ∼5.3% of the euchromatic genome from the current human genome assembly (International Human Genome Sequencing Consortium, 2004). This is the most difficult cross-hybridization problem for ChIP-chip experiments on genome tiling arrays. Without other independent evidence, it is impossible to discriminate one or more copies of the real enriched DNA from the large segmental duplications based solely on tiling array hybridization. For example, one ∼40 kb segment duplicates six times within chromosome 22 with ∼99% sequence identity. We found two TFBSs (the first occurrences are Blk35 and Blk36) on each copy of this duplicated segment, generating 12 TFBSs 'redundant' at sequence identity level. A total of 26 from the 98 HMM-Full TFBSs were found within these segmental duplications and were kept intact in our current analysis.

3.2 Identification of p53-binding motif

Extracting putative regulatory motifs from ChIP-enriched regions is difficult for genome tiling array data because the long enriched sequences increase the background noise. Therefore, we carried out motif finding only on the 43 high confidence regions with a stringent log-odds enrichment cutoff value of 6 in ChIP and the same default cutoff in control, and defined the TFBSs containing these high confidence regions as HMM-High-Confidence (Table 1 in the Supplementary materials). Using a motif-finding program MDscan (Liu et al., 2002), we successfully recovered a strong 10mer palindromic binding motif (Fig. 4 TilingArray motif) from these 43 fully-repeat-masked high-confidence regions. This motif resembles both the p53 motif from entry M00761 of TRANSFAC (Matys et al., 2003) and the Classic p53 motif derived by aligning p53 binding sequences from two publications (el-Deiry et al., 1992; Funk et al., 1992). A similar TilingArray motif could still be recovered after the segmental duplication sequences were removed from the high-confidence regions. It is the first time a p53 motif is successfully predicted from either promoters of co-expressed genes or ChIP-chip enriched sequences by de novo motif finding algorithms. Biologists often use the prior knowledge of the TFBS motif from literature or databases to search for the occurrences of the binding sites. However, if the motif is obtained only from several known sites, it may be either too restrictive and miss many real binding sites or too general and find many false positive sites. In contrast, the p53 TilingArray motif identified in this study might represent the unbiased characterization of p53-DNA interaction. For example, the Classic p53 binding motif requires a C at position 4 and a G at position 7, whereas the TilingArray motif is somewhat degenerate at these two positions. Some studies (Resnick-Silverman et al., 1998; Jaiswal and Narayan, 2001) showed p53 to bind to the sequences with mutations in these two positions, indicating that the TilingArray p53 motif might better characterize the p53 binding at these two positions than the Classic p53 motif.

Fig. 4. 10mer sequence logos of p53 motifs from (a) MDscan identified motif from 43 HMM-identified high confidence regions from p53 ChIP-chip experiments on genome tiling arrays; (b) TRANSFAC Professional database; (c) aligned p53 binding sequences from two Classic publications (el-Deiry et al., 1992; Funk et al., 1992) defining the p53 consensus binding site.

3.3 Motif enrichment

p53 was known to bind to two copies of the 10mer palindrome motif with a variable spacer between them (el-Deiry et al., 1992). We mapped the p53 motif occurrences to the fully-repeat-masked sequences of chromosomes 21 and 22, allowing up to 30 bp spacer between the 10mer motif pairs. A substantial enrichment of the TilingArray motif pairs was observed from our HMM identified TFBSs. Using a match score cutoff of 10 (corresponding to 3.3 matches per 10 kb on chromosomes 21 and 22), TilingArray motif pairs are in 40, 34 and 25% of HMM-High-Confidence, HMM-Full and Cawley's TFBS, respectively, corresponding to 3.3-, 2.8- and 2.3-fold enrichment compared with the genomic background (Table 1). In addition, in the TFBSs lacking TilingArray motif pairs, 10 HMM-High-Confidence and 21 HMM-Full were found to contain at least one copy of the TilingArray motif. The single copy TilingArray motif (not motif pairs) represents a 2.7-fold enrichment compared with the genomic background. One might suspect the validity of this enrichment analysis, since the TilingArray motif itself is extracted from the HMM identified high confidence regions and thus should be much more enriched. To address this issue, we conducted enrichment analysis using two independent representations of p53 motif (TRANSFAC and Classic, Fig. 4) with the same criteria on the three sets of TFBSs (Table 1). The result showed that the fold enrichments of these two p53 motifs in our HMM identified regions are still much higher than those in Cawley's data.

Table 1. Enrichment of p53 binding motifs

Motif        TFBS                 No. of sites  No. of bases  No. of motifs  % Sites with motif  No. of motifs per 10 kb  Fold-enrichment  Binomial P-value
TilingArray  HMM-High-Confidence  43            31,394        34             40                  10.8                     3.3              2.3e−09
TilingArray  HMM-Full             98            58,734        55             34                  9.4                      2.8              4.8e−12
TilingArray  Cawley's             48            25,994        20             25                  7.7                      2.3              2.1e−04
TRANSFAC     HMM-High-Confidence  43            31,394        73             60                  23.3                     3.5              <2.2e−16
TRANSFAC     HMM-Full             98            58,734        101            42                  17.2                     2.6              <2.2e−16
TRANSFAC     Cawley's             48            25,994        35             31                  13.5                     2.0              3.0e−05
Classic      HMM-High-Confidence  43            31,394        26             35                  8.3                      4.6              2.8e−10
Classic      HMM-Full             98            58,734        36             23                  6.1                      3.4              3.0e−10
Classic      Cawley's             48            25,994        13             17                  5.0                      2.8              6.3e−4

Three p53 motifs (Fig. 4) were mapped to three sets of p53 TFBSs (Table 1 in the Supplementary materials). The number of bases refers to the number of non-repeat nucleotides in fully-repeat-masked TFBSs. A motif-matching scoring function with cutoff of 10, allowing up to 30 bp spacer between the 10mer motif pairs, was used to determine the number of matches in individual TFBS. This cutoff corresponds to 3.3 matches per 10 kb on chromosomes 21 and 22. Fold enrichment was inferred by comparing the motif occurrences in TFBSs with those in fully-repeat-masked chromosomes 21 and 22 sequences. A one-tailed binomial test was used to determine the P-value attached to the motif enrichment. A substantial enrichment of p53 motif was observed in our HMM identified TFBS.

3.4 Annotation of p53 TFBS

We examined the proximities of p53 TFBSs to traditional transcriptional regulatory regions, such as CpG islands and 5′ upstream or 3′ downstream of RefSeq genes (Table 2). A total of 5, 6 and 8 TFBSs in HMM-High-Confidence, HMM-Full and Cawley's, respectively, are within 1 kb of an annotated CpG island. This seems to indicate that more of Cawley's TFBSs are proximal to CpG islands than those identified by HMM. However, since TFBSs <1 kb in Cawley et al. (2004) were extended equally in both directions to a length of 1 kb, they are more likely to be longer than ours. When the distance between TFBS and CpG island was expanded to 3 kb, we found almost the same number of TFBS in HMM-High-Confidence and in Cawley's TFBSs. It is worth noting that no TFBS of Cawley is <1 kb of 5′ upstream or 3′ downstream of a RefSeq gene, whereas three TFBSs identified in HMM are within those regions. The one TFBS found to be within 1 kb of 3′ downstream of a RefSeq gene is of special interest, because it suggests the potential antisense transcripts at the 3′ end of the gene.

Table 2. Annotation of p53 TFBS locations

TFBS                 No. of sites  CpG island <1 kb  CpG island <3 kb  RefSeq gene <1 kb to 5′  RefSeq gene <1 kb to 3′
HMM-High-Confidence  43            5                 10                1                        1
HMM-Full             98            6                 20                2                        1
Cawley's             48            8                 12                0                        0

The three sets of TFBSs are the same as in Table 1 in the Supplementary materials. CpG island and RefSeq Gene annotation tracks were retrieved from UCSC Genome Browser on April 2003 Human Genome Assembly (http://genome.ucsc.edu/). 5′ and 3′ refer to the transcription starts and ends in the RefSeq Gene track. The distance between TFBS and CpG island or RefSeq Gene is the space between their nearest boundaries.

We extracted 40 well-documented p53 directly regulated genes from TRANSFAC (Matys et al., 2003) and mapped them to the human genome. Unfortunately, none of them is located along human chromosomes 21 and 22. Recently, two studies (Mirza et al., 2003; Kho et al., 2004) identified thousands of p53-target genes (direct or indirect) from microarray expression analysis. They further narrowed down the potential p53 directly regulated genes to several hundred by finding the p53-binding consensus sequence in their regulatory regions (Mirza et al., 2003) or by excluding the genes whose expression is not directly influenced by p53 protein level (Kho et al., 2004). Thirty of these potential p53 directly regulated genes are along chromosomes 21 and 22. We found two TFBSs to be associated with these genes: Blk73, which was not identified by Cawley et al. (2004), is in the intron of known gene PITPNB (protein gi|1060905) with a 2.8-expression-fold change (Mirza et al., 2003) in response to p53; Blk87, which was verified by quantitative PCR in Cawley et al. (2004), is in the ∼13 kb upstream of known gene AB051455 (protein gi|6572156) with −4.1-expression-fold change (Mirza et al., 2003) in response to p53.

3.5 Method evaluation

Our analysis strategy (Fig. 1) differs from the previous study by Cawley et al. (2004) in data normalization, repeat elements filtering and HMM prediction. It is worth checking whether the better performance of binding sequence prediction is due to the HMM prediction or the other two factors. To investigate the real difference between HMM and Mann–Whitney U-test, we performed the Mann–Whitney U-test on our quantile normalized datasets with repetitive probe measurements removed. We used the same criteria from Cawley et al. (2004), i.e. ±500 bp sliding window; probe values were transformed to log2(max(PM−MM, 1)); predicted regions separated by <500 bp were merged together; two sets of prediction (FL–DO1, FL–Input) were further merged together to form a non-redundant set. Using the same P-value cutoff of 1e−5, we identified 65 TFBSs, from which additional 11 TFBSs could be found in HMM-Full besides the previous 24 overlapping TFBSs between HMM-Full and Cawley's data. Most of the 11 additional TFBSs fall just barely below the 1e−5 cutoff. We gradually decreased the P-value cutoff from 1e−5 to 1e−9 to conduct the de novo p53-binding motif searching on the fully-repeat-masked resulting sequences. With none of these cutoffs could MDscan recover a pattern resembling the p53 consensus binding site. As for the motif enrichment study, we used the 1e−5 P-value cutoff and expanded the resulting TFBS equally in each direction to have a minimum length of 750 bp. Therefore, the total number of non-repeat nucleotides in these 65 TFBSs is ∼34 kb, which is comparable with that (∼31 kb) of HMM-High-Confidence. Using three p53 motif representations with the same criteria as in Table 1, we found 7.6, 15.7 and 6.4 motifs per 10 kb for TilingArray, TRANSFAC and Classic p53 motifs, respectively. The motif enrichment is much lower than that in our HMM-High-Confidence regions. All of the above results suggest that HMM contributed greatly to the success of our binding sequence prediction.

4 DISCUSSION

We present a fast, scalable and sensitive HMM approach for analyzing ChIP-chip experiments on genome tiling arrays. The algorithm is fast because replicate datasets at each probe location are considered together in a single HMM run to estimate its enrichment probability. On a 2.4 GHz Xeon server, it takes ∼6 min to run HMM on six ChIP and six control replicates each with ∼1 million probes. Because of its speed, the algorithm is scalable on tiling arrays of either small regions or the whole genome, and flexible enough to be used on other organisms. Furthermore, the algorithm is sensitive in identifying all the regions previously verified by quantitative PCR. The following independent lines of evidence suggest that the 98 TFBSs we identified are likely to be genuine regulatory elements. First, our analysis strategy reported fewer TFBSs residing in non-RepeatMasked tandem repeat and large segmental duplication regions, which are questionable in their regulatory function. Second, a TilingArray motif resembling the TRANSFAC and Classic p53 motif was successfully discovered from the high confidence TFBSs. Finally, matching the TilingArray, TRANSFAC and Classic p53 motifs in the TFBSs reveals substantial enrichment of the motifs compared with the genomic background.

Our algorithm overcomes the intrinsic weakness of the Mann–Whitney U-test used in Cawley et al. (2004). Without enough replicates and a long enough window, the Mann–Whitney U-test cannot identify enriched regions with a sufficiently small P-value. On account of the high cost of the tiling arrays, often biologists cannot afford more than three replicates. The shearing process in ChIP usually generates fragments <1 kb long. Enforcing a 1 kb window may not only miss short ChIP-enriched regions but also include sequences outside the enriched regions which may confound later motif detection. The Mann–Whitney U-test treats every probe equally and fails to consider probe-by-probe variability. The statistic U does not reflect whether probe values are fluctuating a lot or continuously high within the window; only the latter indicates ChIP-enriched TFBS.

Our HMM approach is based on normalizing data on tiling arrays from many experiments and modeling the behavior of each individual probe. One laboratory may not necessarily have enough ChIP-chip replicates, but pooling together freely available tiling array data from many different laboratories and different experiments (e.g. ChIP-chip against different TFs or different tissues) can provide enough information for the modeling requirement. Probe behavior estimated from pooled experiments serves as a baseline for each ChIP experiment to be compared with, thus HMM can even get reasonable results from a single ChIP experiment. This is particularly useful for surveying at the beginning of a ChIP experiment to explore antibody quality, culture conditions or preliminary quality assessment of each replicate. Our HMM approach calculates enrichment probability at each probe location, so the exact enriched region boundaries could be determined. Furthermore, the sensitive HMM approach can identify short ChIP-enriched sequence. If we exclude those TFBS containing tandem repeats, the average length (771 bp) of the

oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185–193.

Cawley,S., Bekiranov,S., Ng,H.H., Kapranov,P., Sekinger,E.A., Kampa,D., Piccolboni,A., Sementchenko,V., Cheng,J., Williams,A.J. et al.
(2004) Unbiased mapping of transcrip- HMM-High-Conﬁdence TFBSs not identiﬁed by the Mann– tion factor binding sites along human chromosomes 21 and 22 Whitney U -test is much shorter than the average length (984 points to widespread regulation of noncoding RNAs. Cell, 116, bp) of HMM-High-Conﬁdence TFBSs identiﬁed by Cawley 499–509. et al. (2004). Conlon,E.M., Liu,X.S., Lieb,J.D. and Liu,J.S. (2003) Integrating Our analysis seems to reveal two interesting characterist- regulatory motif discovery and genome-wide expression analysis. ics about p53 regulation. First, p53 binding may not simply Proc. Natl Acad. Sci. USA, 100, 3339–3344. be the classic two 10mer palindrome motif separated by a el-Deiry,W.S., Kern,S.E., Pietenpol,J.A., Kinzler,K.W. and variable length spacer of 0–13 bp. Many HMM identiﬁed Vogelstein,B. (1992) Deﬁnition of a consensus binding site for p53. Nat. Genet., 1, 45–49. TFBSs contain only a single copy of the TilingArray motif Euskirchen,G., Royce,T.E., Bertone,P., Martone,R., Rinn,J.L., or multiple copies that are >30 bp apart. In almost half (43) Nelson,F.K., Sayward,F., Luscombe,N.M., Miller,P., of the HMM-Full TFBSs, including four quantitative PCR Gerstein,M., Weissman,S. and Snyder,M. (2004) CREB binds veriﬁed regions, this motif does not occur at all. The absence to multiple loci on human chromosome 22. Mol. Cell. Biol., 24, of known p53 motif or motif pairs in TFBS suggests that 3804–3814. there might be an alternative binding mechanism (Yin et al., Funk,W.D., Pak,D.T., Karas,R.H., Wright,W.E. and Shay,J.W. 2003) for p53. Second, p53 binding sites may not be con- (1992) A transcriptionally active DNA-binding site for human served between human and rodents. The average sequence p53 protein complexes. Mol. Cell. Biol., 12, 2866–2871. identity for all the 98 HMM-Full TFBSs is only ∼30% based Harbison,C.T., Gordon,D.B., Lee,T.I., Rinaldi,N.J., Macisaac,K.D., on the BLASTZ global alignment .(Schwartz et al., 2003) Danford,T.W. 
Hannett,N.M., Tagne,J.B., Reynolds,D.B., Yoo,J. between human (hg15) and rodents (either mouse mm3 or rat et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99–104. rn2). Even the average sequence identity of all the quantitative International Human Genome Sequencing Consortium (2004) Fin- PCR veriﬁed regions is only 42%. Furthermore, 37 TFBSs, ishing the euchromatic sequence of the human genome. Nature, including one quantitative PCR veriﬁed region, have no rodent 431, 931–945. counterparts at all. Many algorithms for ﬁnding eukaryotic Jaiswal,A.S. and Narayan,S. (2001) p53-dependent transcriptional regulatory elements rely on identifying conserved sequences regulation of the APC promoter in colon cancer cells treated with from alignment of orthologous non-coding sequences (Loots DNA alkylating agents. J. Biol. Chem., 276, 18193–18199. et al., 2002; Liu et al., 2004). Our result suggests that the loss Kapranov,P., Cawley,S.E., Drenkow,J., Bekiranov,S., of sensitivity of this approach may be signiﬁcant. The major- Strausberg,R.L., Fodor,S.P. and Gingeras,T.R. (2002) Large-scale ity of the TFBSs identiﬁed by ChIP-chip experiments on tiling transcriptional activity in chromosomes 21 and 22. Science, 296, arrays will be sacriﬁced if sequence identity is included as a 916–919. criterion. Kho,P.S., Wang,Z., Zhuang,L., Li,Y., Chew,J.L., Ng,H.H., Liu,E.T. and Yu,Q. (2004) p53-regulated transcriptional program associ- In summary, the HMM approach presented here provides ated with genotoxic stress-induced apoptosis. J. Biol. Chem., 279, biologists with an efﬁcient computational approach for ana- 21183–21192. lyzing the massive data generated from ChIP-chip on genome Li,Z., Van Calcar,S., Qu,C., Cavenee,W.K., Zhang,M.Q. and Ren,B. tiling arrays. Its adoption will contribute to the more com- (2003) A global transcriptional regulatory role for c-Myc in prehensive models and understanding of gene regulatory Burkitt’s lymphoma cells. Proc. 
Natl Acad. Sci. USA, 100, networks in higher eukaryotes. 8164–8169. Lieb,J.D., Liu,X., Botstein,D. and Brown,P.O. (2001) Promoter- speciﬁc binding of Rap1 revealed by genome-wide maps of ACKNOWLEDGEMENTS protein-DNA association. Nat. Genet., 28, 327–334. The authors would like to thank David Harrington, Chen Li, Lieberfarb,M.E., Lin,M., Lechpammer,M., Li,C., Tanenbaum,D.M., Molin Wang and Jun S. Liu for their insights and advice on Febbo,P.G., Wright,R.L., Shim,J., Kantoff,P.W., Loda,M., the tiling array data analysis. The project was partly supported Meyerson,M. and Sellers,W.R. (2003) Genome-wide loss of het- by Claudia Adams Barr Program in Cancer Research. erozygosity analysis from laser capture microdissected prostate cancer using single nucleotide polymorphic allele (SNP) arrays and a novel bioinformatics platform dChipSNP. Cancer Res., REFERENCES 63, 4781–4785. Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for Benson,G. (1999) Tandem repeats ﬁnder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580. ﬁnding protein–DNA binding sites with applications to Bolstad,B.M., Irizarry,R.A., Astrand,M. and Speed,T.P. (2003) chromatin-immunoprecipitation microarray experiments. Nat. A comparison of normalization methods for high density Biotechnol., 20, 835–839. i281 “bti1046” — 2005/6/10 — page 281 — #8 W.Li et al. Liu,Y., Liu,X.S., Wei,L., Altman,R.B. and Batzoglou,S. (2004) Rabiner,L. (1989) A tutorial on hidden Markov models and selected Eukaryotic regulatory element conservation analysis and iden- applications in speech recognition. Proc. IEEE, 77, 257–286. tiﬁcation using comparative genomics. Genome Res., 14, Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., 451–458. Simon,I., Zeitlinger,J., Schreiber,J., Hannett,N., Kanin,E. et al. Loots,G.G., Ovcharenko,I., Pachter,L., Dubchak,I. and Rubin,E.M. (2000) Genome-wide location and function of DNA binding (2002) rVista for comparative sequence-based discovery of proteins. 
Science, 290, 2306–2309. functional transcription factor binding sites. Genome Res., 12, Resnick-Silverman,L., St Clair,S., Maurer,M., Zhao,K. and 832–839. Manfredi,J.J. (1998) Identiﬁcation of a novel class of genomic Matys,V., Fricke,E., Geffers,R., Gossling,E., Haubrock,M., Hehl,R., DNA-binding sites suggests a mechanism for selectivity in target Hornischer,K., Karas,D., Kel,A.E., Kel-Margoulis,O.V. et al. gene activation by the tumor suppressor protein p53. Genes Dev., (2003) TRANSFAC: transcriptional regulation, from patterns to 12, 2102–2107. proﬁles. Nucleic Acids Res., 31, 374–378. Schwartz,S., Kent,W.J., Smit,A., Zhang,Z., Baertsch,R., Mirza,A., Wu,Q., Wang,L., McClanahan,T., Bishop,W.R., Hardison,R.C., Haussler, D. and Miller, W. (2003) Human–mouse Gheyas,F., Ding,W., Hutchins,B., Hockenberry,T., alignments with BLASTZ. Genome Res., 13, 103–107. Kirschmeier,P., Greene,J.R. and Liu.S. (2003) Global tran- Wells,J., Yan,P.S., Cechvala,M., Huang,T. and Farnham,P.J. (2003) scriptional program of p53 target genes during the process Identiﬁcation of novel pRb binding sites using CpG microarrays of apoptosis and cell cycle progression. Oncogene, 22, suggests that E2F recruits pRb to speciﬁc genomic sites during S 3645–3654. phase. Oncogene, 22, 1445–1460. Nakamura,Y., Koyama,K. and Matsushima,M. (1998) VNTR (vari- Workman,C., Jensen,L.J., Jarmer,H., Berka,R., Gautier,L., able number of tandem repeat) sequences as transcriptional, Nielser,H.B., Saxild,H.H., Nielsen,C., Brunak,S. and Knudsen,S. translational, or functional regulators. J. Hum. Genet., 43, (2002) A new non-linear normalization method for reducing 149–152. variability in DNA microarray experiments. Genome Biol., 3, Odom,D.T., Zizlsperger,N., Gordon,D.B., Bell,G.W., Rinaldi,N.J., research0048. Murray,H.L. Volkert,T.L., Schreiber,J., Rolfe,P.A., Gifford,D.K. Yin,Y., Liu,Y.X., Jin,Y.J., Hall,E.J. and Barrett,J.C. (2003) PAC1 et al. 
(2004) Control of pancreas and liver gene expression by phosphatase is a transcription target of p53 in signalling apoptosis HNF transcription factors. Science, 303, 1378–1381. and growth suppression. Nature, 422, 527–531. i282 “bti1046” — 2005/6/10 — page 282 — #9 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press http://www.deepdyve.com/lp/oxford-university-press/a-hidden-markov-model-for-analyzing-chip-chip-experiments-on-genome-ytoKB8iARe
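As an illustration of the two-state HMM described in Section 2.4, the following is a minimal Python sketch, not the authors' implementation. It assumes per-probe baseline means and standard deviations are already estimated, averages emission densities across replicates, and restarts the chain when adjacent probes are more than 500 bp apart. The function names (`clip_z`, `emissions`, `log_odds`), the argument `p_switch` (standing for J/K) and the per-step scaling are illustrative choices; treating a gap as a break into independent chains is one possible reading of the reset rule.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def clip_z(x, mu, sigma, lo=-2.5, hi=3.0):
    """Cap outliers: a Z-value above 3 or below -2.5 is reset to that bound."""
    z = max(lo, min(hi, (x - mu) / sigma))
    return mu + z * sigma


def emissions(replicates, mu, sigma):
    """Average per-replicate emission densities; returns (non_enriched, enriched) per probe."""
    out = []
    for i in range(len(mu)):
        e0 = e1 = 0.0
        for rep in replicates:
            x = clip_z(rep[i], mu[i], sigma[i])
            e0 += normal_pdf(x, mu[i], sigma[i])                         # N(mu, sigma^2)
            e1 += normal_pdf(x, mu[i] + 2.0 * sigma[i], 1.5 * sigma[i])  # N(mu + 2 sigma, (1.5 sigma)^2)
        out.append((e0 / len(replicates), e1 / len(replicates)))
    return out


def log_odds(replicates, mu, sigma, positions, p_switch, max_gap=500):
    """Per-probe log-odds of enriched versus non-enriched (state 1 versus state 0)."""
    k = len(mu)
    init = (1.0 - p_switch, p_switch)
    trans = ((1.0 - p_switch, p_switch), (p_switch, 1.0 - p_switch))
    emit = emissions(replicates, mu, sigma)

    # Forward pass, rescaled at every probe to avoid numerical underflow.
    fwd = []
    for i in range(k):
        if i == 0 or positions[i] - positions[i - 1] > max_gap:
            prior = init  # start of chromosome or long gap: reset to initial probabilities
        else:
            prev = fwd[-1]
            prior = tuple(prev[0] * trans[0][s] + prev[1] * trans[1][s] for s in (0, 1))
        a = [prior[s] * emit[i][s] for s in (0, 1)]
        total = a[0] + a[1]
        fwd.append((a[0] / total, a[1] / total))

    # Backward pass; across a long gap the future carries no information about the past.
    bwd = [(1.0, 1.0)] * k
    for i in range(k - 2, -1, -1):
        if positions[i + 1] - positions[i] > max_gap:
            continue
        nxt = bwd[i + 1]
        b = [sum(trans[s][t] * emit[i + 1][t] * nxt[t] for t in (0, 1)) for s in (0, 1)]
        total = b[0] + b[1]
        bwd[i] = (b[0] / total, b[1] / total)

    return [math.log((fwd[i][1] * bwd[i][1]) / (fwd[i][0] * bwd[i][0])) for i in range(k)]
```

With `p_switch` set to roughly 300 / 1,000,000, a run of several consecutive probes about three standard deviations above baseline is needed before the log-odds becomes positive, because entering and leaving the enriched state each cost a factor of J/K.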

A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences

Li, Wei; Meyer, Clifford A.; Liu, X. Shirley

Bioinformatics, Volume 21 (suppl_1): 9 – Jun 1, 2005

References (29)

X. Liu, D. Brutlag, J. Liu (2002)
An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments
Nature Biotechnology, 20
W. El-Deiry, S. Kern, J. Pietenpol, K. Kinzler, B. Vogelstein (1992)
Definition of a consensus binding site for p53
Nature Genetics, 1
Zirong Li, Sara Calcar, Chunxu Qu, W. Cavenee, Michael Zhang, B. Ren (2003)
A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells
Proceedings of the National Academy of Sciences of the United States of America, 100
P. Kapranov, S. Cawley, J. Drenkow, S. Bekiranov, R. Strausberg, S. Fodor, T. Gingeras (2002)
Large-Scale Transcriptional Activity in Chromosomes 21 and 22
Science, 296
W. Funk, Daniel Pak, R. Karas, W. Wright, J. Shay (1992)
A transcriptionally active DNA-binding site for human p53 protein complexes
Molecular and Cellular Biology, 12
Christopher Harbison, D. Gordon, Tong Lee, Nicola Rinaldi, K. MacIsaac, Timothy Danford, N. Hannett, J. Tagne, David Reynolds, Jane Yoo, E. Jennings, J. Zeitlinger, Dmitry Pokholok, Manolis Kellis, P. Rolfe, K. Takusagawa, E. Lander, D. Gifford, E. Fraenkel, R. Young (2004)
Transcriptional regulatory code of a eukaryotic genome
Nature, 431
Bing Ren, F. Robert, John Wyrick, O. Aparicio, E. Jennings, I. Simon, J. Zeitlinger, J. Schreiber, N. Hannett, Elenita Kanin, T. Volkert, Christopher Wilson, S. Bell, R. Young (2000)
Genome-wide location and function of DNA binding proteins.
Science, 290 5500
J. Lieb, Xiaole Liu, D. Botstein, P. Brown (2001)
Promoter-specific binding of Rap1 revealed by genome-wide maps of protein–DNA association
Nature Genetics, 28
Yusuke Nakamura, K. Koyama, M. Matsushima (1998)
VNTR (variable number of tandem repeat) sequences as transcriptional, translational, or functional regulators
Journal of Human Genetics, 43
A. Mirza, Qun Wu, Luquan Wang, T. Mcclanahan, W. Bishop, F. Gheyas, Wei Ding, B. Hutchins, Tish Hockenberry, P. Kirschmeier, J. Greene, Suxing Liu (2003)
Global transcriptional program of p53 target genes during the process of apoptosis and cell cycle progression
Oncogene, 22
A. Jaiswal, S. Narayan (2001)
p53-dependent Transcriptional Regulation of the APC Promoter in Colon Cancer Cells Treated with DNA Alkylating Agents
The Journal of Biological Chemistry, 276
J. Bonfield, J. Galagan (2004)
Finishing the euchromatic sequence of the human genome
Nature, 431
L. Resnick-Silverman, S. Clair, Matthew Maurer, Kathy Zhao, J. Manfredi (1998)
Identification of a novel class of genomic DNA-binding sites suggests a mechanism for selectivity in target gene activation by the tumor suppressor protein p53.
Genes & development, 12 14
S. Schwartz, W. Kent, A. Smit, Zheng Zhang, R. Baertsch, R. Hardison, D. Haussler, W. Miller (2003)
Human-mouse alignments with BLASTZ.
Genome research, 13 1
G. Loots, I. Ovcharenko, L. Pachter, I. Dubchak, E. Rubin (2002)
rVista for comparative sequence-based discovery of functional transcription factor binding sites.
Genome research, 12 5
P. Kho, Zhen Wang, Zhuang Li, Yuqing Li, Joon-Lin Chew, H. Ng, E. Liu, Qiang Yu (2004)
p53-regulated Transcriptional Program Associated with Genotoxic Stress-induced Apoptosis*
Journal of Biological Chemistry, 279
V. Matys, E. Fricke, R. Geffers, E. Gößling, Martin Haubrock, R. Hehl, K. Hornischer, D. Karas, A. Kel, O. Kel-Margoulis, D. Kloos, S. Land, B. Lewicki-Potapov, H. Michael, R. Münch, I. Reuter, S. Rotert, H. Saxel, Maurice Scheer, S. Thiele, E. Wingender (2003)
TRANSFAC®: transcriptional regulation, from patterns to profiles
Nucleic Acids Res., 31
D. Odom, Nora Zizlsperger, D. Gordon, G. Bell, Nicola Rinaldi, H. Murray, T. Volkert, J. Schreiber, P. Rolfe, D. Gifford, E. Fraenkel, G. Bell, R. Young (2004)
Control of Pancreas and Liver Gene Expression by HNF Transcription Factors
Science, 303
B. Bolstad, R. Irizarry, M. Åstrand, T. Speed (2003)
A comparison of normalization methods for high density oligonucleotide array data based on variance and bias
Bioinformatics, 19 2
C. Workman, L. Jensen, H. Jarmer, R. Berka, L. Gautier, H. Nielser, H. Saxild, Claus Nielsen, S. Brunak, S. Knudsen (2002)
A new non-linear normalization method for reducing variability in DNA microarray experiments
Genome Biology, 3
L. Rabiner (1989)
A tutorial on hidden Markov models and selected applications in speech recognition
Proc. IEEE, 77
Yueyi Liu, X. Liu, Liping Wei, R. Altman, S. Batzoglou (2004)
Eukaryotic regulatory element conservation analysis and identification using comparative genomics.
Genome research, 14 3
M. Lieberfarb, Ming Lin, M. Lechpammer, Cheng Li, D. Tanenbaum, P. Febbo, R. Wright, Judy Shim, P. Kantoff, M. Loda, M. Meyerson, W. Sellers (2003)
Genome-wide loss of heterozygosity analysis from laser capture microdissected prostate cancer using single nucleotide polymorphic allele (SNP) arrays and a novel bioinformatics platform dChipSNP.
Cancer research, 63 16
J. Wells, P. Yan, M. Cechvala, T. Huang, P. Farnham (2003)
Identification of novel pRb binding sites using CpG microarrays suggests that E2F recruits pRb to specific genomic sites during S phase
Oncogene, 22
Erin Conlon, X. Liu, J. Lieb, Jun Liu (2003)
Integrating regulatory motif discovery and genome-wide expression analysis
Proceedings of the National Academy of Sciences of the United States of America, 100
Yuxin Yin, Yu-Xin Liu, Yan Jin, E. Hall, J. Barrett (2003)
PAC1 phosphatase is a transcription target of p53 in signalling apoptosis and growth suppression
Nature, 422
Gary Benson (1999)
Tandem repeats finder: a program to analyze DNA sequences.
Nucleic acids research, 27 2
S. Cawley, S. Bekiranov, H. Ng, P. Kapranov, E. Sekinger, D. Kampa, A. Piccolboni, V. Sementchenko, Jill Cheng, A. Williams, R. Wheeler, Brant Wong, J. Drenkow, M. Yamanaka, Sandeep Patel, Shane Brubaker, H. Tammana, G. Helt, K. Struhl, T. Gingeras (2004)
Unbiased Mapping of Transcription Factor Binding Sites along Human Chromosomes 21 and 22 Points to Widespread Regulation of Noncoding RNAs
Cell, 116

Publisher: Oxford University Press
eISSN: 1367-4811
DOI: 10.1093/bioinformatics/bti1046
pmid: 15961467

Abstract

BIOINFORMATICS Vol. 21 Suppl. 1 2005, pages i274–i282
doi:10.1093/bioinformatics/bti1046

A hidden Markov model for analyzing ChIP-chip experiments on genome tiling arrays and its application to p53 binding sequences

Wei Li, Clifford A. Meyer and X. Shirley Liu
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Harvard School of Public Health, Boston, MA 02115, USA

Received on January 15, 2005; accepted on March 27, 2005

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

ABSTRACT

Motivation: Transcription factors (TFs) regulate gene expression by recognizing and binding to specific regulatory regions on the genome, which in higher eukaryotes can occur far away from the regulated genes. Recently, Affymetrix developed the high-density oligonucleotide arrays that tile all the non-repetitive sequences of the human genome at 35 bp resolution. This new array platform allows for the unbiased mapping of in vivo TF binding sequences (TFBSs) using Chromatin ImmunoPrecipitation followed by microarray experiments (ChIP-chip). The massive dataset generated from these experiments poses great challenges for data analysis.

Results: We developed a fast, scalable and sensitive method to extract TFBSs from ChIP-chip experiments on genome tiling arrays. Our method takes advantage of tiling array data from many experiments to normalize and model the behavior of each individual probe, and identifies TFBSs using a hidden Markov model (HMM). When applied to the data of p53 ChIP-chip experiments from an earlier study, our method discovered many new high confidence p53 targets including all the regions verified by quantitative PCR. Using a de novo motif finding algorithm MDscan, we also recovered the p53 motif from our HMM identified p53 target regions. Furthermore, we found substantial p53 motif enrichment in these regions compared with both genomic background and the TFBSs identified earlier. Several of the newly identified p53 TFBSs are in the promoter region of known genes or associated with previously characterized p53-responsive genes.

Contact: xsliu@jimmy.harvard.edu

Supplementary information: Available at the following URL http://genome.dfci.harvard.edu/∼xsliu/HMMTiling/index.html

1 INTRODUCTION

One of the most important characteristics of gene regulation is the interaction between transcription factors (TFs) and cis-regulatory elements. Although microarrays have been widely used to understand gene-expression regulation, they do not provide information on which TFs directly regulate which genes and their interaction mechanism. In recent years, ChIP-chip has become a popular technique for studying the genome-wide location of in vivo TF–DNA interactions. ChIP-chip was first successfully adopted in yeast to identify the regulatory targets of individual TFs (Ren et al., 2000; Lieb et al., 2001) and to study the entire transcriptional regulatory networks (Harbison et al., 2004). Promoter arrays were used in the yeast ChIP-chip experiments, with a cDNA probe for each intergenic sequence. Since higher eukaryotes, especially human, have long intergenic sequences, the promoter arrays for higher eukaryotes usually contain probes just for the proximal promoters of annotated genes (Li et al., 2003; Odom et al., 2004). Unfortunately, TFBSs in higher eukaryotes can occur upstream, downstream, close to or far away from the regulated genes, or even in the introns of the genes. Thus, proximal promoter arrays may not accurately capture all the ChIP-enriched DNA. Although CpG island (Wells et al., 2003) and continuous genomic PCR fragment (Euskirchen et al., 2004) arrays have been used to address this problem, their probe densities and genome coverage are still not satisfactory.

Recently, Affymetrix developed the high-density oligonucleotide arrays that tile all non-repetitive sequences of the human genome. Tiling arrays for chromosomes 21 and 22 (Kapranov et al., 2002) are currently available and the whole genome tiling arrays are under development. These arrays have one probe pair, a perfect match (PM) probe and a mismatch (MM) probe both 25 bases long, for every non-overlapping 35 bp region in the genome. They provide the platform for the unbiased mapping of in vivo TF binding sequences (TFBSs) using ChIP-chip, but the massive amount of data (∼1 million probe pairs for chromosomes 21 and 22) generated from them poses great challenges for data analysis.

Cawley et al. (2004) conducted p53 ChIP-chip experiments on chromosomes 21 and 22 tiling arrays. Six duplicated experiments were performed on each of two p53 antibodies: FL (full length) and DO1 (N-terminal epitope), as well as two control experiments: Input (genomic input DNA without ChIP) and GST (antibody to bacterial GST). The GTRANS software provided by Affymetrix (http://www.affymetrix.com/support/developer/downloads/TilingArrayTools/index.affx) was used to predict p53 binding sequences for each antibody against each control (FL–Input, FL–GST, DO1–Input and DO1–GST). GTRANS requires at least three replicate TF ChIPs and three replicate controls to find the ChIP-enriched regions. To check whether a probe x is ChIP-enriched, all probes within 500 bp from x in all ChIP and control experiments are considered.
GTRANS uses the non-parametric Mann–Whitney U-test (equivalent to Wilcoxon rank sum test) to rank all the probe pairs by their log2(max(PM − MM, 1)) values, and checks whether the sum of ranks of all probe pairs in the ChIPs is significantly higher than that in the controls. The two datasets with the same antibody were merged together to form a non-redundant set (48 FL sites reported in the paper and 103 DO1 sites downloadable from the Supplementary website). A total of 17 and 0% of the identified p53 binding sequences were located within 1 kb of CpG islands and 5′ exons, respectively, indicating that only a small fraction of p53 sites would have been discovered by using CpG island arrays or proximal promoter arrays. When trying to find putative TF binding motifs from the 48 p53-FL sites, Cawley et al. (2004) failed to identify the p53 binding motif.

Fig. 1. Strategy diagram for analyzing ChIP-chip experiments on genome tiling arrays.

Although the Mann–Whitney U-test assumes no specific probability distribution of the data, it cannot identify enriched regions with small P-values without a long enough window and enough replicates. We developed a new algorithm and applied it to the same p53-FL dataset. We first normalized the data across all datasets published in Cawley et al. (2004), removed all the repetitive probe measurements and estimated the behavior of each probe. Then a two-hidden-state (ChIP-enriched state and non-enriched state) hidden Markov model (HMM) (Rabiner, 1989) was implemented to estimate the probability of ChIP enrichment at each probe location. We identified many potential p53 binding sequences, including all the quantitative PCR verified targets. From the high confidence regions, a p53 binding motif was successfully recovered using MDscan (Liu et al., 2002). Furthermore, we found substantial p53 motif enrichment in our HMM identified regions compared with both genomic background and TFBSs identified by Cawley et al. (2004). The entire analysis is summarized in Figure 1.

2 METHODS

2.1 Dataset

Cawley et al. (2004) conducted ChIP-chip experiments for four human TFs in two cell lines (p53-FL and p53-DO1 in HCT1116, cMyc and Sp1 in Jurkat) as well as two control experiments (input and GST) in each cell line. Each experiment contains two biological replicates and three technical replicates, generating a total of 48 datasets (samples). Each sample was hybridized to the Affymetrix chromosomes 21 and 22 tiling arrays (three arrays, A, B and C). All 48 datasets were downloaded from http://transcriptome.affymetrix.com/publication/tfbs/. The (PM−MM) value was recorded for each probe pair as a new probe value. To be able to compare probe values between samples (Workman et al., 2002), we conducted quantile normalization (Bolstad et al., 2003) on each array (A, B and C) separately to make the probe value distribution the same across all samples. We then mapped the probe values from the three arrays to the April 2003 version of the human genome according to the Affymetrix NCBIv33 GTRANS Library.

2.2 Repetitive probe measurements

Owing to the array coverage and genome sequence similarity, occasionally N probes with the same 25mer sequence might be spotted at different locations on the chips. At the same time, a 25mer might map to M different genomic locations. Therefore, a 25mer sequence might have up to N × M repetitive probe measurements (Fig. 2). Since ChIP-enriched regions are normally <1 kb, short-range repetitive probe measurements (<1 kb of each other) with high values will increase the chance of the region being falsely predicted as ChIP-enriched, and thus should be removed. In contrast, long-range repetitive probe measurements (>1 kb of each other) should be kept for they may represent valid targets (although the cross-hybridization issue should be addressed in future study). We decided to employ the following two-stage procedure to filter out short-range repetitive probe measurements prior to downstream analysis:

(1) Ensure only one measurement per probe within 1 kb of the genome. For each probe mapping to M multiple genomic locations, from the beginning of each chromosome, we deleted the latter occurrence if the genomic distance between two consecutive occurrences is <1 kb. This process was iterated until all the probe occurrences had been considered.
(2) Ensure only one probe measurement for each genomic position. We have three (A, B and C) different kinds of tiling arrays in our current datasets. The N multiple probes on the same genomic position might be from different arrays. Since we quantile normalized each array separately, there could be a huge array bias if we take the average of all the duplicated probe measurements. Therefore, we randomly chose one probe and discarded the others.

Fig. 2. Repetitive probe measurements. Probes P1 and P3 (N = 2) have the same 25mer sequences, each maps to four genomic locations (M = 4), thus generating eight (N × M) repetitive probe measurements. Repetitive probe measurements are considered short-range if they are within 1 kb (within Regions I and II) or long-range otherwise (between Regions I and II). In the two-stage procedure to filter out short-range duplications, four measurements (2, 3, 5, 6; indicated as *) are filtered out at Stage I, two measurements (4, 7; indicated as +) are removed at Stage II.

2.3 Probe behavior estimation

ChIP-chip experiments enrich only a small portion of DNA bound by the TF of interest, and control experiments capture only baseline non-specific binding. Therefore, most of the probe measurements across all datasets are not actually hybridized with ChIP-enriched DNA and could be used to estimate the baseline probe behavior. We randomly selected 5000 probes from the ∼1 million probe dataset and conducted Shapiro–Wilk normality test on the behavior (PM−MM) of each probe across the 48 datasets. The majority (∼80%) of these probes have the Shapiro–Wilk normality P-value > 0.01, accepting the null hypothesis that the distribution of probe behavior across the 48 datasets is normal. The remaining 20% probes that are not normal could be caused by ChIP or non-specific enrichment in some of the 48 datasets. We estimated the baseline probe behavior using 36 of the 48 datasets, including all of the control datasets and 99.5% of probes from the non-p53 ChIP experiments (excluding the top 0.5% percentile ∼5000 probes). The 12 p53 ChIP datasets (FL and DO1) were excluded from the parameter estimation because their ChIP enrichment signal may skew the baseline estimate. The behavior of each probe i is modeled as a normal distribution $N(\mu_i, \sigma_i^2)$, where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of probe i values of the 36 tiling array datasets.

2.4 Hidden Markov model

We designed a two-hidden-state (ChIP-enriched state and non-enriched state) HMM to estimate the probability of enrichment at each probe location i. Given J potential binding sites (assumed to be ∼300 along chromosomes 21 and 22 for p53) along chromosomes covered by K total probes, the HMM is characterized by the following:

(1) Initial probabilities: J/K for ChIP-enriched state, 1 − J/K for non-enriched state.
(2) Transition probabilities: J/K for transition to a different state, 1 − J/K for staying in the same state.
(3) Emission probability distribution of probe i in single dataset: $N(\mu_i + 2\sigma_i, (1.5\sigma_i)^2)$ for ChIP-enriched state, $N(\mu_i, \sigma_i^2)$ for non-enriched state. The parameters are based on the results on the Affymetrix SNP arrays (Lieberfarb et al., 2003).
(4) A probe i, with (PM−MM) value $p_i$, is defined as an outlier if its Z-value is >3 or <−2.5. We reassigned the Z-value of each outlier probe as 3 if Z > 3 and −2.5 if Z < −2.5.
(5) If two adjacent probes are farther apart than 500 bp in the genome (usually due to a long repeat sequence between the two probes), in the forward and backward procedure, the enriched and non-enriched state probabilities of the latter probe are reset to the initial probabilities.

To combine the results from different replicates, in either the ChIP or control group, the emission probabilities of all available replicates for each probe were averaged as the emission probability on this probe. The forward and backward algorithms of HMM (Rabiner, 1989) were used to calculate the probabilities of a probe being in ChIP-enriched versus non-enriched states. A log-odds enrichment value was given to each probe representing the ratio of being ChIP-enriched against non-enriched state. The natural cutoff for the ChIP-enriched state is 0, which indicates equal ChIP-enriched and non-enriched probabilities.

The ChIP-enriched region is defined as at least two probes with log-odds enrichment value >0 in ChIP and at least one probe with log-odds enrichment value <−15 in the control. The average enrichment value of all the probes in a ChIP-enriched region is used as a summary enrichment score for the entire region.

2.5 Sequence retrieval and repeat masking

Genomic sequences of the HMM identified ChIP-enriched regions were retrieved from the repeat-masked April 2003 version of the human genome at UCSC Genome Browser (http://genome.ucsc.edu/). We further masked out all the tandem repeats identified by Tandem repeats finder (Benson, 1999) to facilitate downstream de novo motif finding.
The resulting sequences are defined as fully-repeat-masked sequences.

2.6 Motif finding and enrichment scoring

We ranked all the fully-repeat-masked ChIP-enriched sequences by the summary enrichment score and applied MDscan (Liu et al., 2002) to find the putative p53 binding motif. For a motif of width w, MDscan first enumerates each wmer in the highest ranking sequences, and collects other wmers similar to it in these sequences to construct a candidate motif, represented as a probability matrix. A semi-Bayesian scoring function is used to remove low scoring candidate motifs and refine the rest by checking all wmers in all the ChIP-enriched sequences. If several final motifs share the same consensus, only the motif with the highest score is kept (Conlon et al., 2003).

We determined how well a given sequence segment of width w matched a motif by a function S = max(S+, S−), where S+ and S− are the following scoring formula on the sequence itself and its reverse complement, respectively:

$$S_{\pm} = \sum_{i=1}^{w} \sum_{j \in \{A,C,G,T\}} \delta_{ij} \ln \frac{p_{ij} + p_s}{b_j}$$

where p_ij is the frequency of nucleotide j at position i in the motif, p_s is a pseudocount of 0.03, b_j is the background probability of nucleotide j calculated from the intergenic regions of the human genome, δ_ij = 1 if nucleotide j is present at position i and δ_ij = 0 if nucleotide j is not present at position i.

Since the p53 is known to bind two palindrome 10mers separated by a variable spacer of length 0–13 bp (el-Deiry et al., 1992), we computed the p53 matches by summing the scores for the two matching 10mers and extended the spacer to 30 bp.

Fig. 3. Typical ChIP-enriched region (Blk78). Quantile normalized (PM−MM) probe values for 3 ChIPs (p53-FL) and 3 controls (GST) as well as HMM log-odds enrichment value for ChIP and control on each probe were mapped to the chromosome 22 positions (bottom line). Although the HMM log-odds enrichment values are from six ChIPs and six controls, only three ChIPs and three controls are shown in this figure. The height of the vertical bar is proportional to either probe signal or HMM enrichment value with the horizontal line indicating the value 0. The ChIP enriched-region with default cutoff was indicated by the rectangle in the middle. HMM fits the data into a smooth, symmetrical, bell-shaped curve. The two 10mers p53 binding motif with 0 bp gap (AGACAGGCTC{0}AGGCATGCCA, indicated as asterisk) is right in the middle region of this curve with the highest HMM enrichment value.

3 RESULTS

3.1 Identification of ChIP-enriched sequences

We applied our analysis strategy (Fig. 1) on the data of p53-FL ChIP-chip experiments on chromosomes 21 and 22 tiling arrays (Cawley et al., 2004). It was observed that 3.8% of the data representing short-range (<1 kb) repetitive probe measurements were filtered out, leaving behind 1 014 067 non-redundant probe measurements. With the HMM algorithm described in Section 2.4, we identified 98 p53 TFBSs (defined as HMM-Full TFBSs and summarized in Table 1 in the Supplementary materials), which include all the 10 p53 TFBSs previously verified by quantitative PCR (S. Bekiranov, personal communication: Cawley et al., 2004 reported 11 quantitative PCR verified regions for p53-FL, but actually only 10 regions were verified). Overall, 24 of 98 HMM-Full TFBSs overlap with the 48 p53-FL TFBSs reported by Cawley et al. (2004). Furthermore, although we did not use the p53-DO1 data in our HMM analysis, six additional TFBSs in our HMM-Full results agree with the p53-DO1 result from Cawley et al. (2004). In a typical ChIP-enriched region (Fig. 3), HMM will fit the data into a smooth, symmetrical, bell-shaped curve. Often, the p53 binding motif appears in the middle region of the curve with the highest HMM value.

There are two major groups of repeats in eukaryotic genomes: interspersed repeats mainly represent degenerate
copies of transposable elements dispersed at various locations, whereas tandem repeats are usually confined to specific genomic regions where a unit is tandemly repeated almost exactly from several to thousands of times. To minimize potential cross-hybridization, current Affymetrix probe design (Kapranov et al., 2002) rejects probes residing in the interspersed repeats and tandem repeats with short period (roughly ≤12) identified by RepeatMasker. Even though we filter out the repetitive probe measurements, each repeat unit can still be represented by probe(s) with different 25mers. Therefore, regions (∼1% of the genome) containing tandem repeats with long period (>12) and high copy number (>10) will have more chance to be falsely predicted as ChIP-enriched. This false-positive prediction includes both higher probe signal value than the real one and expanding real enriched area to the entire tandem repeat region. In one example, Blk55 (Table 1 in the Supplementary materials), containing a tandem repeat with period size of 54 bp and copy number of 139, is ∼7 kb long and has extremely high enrichment score of 14.7. This region, comprising ∼160 probes, was not repeat-masked during the array design and might be falsely predicted as ChIP-enriched. We identified a total of four TFBSs within the tandem repeats (Table 1 in the Supplementary materials). Among the 21 TFBSs from Cawley et al. (2004) that were not identified by our HMM approach, almost half are within the tandem repeats. They may indicate a higher number of false positive predictions by Cawley et al. (2004), although they could be involved in various regulatory mechanisms (Nakamura et al., 1998). We decided to keep all the HMM-identified TFBSs within the tandem repeats, but only use the fully-repeat-masked sequences for downstream de novo motif finding.

In addition to the known repeats, non-RepeatMasked large segmental duplications (>90% identity, >1 kb in length) cover ∼5.3% of the euchromatic genome from the current human genome assembly (International Human Genome Sequencing Consortium, 2004). This is the most difficult cross-hybridization problem for ChIP-chip experiments on genome tiling arrays. Without other independent evidence, it is impossible to discriminate one or more copies of the real enriched DNA from the large segment duplications based solely on tiling array hybridization. For example, one ∼40 kb segment duplicates six times within chromosome 22 with ∼99% sequence identity. We found two TFBSs (the first occurrences are Blk35 and Blk36) on each copy of this duplicated segment, generating 12 TFBSs 'redundant' at sequence identity level. A total of 26 from the 98 HMM-Full TFBSs were found within these segmental duplications and were kept intact in our current analysis.

3.2 Identification of p53-binding motif

Extracting putative regulatory motifs from ChIP-enriched regions is difficult for genome tiling array data because the long enriched sequences increase the background noise. Therefore, we carried out motif finding only on the 43 high confidence regions with a stringent log-odds enrichment cutoff value of 6 in ChIP and the same default cutoff in control and defined these TFBSs containing high confidence regions as HMM-High-Confidence (Table 1 in the Supplementary materials). Using a motif-finding program MDscan (Liu et al., 2002), we successfully recovered a strong 10mer palindromic binding motif (Fig. 4 TilingArray motif) from these 43 fully-repeat-masked high-confidence regions. This motif resembles both the p53 motif from entry M00761 of TRANSFAC (Matys et al., 2003) and the Classic p53 motif derived by aligning p53 binding sequences from two classic papers (el-Deiry et al., 1992; Funk et al., 1992). A similar TilingArray motif could still be recovered after the segmental duplication sequences were removed from the high-confidence regions. It is the first time a p53 motif is successfully predicted from either promoters of co-expressed genes or ChIP-chip enriched sequences by de novo motif finding algorithms. Biologists often use the prior knowledge of the TFBS motif from literature or databases to search for the occurrences of the binding sites. However, if the motif is obtained only from several known sites, it may be either too restrictive and miss many real binding sites or too general and find many false positive sites. In contrast, the p53 TilingArray motif identified in this study might represent the unbiased characterization of p53-DNA interaction. For example, the Classic p53 binding motif requires a C at position 4 and a G at position 7, whereas the TilingArray motif is somewhat degenerate at these two positions. Some studies (Resnick-Silverman et al., 1998; Jaiswal and Narayan, 2001) showed p53 to bind to the sequences with mutations in these two positions, indicating that the TilingArray p53 motif might better characterize the p53 binding at these two positions than the Classic p53 motif.

Fig. 4. 10mer sequence logos of p53 motifs from (a) MDscan identified motif from 43 HMM-identified high confidence regions from p53 ChIP-chip experiments on genome tiling arrays; (b) TRANSFAC Professional database; (c) aligned p53 binding sequences from two classic papers (el-Deiry et al., 1992; Funk et al., 1992) defining the p53 consensus binding site.

Table 1. Enrichment of p53 binding motifs

Motif        TFBS                 No. of sites  No. of bases  No. of motifs  % Sites with motif  No. of motifs per 10 kb  Fold-enrichment  Binomial P-value
TilingArray  HMM-High-Confidence  43            31,394        34             40                  10.8                     3.3              2.3e−09
TilingArray  HMM-Full             98            58,734        55             34                  9.4                      2.8              4.8e−12
TilingArray  Cawley's             48            25,994        20             25                  7.7                      2.3              2.1e−04
TRANSFAC     HMM-High-Confidence  43            31,394        73             60                  23.3                     3.5              <2.2e−16
TRANSFAC     HMM-Full             98            58,734        101            42                  17.2                     2.6              <2.2e−16
TRANSFAC     Cawley's             48            25,994        35             31                  13.5                     2.0              3.0e−05
Classic      HMM-High-Confidence  43            31,394        26             35                  8.3                      4.6              2.8e−10
Classic      HMM-Full             98            58,734        36             23                  6.1                      3.4              3.0e−10
Classic      Cawley's             48            25,994        13             17                  5.0                      2.8              6.3e−04

Three p53 motifs (Fig. 4) were mapped to three sets of p53 TFBSs (Table 1 in the Supplementary materials). The number of bases refers to the number of non-repeat nucleotides in fully-repeat-masked TFBSs. A motif-matching scoring function with cutoff of 10, allowing up to 30 bp spacer between the 10mer motif pairs, was used to determine the number of matches in individual TFBS. This cutoff corresponds to 3.3 matches per 10 kb on chromosomes 21 and 22. Fold enrichment was inferred by comparing the motif occurrences in TFBSs with those in fully-repeat-masked chromosomes 21 and 22 sequences. A one-tailed binomial test was used to determine the P-value attached to the motif enrichment. A substantial enrichment of p53 motif was observed in our HMM identified TFBS.

Table 2. Annotation of p53 TFBS locations

TFBS                 No. of sites  CpG island <1 kb  CpG island <3 kb  RefSeq gene <1 kb to 5′  RefSeq gene <1 kb to 3′
HMM-High-Confidence  43            5                 10                1                        1
HMM-Full             98            6                 20                2                        1
Cawley's             48            8                 12                0                        0

3.3 Motif enrichment

p53 was known to bind to two copies of the 10mer palindrome motif with a variable spacer between them (el-Deiry et al., 1992). We mapped the p53 motif occurrences to the fully-repeat-masked sequences of chromosomes 21 and 22, allowing up to 30 bp spacer between the 10mer motif pairs.
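The motif-pair mapping just described can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the per-position score takes the log-ratio form ln((p_ij + p_s)/b_j) with the pseudocount p_s = 0.03 from Section 2.6, takes the better of the two strands for each site, and reports pairs of sites whose summed score reaches the cutoff with a spacer of 0 to 30 bp. The function names and the representation of the motif as a list of per-position dictionaries are hypothetical.

```python
import math

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def strand_score(segment, pwm, background, pseudocount=0.03):
    """Sum of ln((p_ij + p_s) / b_j) over the nucleotides actually present."""
    return sum(math.log((pwm[i][nt] + pseudocount) / background[nt])
               for i, nt in enumerate(segment))

def site_score(segment, pwm, background, pseudocount=0.03):
    """S = max(S+, S-): best of the segment and its reverse complement."""
    revcomp = segment.translate(COMPLEMENT)[::-1]
    return max(strand_score(segment, pwm, background, pseudocount),
               strand_score(revcomp, pwm, background, pseudocount))

def pair_matches(seq, pwm, background, cutoff=10.0, max_spacer=30):
    """Return (start1, start2, score) for motif pairs scoring >= cutoff."""
    w = len(pwm)
    # Skip windows touching masked or ambiguous bases (anything but A, C, G, T).
    scores = {i: site_score(seq[i:i + w], pwm, background)
              for i in range(len(seq) - w + 1)
              if set(seq[i:i + w]) <= set("ACGT")}
    hits = []
    for i, s1 in scores.items():
        for spacer in range(max_spacer + 1):
            j = i + w + spacer
            if j in scores and s1 + scores[j] >= cutoff:
                hits.append((i, j, s1 + scores[j]))
    return hits
```

Dividing the number of hits by the number of non-repeat bases scanned, times 10 000, would give a matches-per-10-kb figure comparable in spirit to the one quoted in Table 1, although overlapping hits are not merged in this sketch.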
A substantial enrichment of the TilingArray motif pairs was observed from our HMM identified TFBSs. Using a match score cutoff of 10 (corresponding to 3.3 matches per 10 kb on chromosomes 21 and 22), TilingArray motif pairs are in 40, 34 and 25% of HMM-High-Confidence, HMM-Full and Cawley's TFBS, respectively, corresponding to 3.3-, 2.8- and 2.3-fold enrichment compared with the genomic background (Table 1). In addition, in the TFBSs lacking TilingArray motif pairs, 10 HMM-High-Confidence and 21 HMM-Full were found to contain at least one copy of the TilingArray motif. The single copy TilingArray motif (not motif pairs) represents a 2.7-fold enrichment compared with the genomic background. One might suspect the validity of this enrichment analysis, since the TilingArray motif itself is extracted from the HMM identified high confidence regions and thus, should be much more enriched. To address this issue, we conducted enrichment analysis using two independent representations of p53 motif (TRANSFAC and Classic, Fig. 4) with the same criteria on the three sets of TFBSs (Table 1). The result showed that the fold enrichments of these two p53 motifs in our HMM identified regions are still much higher than those in Cawley's data.

Note to Table 2: The three sets of TFBSs are the same as in Table 1 in the Supplementary materials. CpG island and RefSeq Gene annotation tracks were retrieved from UCSC Genome Browser on April 2003 Human Genome Assembly (http://genome.ucsc.edu/). 5′ and 3′ referred to the transcription starts and ends in RefSeq Gene track. The distance between TFBS and CpG island or RefSeq Gene is the space between their nearest boundaries.

3.4 Annotation of p53 TFBS

We examined the proximities of p53 TFBSs to traditional transcriptional regulatory regions, such as CpG islands and 5′ upstream or 3′ downstream of RefSeq genes (Table 2). A total of 5, 6 and 8 TFBSs in HMM-High-Confidence, HMM-Full and Cawley's, respectively, are within 1 kb of annotated CpG island. This seems to indicate that more Cawley's TFBSs are proximal to CpG island than those identified by HMM. However, since TFBSs <1 kb in Cawley et al. (2004) were extended equally in both directions to a length of 1 kb, they are more likely to be longer than ours. When the distance between TFBS and CpG island was expanded to 3 kb, we found almost the same number of TFBS in HMM-High-Confidence and in Cawley's TFBSs. It is worth noting that no TFBS of Cawley is <1 kb of 5′ upstream or 3′ downstream of RefSeq gene, whereas three TFBSs identified in HMM are within those regions. The one TFBS found to be within 1 kb of 3′ downstream of RefSeq gene is of special interest, because it suggests the potential antisense transcripts at the 3′ of gene.

We extracted 40 well-documented p53 directly regulated genes from TRANSFAC (Matys et al., 2003) and mapped them to the human genome. Unfortunately, none of them is located along human chromosomes 21 and 22. Recently, two studies (Mirza et al., 2003; Kho et al., 2004) identified thousands of p53-target genes (direct or indirect) from microarray expression analysis. They further narrowed down the potential p53 directly regulated genes to several hundred by finding p53-binding consensus sequence in their regulatory regions (Mirza et al., 2003) or by excluding the genes whose expression is not directly influenced by p53 protein level (Kho et al., 2004). Thirty of these potential p53 directly regulated genes are along chromosomes 21 and 22. We found two TFBSs to be associated with these genes: Blk73, which was not identified by Cawley et al. (2004), is in the intron of known gene PITPNB (protein gi|1060905) with a 2.8-expression-fold change (Mirza et al., 2003) in response to p53; Blk87, which was verified by quantitative PCR in Cawley et al. (2004), is in the ∼13 kb upstream of known gene AB051455 (protein gi|6572156) with −4.1-expression-fold change (Mirza et al., 2003) in response to p53.

3.5 Method evaluation

Our analysis strategy (Fig. 1) differs from the previous study by Cawley et al. (2004) in data normalization, repeat elements filtering and HMM prediction. It is worth checking whether the better performance of binding sequence prediction is due to the HMM prediction or the other two factors. To investigate the real difference between HMM and Mann–Whitney U-test, we performed the Mann–Whitney U-test on our quantile normalized datasets with repetitive probe measurements removed. We used the same criteria from Cawley et al. (2004), i.e. ±500 bp sliding window; probe values were transformed to log2(max(PM−MM, 1)); predicted regions separated by <500 bp were merged together; two sets of prediction (FL–DO1, FL–Input) were further merged together to form a non-redundant set. Using the same P-value cutoff of 1e−5, we identified 65 TFBSs, from which additional 11 TFBSs could be found in HMM-Full besides the previous 24 overlapping TFBSs between HMM-Full and Cawley's data. Most of the 11 additional TFBSs fall just barely below the 1e−5 cutoff. We gradually decreased the P-value cutoff from 1e−5 to 1e−9 to conduct the de novo p53-binding motif searching on the fully-repeat-masked resulting sequences. With none of these cutoffs could MDscan recover a pattern resembling the p53 consensus binding site. As for the motif enrichment study, we used the 1e−5 P-value cutoff and expanded the resulting TFBS equally in each direction to have a minimum length of 750 bp. Therefore, the total number of non-repeat nucleotides in these 65 TFBSs is ∼34 kb, which is comparable with that (∼31 kb) of HMM-High-Confidence. Using three p53 motif representations with the same criteria as in Table 1, we found 7.6, 15.7 and 6.4 motifs per 10 kb for TilingArray, TRANSFAC and Classic p53 motifs, respectively. The motif enrichment is much lower than that in our HMM-High-Confidence regions. All of the above results suggest that HMM contributed greatly to the success of our binding sequence prediction.

4 DISCUSSION

We present a fast, scalable and sensitive HMM approach for analyzing ChIP-chip experiments on genome tiling arrays. The algorithm is fast because replicate datasets at each probe location are considered together in a single HMM run to estimate its enrichment probability. On a 2.4 GHz Xeon server, it takes ∼6 min to run HMM on six ChIP and six control replicates each with ∼1 million probes. Because of its speed, the algorithm is scalable on tiling arrays of either small regions or the whole genome, and flexible enough to be used on other organisms. Furthermore, the algorithm is sensitive in identifying all the regions previously verified by quantitative PCR. The following independent lines of evidence suggest that the 98 TFBSs we identified are likely to be genuine regulatory elements. First, our analysis strategy reported fewer TFBSs residing in non-RepeatMasked tandem repeat and large segmental duplication regions, which are questionable in their regulatory function. Second, a TilingArray motif resembling the TRANSFAC and Classic p53 motif was successfully discovered from the high confidence TFBSs. Finally, matching the TilingArray, TRANSFAC and Classic p53 motifs in the TFBSs reveals substantial enrichment of the motifs compared with the genomic background.

Our algorithm overcomes the intrinsic weakness of the Mann–Whitney U-test used in Cawley et al. (2004). Without enough replicates and long enough window, Mann–Whitney U-test cannot identify enriched regions with a sufficiently small P-value. On account of the high cost of the tiling arrays, often biologists cannot afford more than three replicates. The shearing process in ChIP usually generates fragments <1 kb long. Enforcing a 1 kb window may not only miss short ChIP-enriched regions but also include sequences outside the enriched regions which may confound later motif detection. Mann–Whitney U-test treats every probe equally and fails to consider probe-by-probe variability. The statistic U does not reflect whether probe values are fluctuating a lot or continuously high within the window; only the latter indicates ChIP-enriched TFBS.

Our HMM approach is based on normalizing data on tiling arrays from many experiments and modeling the behavior of each individual probe. One laboratory may not necessarily have enough ChIP-chip replicates, but pooling together freely available tiling array data from many different laboratories and different experiments (e.g. ChIP-chip against different TFs or different tissues) can provide enough information for the modeling requirement. Probe behavior estimated from pooled experiments serves as a baseline for each ChIP experiment to be compared with, thus HMM can even get reasonable results from a single ChIP experiment. This is particularly useful for surveying at the beginning of a ChIP experiment to explore antibody quality, culture conditions or preliminary quality assessment of each replicate. Our HMM approach calculates enrichment probability at each probe location, so the exact enriched region boundaries could be determined. Furthermore, the sensitive HMM approach can identify short ChIP-enriched sequence. If we exclude those TFBS containing tandem repeats, the average length (771 bp) of the HMM-High-Confidence TFBSs not identified by the Mann–Whitney U-test is much shorter than the average length (984 bp) of HMM-High-Confidence TFBSs identified by Cawley et al. (2004).

Our analysis seems to reveal two interesting characteristics about p53 regulation. First, p53 binding may not simply be the classic two 10mer palindrome motif separated by a variable length spacer of 0–13 bp. Many HMM identified TFBSs contain only a single copy of the TilingArray motif or multiple copies that are >30 bp apart. In almost half (43) of the HMM-Full TFBSs, including four quantitative PCR verified regions, this motif does not occur at all. The absence of known p53 motif or motif pairs in TFBS suggests that there might be an alternative binding mechanism (Yin et al., 2003) for p53. Second, p53 binding sites may not be conserved between human and rodents. The average sequence identity for all the 98 HMM-Full TFBSs is only ∼30% based on the BLASTZ global alignment (Schwartz et al., 2003) between human (hg15) and rodents (either mouse mm3 or rat rn2). Even the average sequence identity of all the quantitative PCR verified regions is only 42%. Furthermore, 37 TFBSs, including one quantitative PCR verified region, have no rodent counterparts at all. Many algorithms for finding eukaryotic regulatory elements rely on identifying conserved sequences from alignment of orthologous non-coding sequences (Loots et al., 2002; Liu et al., 2004). Our result suggests that the loss of sensitivity of this approach may be significant. The majority of the TFBSs identified by ChIP-chip experiments on tiling arrays will be sacrificed if sequence identity is included as a criterion.

In summary, the HMM approach presented here provides biologists with an efficient computational approach for analyzing the massive data generated from ChIP-chip on genome tiling arrays. Its adoption will contribute to the more comprehensive models and understanding of gene regulatory networks in higher eukaryotes.

ACKNOWLEDGEMENTS

The authors would like to thank David Harrington, Chen Li, Molin Wang and Jun S. Liu for their insights and advice on the tiling array data analysis. The project was partly supported by Claudia Adams Barr Program in Cancer Research.

REFERENCES

Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580.

Bolstad,B.M., Irizarry,R.A., Astrand,M. and Speed,T.P. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics, 19, 185–193.

Cawley,S., Bekiranov,S., Ng,H.H., Kapranov,P., Sekinger,E.A., Kampa,D., Piccolboni,A., Sementchenko,V., Cheng,J., Williams,A.J. et al. (2004) Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell, 116, 499–509.

Conlon,E.M., Liu,X.S., Lieb,J.D. and Liu,J.S. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA, 100, 3339–3344.

el-Deiry,W.S., Kern,S.E., Pietenpol,J.A., Kinzler,K.W. and Vogelstein,B. (1992) Definition of a consensus binding site for p53. Nat. Genet., 1, 45–49.

Euskirchen,G., Royce,T.E., Bertone,P., Martone,R., Rinn,J.L., Nelson,F.K., Sayward,F., Luscombe,N.M., Miller,P., Gerstein,M., Weissman,S. and Snyder,M. (2004) CREB binds to multiple loci on human chromosome 22. Mol. Cell. Biol., 24, 3804–3814.

Funk,W.D., Pak,D.T., Karas,R.H., Wright,W.E. and Shay,J.W. (1992) A transcriptionally active DNA-binding site for human p53 protein complexes. Mol. Cell. Biol., 12, 2866–2871.

Harbison,C.T., Gordon,D.B., Lee,T.I., Rinaldi,N.J., Macisaac,K.D., Danford,T.W., Hannett,N.M., Tagne,J.B., Reynolds,D.B., Yoo,J. et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99–104.

International Human Genome Sequencing Consortium (2004) Finishing the euchromatic sequence of the human genome. Nature, 431, 931–945.

Jaiswal,A.S. and Narayan,S. (2001) p53-dependent transcriptional regulation of the APC promoter in colon cancer cells treated with DNA alkylating agents. J. Biol. Chem., 276, 18193–18199.

Kapranov,P., Cawley,S.E., Drenkow,J., Bekiranov,S., Strausberg,R.L., Fodor,S.P. and Gingeras,T.R. (2002) Large-scale transcriptional activity in chromosomes 21 and 22. Science, 296, 916–919.

Kho,P.S., Wang,Z., Zhuang,L., Li,Y., Chew,J.L., Ng,H.H., Liu,E.T. and Yu,Q. (2004) p53-regulated transcriptional program associated with genotoxic stress-induced apoptosis. J. Biol. Chem., 279, 21183–21192.

Li,Z., Van Calcar,S., Qu,C., Cavenee,W.K., Zhang,M.Q. and Ren,B. (2003) A global transcriptional regulatory role for c-Myc in Burkitt's lymphoma cells. Proc. Natl Acad. Sci. USA, 100, 8164–8169.

Lieb,J.D., Liu,X., Botstein,D. and Brown,P.O. (2001) Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat. Genet., 28, 327–334.

Lieberfarb,M.E., Lin,M., Lechpammer,M., Li,C., Tanenbaum,D.M., Febbo,P.G., Wright,R.L., Shim,J., Kantoff,P.W., Loda,M., Meyerson,M. and Sellers,W.R. (2003) Genome-wide loss of heterozygosity analysis from laser capture microdissected prostate cancer using single nucleotide polymorphic allele (SNP) arrays and a novel bioinformatics platform dChipSNP. Cancer Res., 63, 4781–4785.

Liu,X.S., Brutlag,D.L. and Liu,J.S. (2002) An algorithm for finding protein–DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments. Nat. Biotechnol., 20, 835–839.

Liu,Y., Liu,X.S., Wei,L., Altman,R.B. and Batzoglou,S. (2004) Eukaryotic regulatory element conservation analysis and identification using comparative genomics. Genome Res., 14, 451–458.

Loots,G.G., Ovcharenko,I., Pachter,L., Dubchak,I. and Rubin,E.M. (2002) rVista for comparative sequence-based discovery of functional transcription factor binding sites. Genome Res., 12, 832–839.

Matys,V., Fricke,E., Geffers,R., Gossling,E., Haubrock,M., Hehl,R., Hornischer,K., Karas,D., Kel,A.E., Kel-Margoulis,O.V. et al. (2003) TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res., 31, 374–378.

Mirza,A., Wu,Q., Wang,L., McClanahan,T., Bishop,W.R., Gheyas,F., Ding,W., Hutchins,B., Hockenberry,T., Kirschmeier,P., Greene,J.R. and Liu,S. (2003) Global transcriptional program of p53 target genes during the process of apoptosis and cell cycle progression. Oncogene, 22, 3645–3654.

Nakamura,Y., Koyama,K. and Matsushima,M. (1998) VNTR (variable number of tandem repeat) sequences as transcriptional, translational, or functional regulators. J. Hum. Genet., 43, 149–152.

Odom,D.T., Zizlsperger,N., Gordon,D.B., Bell,G.W., Rinaldi,N.J., Murray,H.L., Volkert,T.L., Schreiber,J., Rolfe,P.A., Gifford,D.K. et al. (2004) Control of pancreas and liver gene expression by HNF transcription factors. Science, 303, 1378–1381.

Rabiner,L. (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77, 257–286.

Ren,B., Robert,F., Wyrick,J.J., Aparicio,O., Jennings,E.G., Simon,I., Zeitlinger,J., Schreiber,J., Hannett,N., Kanin,E. et al. (2000) Genome-wide location and function of DNA binding proteins. Science, 290, 2306–2309.

Resnick-Silverman,L., St Clair,S., Maurer,M., Zhao,K. and Manfredi,J.J. (1998) Identification of a novel class of genomic DNA-binding sites suggests a mechanism for selectivity in target gene activation by the tumor suppressor protein p53. Genes Dev., 12, 2102–2107.

Schwartz,S., Kent,W.J., Smit,A., Zhang,Z., Baertsch,R., Hardison,R.C., Haussler,D. and Miller,W. (2003) Human–mouse alignments with BLASTZ. Genome Res., 13, 103–107.

Wells,J., Yan,P.S., Cechvala,M., Huang,T. and Farnham,P.J. (2003) Identification of novel pRb binding sites using CpG microarrays suggests that E2F recruits pRb to specific genomic sites during S phase. Oncogene, 22, 1445–1460.

Workman,C., Jensen,L.J., Jarmer,H., Berka,R., Gautier,L., Nielser,H.B., Saxild,H.H., Nielsen,C., Brunak,S. and Knudsen,S. (2002) A new non-linear normalization method for reducing variability in DNA microarray experiments. Genome Biol., 3, research0048.

Yin,Y., Liu,Y.X., Jin,Y.J., Hall,E.J. and Barrett,J.C. (2003) PAC1 phosphatase is a transcription target of p53 in signalling apoptosis and growth suppression. Nature, 422, 527–531.

Journal

Bioinformatics – Oxford University Press

Published: Jun 1, 2005
