A varying threshold method for ChIP peak-calling using multiple sources of information

Motivation: Gene regulation commonly involves interactions among DNA, proteins and biochemical conditions. With chromatin immunoprecipitation (ChIP) technologies, protein–DNA interactions are routinely detected at the genome scale. Developing computational methods that detect weak protein-binding signals while maintaining high specificity remains challenging. An attractive approach is to incorporate biologically relevant data, such as protein co-occupancy, to improve the power of protein-binding detection. We refer to these additional data related to target protein binding as supporting tracks.

Results: We propose a novel and rigorous statistical method to identify protein occupancy in ChIP data using multiple supporting tracks (PASS2). We demonstrate that utilizing biologically related information can significantly increase the discovery of true protein-binding sites while still maintaining a desired level of false positive calls. Applying the method to data from GATA1 restoration in a mouse erythroid cell line, we detected many new GATA1-binding sites using GATA1 co-occupancy data.

Availability: http://stat.psu.edu/~yuzhang/pass2.tar

Contact: yuzhang@stat.psu.edu
