An effective statistical evaluation of ChIPseq dataset similarity

MOTIVATION: ChIPseq is rapidly becoming a common technique for investigating protein-DNA interactions. However, results from individual experiments provide a limited understanding of chromatin structure, as various chromatin factors cooperate in complex ways to orchestrate transcription. To quantify chromatin interactions, it is thus necessary to devise a robust similarity metric applicable to ChIPseq data. Unfortunately, moving past simple overlap calculations to give statistically rigorous comparisons of ChIPseq datasets often involves arbitrary choices of distance metrics, with significance estimated by computationally intensive permutation tests whose statistical power may be sensitive to non-biological experimental and post-processing variation.

RESULTS: We show that it is in fact possible to compare ChIPseq datasets through the efficient computation of exact P-values for proximity. Our method is insensitive to non-biological variation in datasets, such as peak width, and can rigorously model peak location biases by evaluating similarity conditioned on a restricted set of genomic regions (such as the mappable genome or promoter regions). Applying our method to the well-studied dataset of Chen et al. (2008), we elucidate novel interactions that conform well with our biological understanding. By comparing ChIPseq data in an asymmetric way, we are able to observe clear interaction differences between cofactors such as p300 and factors that bind DNA directly.

AVAILABILITY: Source code is available for download at http://sonorus.princeton.edu/IntervalStats/IntervalStats.tar.gz.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
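As an illustration only, the notion of an exact proximity P-value conditioned on a restricted set of regions can be sketched by brute force. The sketch below is not the authors' IntervalStats implementation. It assumes integer half-open intervals, assumes that the P-value of a query peak is the fraction of all placements of an equally wide interval inside the allowed domain that lie at least as close to the nearest reference peak as the observed one, and uses hypothetical names (`gap`, `nearest_gap`, `proximity_p_value`) chosen for this example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), integer coordinates


def gap(a: Interval, b: Interval) -> int:
    """Distance between two intervals; 0 if they overlap or touch."""
    return max(0, a[0] - b[1], b[0] - a[1])


def nearest_gap(query: Interval, refs: List[Interval]) -> int:
    """Distance from the query to its closest reference interval."""
    return min(gap(query, r) for r in refs)


def proximity_p_value(query: Interval,
                      refs: List[Interval],
                      domain: List[Interval]) -> float:
    """Exact P-value by enumeration (illustrative assumption, see lead-in).

    Counts every placement of an interval with the query's width that fits
    entirely inside one domain interval, and returns the fraction whose
    distance to the nearest reference is <= the observed distance.
    """
    width = query[1] - query[0]
    observed = nearest_gap(query, refs)
    total = 0
    as_close = 0
    for d_start, d_end in domain:
        # Valid starts keep the whole interval inside [d_start, d_end).
        for start in range(d_start, d_end - width + 1):
            total += 1
            if nearest_gap((start, start + width), refs) <= observed:
                as_close += 1
    if total == 0:
        raise ValueError("query does not fit in any domain interval")
    return as_close / total


if __name__ == "__main__":
    # Toy data: one reference peak, one query peak, one 100 bp domain.
    p = proximity_p_value(query=(40, 45), refs=[(50, 60)], domain=[(0, 100)])
    print(f"p = {p:.4f}")
```

Because the count is taken over every admissible placement rather than over random shuffles, the value is exact for the stated null and needs no permutation. Restricting `domain` to, say, promoter regions changes the denominator, which is one way to read "conditioned on a restricted set of genomic regions". Swapping the roles of query and reference generally gives a different value, so a comparison of this kind is asymmetric.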

[1] Zhiping Weng, et al. Statistical analysis of the genomic distribution and correlation of regulatory elements in the ENCODE regions, 2007, Genome Research.

[2] P. Park. ChIP–seq: advantages and challenges of a maturing technology, 2009, Nature Reviews Genetics.

[3] Janet Rossant, et al. Distinct histone modifications in stem cell lines and tissue lineages from the early mouse embryo, 2010, Proceedings of the National Academy of Sciences.

[4] Steven Russell, et al. On the use of resampling tests for evaluating statistical significance of binding-site co-occurrence, 2009, BMC Bioinformatics.

[5] Thomas Zeng, et al. Global analysis of in vivo Foxa2-binding sites in mouse adult liver using massively parallel sequencing, 2008, Nucleic Acids Research.

[6] Stuart H. Orkin, et al. A protein interaction network for pluripotency of embryonic stem cells, 2006, Nature.

[7] A. Mortazavi, et al. Genome-Wide Mapping of in Vivo Protein-DNA Interactions, 2007, Science.

[8] Manolis Kellis, et al. Dynamics of the epigenetic landscape during erythroid differentiation after GATA1 restoration, 2011, Genome Research.

[9] M. Eisen, et al. Impact of Chromatin Structures on DNA Processing for Genomic Analyses, 2009, PLoS ONE.

[10] M. Zajac-Kaye, et al. Myc oncogene: a key component in cell cycle regulation and its implication for lung cancer, 2001, Lung Cancer.

[11] G. Galbraith, et al. In vitro reprogramming of fibroblasts into a pluripotent ES-cell-like state, 2008.

[12] Inanc Birol, et al. Locus co-occupancy, nucleosome positioning, and H3K4me1 regulate the functionality of FOXA2-, HNF4A-, and PDX1-bound loci in islets and liver, 2010, Genome Research.

[13] Ralf Janknecht, et al. Transcriptional control: Versatile molecular glue, 1996, Current Biology.

[14] R. Janknecht, et al. Versatile molecular glue. Transcriptional control, 1996, Current Biology.

[15] Yuuki Kodama, et al. Generation of induced pluripotent stem cells by efficient reprogramming of adult bone marrow cells, 2010, Stem Cells and Development.

[16] Ole Winther, et al. Multivariate Hawkes process models of the occurrence of regulatory elements, 2010, BMC Bioinformatics.

[17] Audrey Qiuyan Fu, et al. Scoring overlapping and adjacent signals from genome-wide ChIP and DamID assays, 2009, Molecular BioSystems.

[18] Michael F. Lin, et al. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals, 2009, Nature.

[19] R. Jaenisch, et al. In vitro reprogramming of fibroblasts into a pluripotent ES-cell-like state, 2007, Nature.

[20] A. Mortazavi, et al. Computation for ChIP-seq and RNA-seq studies, 2009, Nature Methods.

[21] J. Nevins, et al. A role for Myc in facilitating transcription activation by E2F1, 2008, Oncogene.

[22] T. Mikkelsen, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells, 2007, Nature.

[23] Guangjin Pan, et al. Nanog and transcriptional networks in embryonic stem cell pluripotency, 2007, Cell Research.

[24] Dustin E. Schones, et al. Global analysis of the insulator binding protein CTCF in chromatin barrier regions reveals demarcation of active and repressive domains, 2008, Genome Research.

[25] W. Wong, et al. ChIP-Seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells, 2009, Proceedings of the National Academy of Sciences.

[26] Fred H. Gage, et al. Nanog binds to Smad1 and blocks bone morphogenetic protein-induced differentiation of embryonic stem cells, 2006, Proceedings of the National Academy of Sciences.

[27] N. D. Clarke, et al. Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells, 2008, Cell.

[28] Stephen Dalton, et al. The cell cycle and Myc intersect with mechanisms that regulate pluripotency and reprogramming, 2009, Cell Stem Cell.