A hidden Markov random field-based Bayesian method for the detection of long-range chromosomal interactions in Hi-C data

MOTIVATION: Advances in chromosome conformation capture and next-generation sequencing technologies are enabling genome-wide investigation of dynamic chromatin interactions. For example, Hi-C experiments generate genome-wide contact frequencies between pairs of loci by sequencing DNA segments ligated from loci in close spatial proximity. One essential task in such studies is peak calling, that is, detecting non-random interactions between loci from the two-dimensional contact frequency matrix. Accurate peak calling has important implications, including the identification of long-range interactions that help interpret a sizable fraction of the results from genome-wide association studies. The task, distinguishing biologically meaningful chromatin interactions from massive numbers of random interactions, poses great challenges both statistically and computationally. Model-based methods to address this challenge are still lacking. In particular, no statistical model exists that takes the underlying dependency structure into consideration.

RESULTS: In this paper, we propose a hidden Markov random field (HMRF) based Bayesian method to rigorously model interaction probabilities in the two-dimensional space based on the contact frequency matrix. By borrowing information from neighboring loci pairs, our method demonstrates superior reproducibility and statistical power in both simulation studies and real data analysis.

AVAILABILITY AND IMPLEMENTATION: The source code can be downloaded at: http://www.unc.edu/~yunmli/HMRFBayesHiC

CONTACT: ming.hu@nyumc.org or yunli@med.unc.edu

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
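
To illustrate what "borrowing information from neighboring loci pairs" can mean in a hidden Markov random field, the following is a minimal sketch and not the model described in the paper. It assumes an Ising-type prior on hidden peak/background labels (coded +1/-1) and Poisson-distributed contact counts; the function name `peak_posterior` and the parameters `mu_bg`, `mu_peak` and `psi` are hypothetical and chosen only for illustration. Under these assumptions, the conditional log-odds that a loci pair is a peak equal the Poisson log-likelihood ratio plus `2 * psi` times the sum of its neighbors' labels.

```python
import math


def peak_posterior(count, neighbor_states, mu_bg=5.0, mu_peak=15.0, psi=0.5):
    """Conditional probability that one loci pair is a peak (label +1).

    Illustrative assumptions, not taken from the paper:
      - count ~ Poisson(mu_peak) for a peak, Poisson(mu_bg) for background
      - labels follow an Ising prior with interaction strength psi
      - neighbor_states holds the current +1/-1 labels of adjacent loci pairs
    """
    def loglik(mu):
        # Poisson log-likelihood without log(count!), which cancels in the ratio
        return count * math.log(mu) - mu

    neighbor_sum = sum(neighbor_states)
    # Ising conditional: log P(z=+1 | nbrs) - log P(z=-1 | nbrs) = 2 * psi * sum
    log_odds = loglik(mu_peak) - loglik(mu_bg) + 2.0 * psi * neighbor_sum
    return 1.0 / (1.0 + math.exp(-log_odds))


# The same observed count is judged differently depending on its neighborhood.
p_supported = peak_posterior(10, [1, 1, 1, 1])
p_isolated = peak_posterior(10, [-1, -1, -1, -1])
```

In a Gibbs sampler, a conditional of this form would be evaluated repeatedly for every loci pair in the contact matrix; setting `psi = 0` removes the spatial dependency and reduces the rule to an independent two-component mixture.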
