Probabilistic method corrects previously uncharacterized Hi-C artifact

Three-dimensional chromosomal structure plays an important role in gene regulation. Chromosome conformation capture techniques, especially the high-throughput, sequencing-based technique Hi-C, provide new insights into the spatial architecture of chromosomes. However, Hi-C data contain artifacts and systematic biases that substantially influence downstream analysis. Some computational models address these biases explicitly, but it is difficult to enumerate and eliminate every bias within a model. Other models correct biases implicitly, but they break down in some situations, such as in the presence of copy number variations. We characterize a new kind of artifact in Hi-C data. This artifact is caused by incorrect alignment of Hi-C reads to approximate repeat regions and can lead to erroneous chromatin contact signals. It cannot be corrected by current Hi-C correction methods. We design a probabilistic method to correct it and develop a new Hi-C processing pipeline by integrating this method with the HiC-Pro pipeline. The new pipeline removes the artifact effectively while preserving important features of the original Hi-C matrices.
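
To make the notion of implicit bias correction concrete, the following is a minimal sketch of iterative matrix balancing, one member of that family of methods. It is not the probabilistic method proposed here, and it is not the implementation of any particular tool. The function name `balance`, the toy matrix, and the bias values are assumptions chosen for illustration only.

```python
def balance(matrix, iterations=50):
    """Iteratively rescale a symmetric contact matrix toward equal row sums.

    Illustrative sketch of implicit bias correction by matrix balancing.
    Returns the balanced matrix and the accumulated per-bin bias factors.
    """
    n = len(matrix)
    m = [row[:] for row in matrix]          # work on a copy
    bias = [1.0] * n
    for _ in range(iterations):
        sums = [sum(row) for row in m]
        nonzero = [s for s in sums if s > 0]
        mean = sum(nonzero) / len(nonzero)
        # Per-bin scale factor: row sum relative to the mean row sum.
        # Empty bins are left untouched (factor 1).
        scale = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                m[i][j] /= scale[i] * scale[j]
        bias = [b * s for b, s in zip(bias, scale)]
    return m, bias


# Toy example (hypothetical numbers): a "true" matrix with equal row sums,
# distorted by a multiplicative per-bin bias.
true_contacts = [[4.0, 2.0, 1.0],
                 [2.0, 3.0, 2.0],
                 [1.0, 2.0, 4.0]]
bin_bias = [1.0, 0.5, 2.0]
observed = [[bin_bias[i] * bin_bias[j] * true_contacts[i][j]
             for j in range(3)] for i in range(3)]

balanced, estimated_bias = balance(observed)
row_sums = [sum(row) for row in balanced]
```

In this toy case the balanced matrix is proportional to `true_contacts`, because the distortion is purely a product of per-bin factors. The same procedure forces every bin to the same total coverage, which is the assumption that fails when copy number differs between regions. It also cannot distinguish correctly placed reads from reads assigned to the wrong repeat copy.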
