De novo identification of replication-timing domains in the human genome by deep learning

Abstract

Motivation: The de novo identification of initiation and termination zones (regions that replicate earlier or later, respectively, than their upstream and downstream neighbours) remains a key challenge in the study of DNA replication.

Results: Building on advances in deep learning, we developed a hybrid architecture that combines a pre-trained deep neural network with a hidden Markov model (DNN-HMM) for the de novo identification of replication domains from replication timing profiles. Our results demonstrate that DNN-HMM significantly outperforms strong, discriminatively trained Gaussian mixture model–HMM (GMM-HMM) systems and six other reported methods that can be applied to this problem. We applied our trained DNN-HMM to identify four distinct replication domain types, namely the early replication domain (ERD), the down transition zone (DTZ), the late replication domain (LRD) and the up transition zone (UTZ), using newly replicated DNA sequencing (Repli-Seq) data across 15 human cell types. A subsequent integrative analysis revealed that these replication domains each harbour unique genomic and epigenetic patterns, transcriptional activity and higher-order chromosomal structure. Our findings support the 'replication-domain' model, which states (1) that ERDs and LRDs, connected by UTZs and DTZs, are spatially compartmentalized structural and functional units of higher-order chromosomal structure, (2) that adjacent DTZ-UTZ pairs form chromatin loops and (3) that interactions within ERDs tend to be short-range, whereas those within LRDs tend to be long-range. Our model reveals an important chromatin organizational principle of the human genome and represents a critical step towards understanding the mechanisms that regulate replication timing.

Availability and implementation: Our DNN-HMM method and three additional algorithms are freely available at https://github.com/wenjiegroup/DNN-HMM. The replication domain regions identified in this study are available in GEO under accession GSE53984.

Contact: shuwj@bmi.ac.cn or boxc@bmi.ac.cn

Supplementary information: Supplementary data are available at Bioinformatics online.
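
Illustrative sketch (not the authors' implementation, which is in the repository above): the abstract does not spell out how the DNN and the HMM are coupled, so the following assumes the standard hybrid formulation from speech recognition, in which per-bin state posteriors from a network are divided by state priors to give scaled likelihoods and a Viterbi pass then produces the domain segmentation. The function names, the cyclic transition order ERD, DTZ, LRD, UTZ, and the `stay` probability are all assumptions made for this example, and the posteriors are supplied by the caller rather than computed by a trained network.

```python
import numpy as np

# The four domain types named in the abstract.
STATES = ["ERD", "DTZ", "LRD", "UTZ"]


def viterbi(log_emis, log_trans, log_init):
    """Return the most likely state index path for a (T x S) log-score matrix."""
    n_bins, n_states = log_emis.shape
    score = log_init + log_emis[0]
    back = np.zeros((n_bins, n_states), dtype=int)
    for t in range(1, n_bins):
        # cand[i, j]: best score ending in state i at t-1, then moving to j.
        cand = score[:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emis[t]
    # Trace back from the best final state.
    path = [int(score.argmax())]
    for t in range(n_bins - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]


def decode_domains(posteriors, priors, stay=0.9):
    """Segment genomic bins into domains from per-bin state posteriors.

    posteriors: (T x 4) array, assumed to come from a trained classifier.
    priors:     length-4 array of state frequencies in the training data.
    stay:       assumed self-transition probability (hypothetical value).
    """
    eps = 1e-12
    # Hybrid DNN-HMM step: p(x | s) is proportional to p(s | x) / p(s).
    log_emis = np.log(posteriors + eps) - np.log(priors + eps)
    n_states = len(STATES)
    # Assumed topology: each state either persists or moves to the next
    # state in the cycle ERD -> DTZ -> LRD -> UTZ -> ERD.
    trans = np.full((n_states, n_states), eps)
    for i in range(n_states):
        trans[i, i] = stay
        trans[i, (i + 1) % n_states] = 1.0 - stay
    log_trans = np.log(trans / trans.sum(axis=1, keepdims=True))
    log_init = np.log(np.full(n_states, 1.0 / n_states))
    return [STATES[k] for k in viterbi(log_emis, log_trans, log_init)]
```

Under these assumptions, calling `decode_domains(posteriors, priors)` on a chromosome's worth of bins returns one label per bin; runs of identical labels can then be merged into domain intervals. The transition structure is what lets the HMM suppress isolated misclassified bins that a per-bin classifier alone would report.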
