Accurate prediction of single-cell DNA methylation states using deep learning

Recent technological advances have enabled DNA methylation to be assayed at single-cell resolution. Current protocols are limited by incomplete CpG coverage, so methods that predict missing methylation states are critical for genome-wide analyses. Here, we report DeepCpG, a computational approach based on deep neural networks that predicts DNA methylation states in single cells from DNA sequence and incomplete methylation profiles. We evaluated DeepCpG on single-cell methylation data from five cell types generated using alternative sequencing protocols and found that it yields substantially more accurate predictions than previous methods. Additionally, we show that the parameters of our model can be interpreted, thereby providing insights into the effect of sequence composition on methylation variability.
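To make the two kinds of input named above more concrete, the following is a minimal sketch of how a DNA sequence window and an incomplete methylation profile might be turned into numeric features for one target CpG site. The abstract does not specify the encoding, so everything here is an assumption for illustration: the function names `one_hot` and `neighbour_states`, the choice of one-hot encoding, and the use of the nearest observed CpG sites with their distances are hypothetical and are not taken from the DeepCpG implementation.

```python
from typing import Dict, List, Tuple

BASES = "ACGT"


def one_hot(seq: str) -> List[List[int]]:
    """Encode a DNA string as one row of four indicators (A, C, G, T) per base.

    Bases outside ACGT (for example N) become an all-zero row.
    This encoding is an assumed, commonly used representation.
    """
    return [[int(base == b) for b in BASES] for base in seq.upper()]


def neighbour_states(
    profile: Dict[int, int], target: int, k: int
) -> List[Tuple[int, int]]:
    """Collect (state, distance) pairs for observed CpG sites near a target.

    `profile` maps the genomic position of each *observed* CpG site to its
    binary methylation state (1 = methylated, 0 = unmethylated). Sites that
    were not covered by sequencing are simply absent, which models the
    incomplete coverage. Up to `k` nearest observed sites are taken on each
    side of `target`, ordered by position.
    """
    left = sorted((p for p in profile if p < target), reverse=True)[:k]
    right = sorted(p for p in profile if p > target)[:k]
    return [(profile[p], abs(p - target)) for p in left[::-1] + right]


# Hypothetical toy data: three observed CpG sites around an unobserved target.
toy_profile = {100: 1, 140: 0, 310: 1}
sequence_features = one_hot("ACGN")
methylation_features = neighbour_states(toy_profile, target=200, k=2)
```

In this sketch, a predictive model would receive `sequence_features` and `methylation_features` together and output a methylation probability for the target site; the model itself is deliberately left out because the abstract gives no architectural details.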
