CNN-Peaks: ChIP-Seq peak detection pipeline using convolutional neural networks that imitate human visual inspection

ChIP-seq is one of the core experimental resources for understanding genome-wide epigenetic interactions and identifying the functional elements associated with diseases. Analyzing ChIP-seq data is important but computationally difficult because of irregular noise and bias at various levels. Although many peak-calling methods have been developed, current computational tools still require, in some cases, manual inspection of visualized data. However, the huge volume of ChIP-seq data makes it almost impossible for human researchers to uncover all the peaks manually. Convolutional neural networks (CNNs), which can achieve human-like classification accuracy, can be applied to this challenging problem. In this study, we design a novel supervised learning approach for identifying ChIP-seq peaks using CNNs and integrate it into a software pipeline called CNN-Peaks. As training data for our model, we use data labeled by human researchers, who annotate the presence or absence of peaks in selected genomic segments. The trained model is then applied to predict peaks in previously unseen genomic segments from multiple ChIP-seq datasets, including benchmark datasets commonly used to validate peak-calling methods. We observe performance superior to that of previous methods.
