Basset: Learning the regulatory code of the accessible genome with deep convolutional neural networks

The complex language of eukaryotic gene expression remains incompletely understood. As a result, most of the many noncoding variants statistically associated with human disease have unknown mechanisms. Here, we address this challenge using an approach based on a recent machine learning advance: deep convolutional neural networks (CNNs). We introduce an open source package, Basset (https://github.com/davek44/Basset), that applies deep CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNaseI-seq. Basset predictions for the change in accessibility between two variant alleles were far greater for GWAS SNPs that are likely to be causal than for nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell’s chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
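
The idea of scoring a variant by the predicted change in accessibility between its two alleles can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not Basset's code or API: `one_hot`, `toy_predict`, and `allele_difference` are hypothetical names, and the "model" is a single hand-written convolution filter with made-up weights standing in for a trained deep CNN.

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (4, L) array; non-ACGT bases become all-zero columns."""
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        j = BASES.find(base)
        if j >= 0:
            x[j, i] = 1.0
    return x

def toy_predict(x, filt, bias=0.0):
    """Hypothetical stand-in for a trained model: one convolution filter,
    global max pooling, and a sigmoid. This is not Basset's architecture."""
    width = filt.shape[1]
    scores = [np.sum(x[:, i:i + width] * filt)
              for i in range(x.shape[1] - width + 1)]
    return 1.0 / (1.0 + np.exp(-(max(scores) + bias)))

def allele_difference(ref_seq, pos, alt_base, predict):
    """Predicted accessibility of the alternate allele minus the reference allele."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return predict(one_hot(alt_seq)) - predict(one_hot(ref_seq))

# Illustrative filter that rewards the motif "TGA"; the weights are made up.
motif_filter = one_hot("TGA") * 2.0 - 1.0

def predict(x):
    return toy_predict(x, motif_filter, bias=-2.0)

ref = "ACGTTGCACG"
# Substituting A for C at 0-based index 6 creates a "TGA" match in this toy example.
delta = allele_difference(ref, 6, "A", predict)
```

In this toy setting, a positive `delta` means the alternate allele is predicted to be more accessible than the reference. With a real trained model, `predict` would be replaced by that model's forward pass for the cell type of interest.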
