Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks

The complex language of eukaryotic gene expression remains incompletely understood. Although many noncoding variants are statistically associated with human disease, which suggests their importance, nearly all of them act through unknown mechanisms. Here, we address this challenge using an approach based on a recent machine learning advance: deep convolutional neural networks (CNNs). We introduce the open-source package Basset, which applies CNNs to learn the functional activity of DNA sequences from genomics data. We trained Basset on a compendium of accessible genomic sites mapped in 164 cell types by DNase-seq and demonstrate greater predictive accuracy than previous methods. Basset predictions for the change in accessibility between variant alleles were far greater for genome-wide association study (GWAS) SNPs that are likely to be causal than for nearby SNPs in linkage disequilibrium with them. With Basset, a researcher can perform a single sequencing assay in their cell type of interest and simultaneously learn that cell's chromatin accessibility code and annotate every mutation in the genome with its influence on present accessibility and latent potential for accessibility. Thus, Basset offers a powerful computational approach to annotate and interpret the noncoding genome.
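The variant comparison described above can be illustrated with a minimal Python sketch. It is not Basset's code or API: the names `one_hot`, `toy_model`, and `allele_score_difference` are hypothetical, and `toy_model` is a deliberately trivial stand-in (GC fraction) for a trained CNN. The sketch only assumes that a model maps a one-hot encoded DNA sequence to one accessibility score per cell type, and that a variant is scored as the difference between the predictions for the alternate and reference alleles.

```python
from typing import Callable, List

BASES = "ACGT"
Matrix = List[List[float]]


def one_hot(seq: str) -> Matrix:
    """Encode a DNA string as a len(seq) x 4 matrix (columns A, C, G, T).

    Bases outside ACGT (for example N) get a uniform 0.25 in every column;
    this is an illustrative choice, not a documented convention.
    """
    rows = []
    for base in seq.upper():
        if base in BASES:
            rows.append([1.0 if base == b else 0.0 for b in BASES])
        else:
            rows.append([0.25] * 4)
    return rows


def toy_model(x: Matrix) -> List[float]:
    """Hypothetical stand-in for a trained network.

    Returns a single fake 'accessibility' score (the GC fraction), i.e. one
    output for one imaginary cell type. A real model would return one score
    per cell type it was trained on.
    """
    gc = sum(row[1] + row[2] for row in x)
    return [gc / len(x)]


def allele_score_difference(
    model: Callable[[Matrix], List[float]],
    seq: str,
    pos: int,
    ref: str,
    alt: str,
) -> List[float]:
    """Score a SNP as model(alt sequence) - model(ref sequence), per output."""
    if seq[pos].upper() != ref.upper():
        raise ValueError("reference allele does not match the sequence")
    alt_seq = seq[:pos] + alt + seq[pos + 1:]
    ref_pred = model(one_hot(seq))
    alt_pred = model(one_hot(alt_seq))
    return [a - r for r, a in zip(ref_pred, alt_pred)]


if __name__ == "__main__":
    # A>G substitution at position 1 of a toy 4 bp sequence.
    print(allele_score_difference(toy_model, "AATT", 1, "A", "G"))
```

Under these assumptions, a large absolute difference for one SNP compared with its neighbors in linkage disequilibrium is the kind of signal the abstract refers to; swapping `toy_model` for a real predictor leaves the comparison logic unchanged.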
