Multi-scale deep tensor factorization learns a latent representation of the human epigenome

The human epigenome has been experimentally characterized by measurements of protein binding, chromatin acessibility, methylation, and histone modification in hundreds of cell types. The result is a huge compendium of data, consisting of thousands of measurements for every basepair in the human genome. These data are difficult to make sense of, not only for humans, but also for computational methods that aim to detect genes and other functional elements, predict gene expression, characterize polymorphisms, etc. To address this challenge, we propose a deep neural network tensor factorization method, Avocado, that compresses epigenomic data into a dense, information-rich representation of the human genome. We use data from the Roadmap Epigenomics Consortium to demonstrate that this learned representation of the genome is broadly useful: first, by imputing epigenomic data more accurately than previous methods, and second, by showing that machine learning models that exploit this representation outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, and elements of 3D chromatin architecture. Our findings suggest the broad utility of Avocado’s learned latent representation for computational genomics and epigenomics.

[1]  Michael A. Beer,et al.  Local epigenomic state cannot discriminate interacting and non-interacting enhancer–promoter pairs with high accuracy , 2018, bioRxiv.

[2]  I. Amit,et al.  Comprehensive mapping of long range interactions reveals folding principles of the human genome , 2011 .

[3]  O. Troyanskaya,et al.  Predicting effects of noncoding variants with deep learning–based sequence model , 2015, Nature Methods.

[4]  Marc-Thorsten Hütt,et al.  Toward a theory of coactivation patterns in excitable neural networks , 2018, PLoS Comput. Biol..

[5]  Yoshua Bengio,et al.  Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.

[6]  William Stafford Noble,et al.  Unsupervised pattern discovery in human chromatin structure through genomic segmentation , 2012, Nature Methods.

[7]  William Stafford Noble,et al.  PREDICTD PaRallel Epigenomics Data Imputation with Cloud-based Tensor Decomposition , 2018, Nature Communications.

[8]  Yanjun Qi,et al.  DeepChrome: deep-learning for predicting gene expression from histone modifications , 2016, Bioinform..

[9]  Jesse R. Dixon,et al.  Topological Domains in Mammalian Genomes Identified by Analysis of Chromatin Interactions , 2012, Nature.

[10]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[11]  Ankur Taly,et al.  Axiomatic Attribution for Deep Networks , 2017, ICML.

[12]  Nathan C. Sheffield,et al.  The accessible chromatin landscape of the human genome , 2012, Nature.

[13]  Hal Daumé,et al.  Frustratingly Easy Domain Adaptation , 2007, ACL.

[14]  William Stafford Noble,et al.  Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project , 2007, Nature.

[15]  Manolis Kellis,et al.  ChromHMM: automating chromatin-state discovery and characterization , 2012, Nature Methods.

[16]  Arshdeep Sekhon,et al.  Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin , 2017, bioRxiv.

[17]  Stefan Carlsson,et al.  CNN Features Off-the-Shelf: An Astounding Baseline for Recognition , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[18]  Jürgen Schmidhuber,et al.  Deep learning in neural networks: An overview , 2014, Neural Networks.

[19]  George Trigeorgis,et al.  A Deep Matrix Factorization Method for Learning Attribute Representations , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  William Stafford Noble,et al.  Integrative annotation of chromatin elements from ENCODE data , 2012, Nucleic acids research.

[21]  K. Pollard,et al.  Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin , 2016, Nature Genetics.

[22]  Geir Kjetil Sandve,et al.  In the loop: promoter–enhancer interactions and bioinformatics , 2015, Briefings Bioinform..

[23]  David R. Kelley,et al.  Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks , 2015, bioRxiv.

[24]  Qiang Yang,et al.  A Survey on Transfer Learning , 2010, IEEE Transactions on Knowledge and Data Engineering.

[25]  Scott Lundberg,et al.  An unexpected unity among methods for interpreting model predictions , 2016, ArXiv.

[26]  Manolis Kellis,et al.  Large-scale epigenome imputation improves data quality and disease variant enrichment , 2015, Nature Biotechnology.

[27]  T. Meehan,et al.  An atlas of active enhancers across human cell types and tissues , 2014, Nature.

[28]  Timothy J. Durham,et al.  "Systematic" , 1966, Comput. J..

[29]  Avanti Shrikumar,et al.  Learning Important Features Through Propagating Activation Differences , 2017, ICML.

[30]  John Salvatier,et al.  Theano: A Python framework for fast computation of mathematical expressions , 2016, ArXiv.

[31]  Daniel L. Vera,et al.  Genome-wide analysis of replication timing by next-generation sequencing with E/L Repli-seq , 2018, Nature Protocols.

[32]  William Stafford Noble,et al.  Machine learning applications in genetics and genomics , 2015, Nature Reviews Genetics.

[33]  Nathaniel D. Heintzman,et al.  Histone modifications at human enhancers reflect global cell-type-specific gene expression , 2009, Nature.

[34]  Anthony D. Schmitt,et al.  A Compendium of Chromatin Contact Maps Reveals Spatially Active Regions in the Human Genome. , 2016, Cell reports.

[35]  S. Dalton,et al.  Evolutionarily conserved replication timing profiles predict long-range chromatin interactions and distinguish closely related cell types. , 2010, Genome research.

[36]  Jicong Fan,et al.  Matrix completion by deep matrix factorization , 2018, Neural Networks.

[37]  Hendrik Blockeel,et al.  Demystifying Relational Latent Representations , 2017, ILP.

[38]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[39]  ENCODEConsortium,et al.  An Integrated Encyclopedia of DNA Elements in the Human Genome , 2012, Nature.

[40]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[41]  William Stafford Noble,et al.  Topologically associating domains and their long-range contacts are established during early G1 coincident with the establishment of the replication-timing program , 2015, Genome research.

[42]  Carlos Guestrin,et al.  "Why Should I Trust You?": Explaining the Predictions of Any Classifier , 2016, ArXiv.

[43]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[44]  Maksims Volkovs,et al.  Content-based Neighbor Models for Cold Start in Recommender Systems , 2017, RecSys 2017.

[45]  Vlad Sandulescu,et al.  Predicting the future relevance of research institutions - The winning solution of the KDD Cup 2016 , 2016, ArXiv.