E2M: A Deep Learning Framework for Associating Combinatorial Methylation Patterns with Gene Expression

Motivation: We focus on the new problem of determining which methylation patterns in gene promoters are strongly associated with gene expression in cancer cells of different types. Although a number of results on the influence of methylation on expression have been reported in the literature, our approach is unique insofar as it retrospectively predicts the combinations of methylated sites in gene promoter regions that are reflected in the expression data. Reversing the traditional prediction order in many cases makes estimation of the model parameters easier, because real-valued data are used to predict categorical data rather than vice versa; in addition, our approach allows one to better assess the overall influence of methylation in modulating expression via state-of-the-art learning methods. For this purpose, we developed a novel neural network learning framework termed E2M (Expression-to-Methylation) that predicts the status of different methylation sites in the promoter regions of several biomarker genes from a sufficient statistic of the whole gene expression profile, captured through Landmark genes. We ran our experiments on both unquantized and quantized expression sets and neural network weights to illustrate the robustness of the method and to reduce the storage footprint of the processing pipeline.

Results: We implemented a number of machine learning algorithms to address the new problem of methylation pattern inference, including multiclass regression, canonical correlation analysis (CCA), naive fully connected neural networks, and inception neural networks. Inception neural networks such as E2M learners outperform all other techniques and offer an average prediction accuracy of 82% when tested on 3,671 pan-cancer samples including low-grade glioma, glioblastoma, lung adenocarcinoma, lung squamous cell carcinoma, and stomach adenocarcinoma.
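To make the categorical prediction target concrete, the following minimal sketch shows one way a combination of methylated promoter sites could be turned into a single class label. It assumes each site has already been dichotomized into methylated (1) or unmethylated (0); the function names `encode_patterns` and `decode_pattern`, the binary-to-integer encoding, and the toy data are illustrative assumptions, not necessarily the representation used in the E2M code.

```python
import numpy as np

def encode_patterns(binary_sites):
    """Map each row of 0/1 methylation calls to one integer class label.

    Hypothetical helper: the first column is treated as the most
    significant bit, so [1, 0, 1] becomes 5.
    """
    sites = np.asarray(binary_sites, dtype=np.int64)
    n_sites = sites.shape[1]
    weights = 2 ** np.arange(n_sites - 1, -1, -1)
    return sites @ weights

def decode_pattern(label, n_sites):
    """Recover the list of 0/1 site calls from an integer class label."""
    return [int((int(label) >> shift) & 1) for shift in range(n_sites - 1, -1, -1)]

# Toy data (made up): 4 samples, 3 promoter sites of one gene.
calls = np.array([
    [0, 0, 1],
    [1, 0, 1],
    [1, 0, 1],
    [1, 1, 1],
])

labels = encode_patterns(calls)
classes, counts = np.unique(labels, return_counts=True)

for cls, cnt in zip(classes, counts):
    print(decode_pattern(cls, calls.shape[1]), "observed in", int(cnt), "sample(s)")
```

Counting how often each label occurs, as in the last lines, is a simple way to see whether a gene is dominated by one methylation pattern or splits into several sizeable classes; a classifier would then take expression values as input and one of these labels as its target.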
As an illustrative example, the prediction accuracy for the methylation pattern in the promoter of gene GATA6 in glioblastoma samples increases by 20% when inception rather than simple fully connected neural networks are used. These performance results remain largely unchanged even when both expression values and network weights are quantized. Our work also provides new insight into the importance of specific methylation site patterns for expression variations in different genes. In this context, we identified genes for which the overwhelming majority of patients exhibit one methylation pattern, and other genes with three or more significant classes of methylation patterns. Inception networks identify such patterns with high accuracy and suggest a possible stratification of cancers based on methylation pattern profiles.

Availability: The E2M code and datasets are freely available at https://github.com/jianhao2016/E2M

Contact: idoia@illinois.edu, milenkov@illinois.edu
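The quantization of expression values and network weights mentioned in the Results can be illustrated with a simple uniform scalar quantizer. The abstract does not specify the quantization scheme, so the sketch below is an assumption: the functions `uniform_quantize` and `dequantize`, the 4-bit setting, and the random stand-in data are all hypothetical and chosen only to show how storage can be reduced while bounding the reconstruction error.

```python
import numpy as np

def uniform_quantize(values, n_bits=8):
    """Quantize an array to 2**n_bits evenly spaced levels (n_bits <= 16).

    Returns integer codes plus the offset and step needed to reconstruct
    approximate real values. Illustrative scheme, not the E2M one.
    """
    x = np.asarray(values, dtype=np.float64)
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** n_bits - 1
    if hi == lo:
        # Constant input: every value maps to code 0.
        return np.zeros(x.shape, dtype=np.uint16), lo, 1.0
    step = (hi - lo) / levels
    codes = np.round((x - lo) / step).astype(np.uint16)
    return codes, lo, step

def dequantize(codes, lo, step):
    """Map integer codes back to approximate real values."""
    return lo + codes.astype(np.float64) * step

# Stand-in for an expression matrix or a weight matrix (random, made up).
rng = np.random.default_rng(0)
expression = rng.normal(size=(4, 6))

codes, lo, step = uniform_quantize(expression, n_bits=4)
restored = dequantize(codes, lo, step)
max_error = float(np.max(np.abs(restored - expression)))

print("largest code:", int(codes.max()))
print("max reconstruction error:", max_error, "half step:", step / 2)
```

Because each value is rounded to the nearest level, the reconstruction error is at most half a quantization step; the same routine could be applied to a trained weight matrix before evaluating accuracy on quantized inputs.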
