Variational Autoencoders for Cancer Data Integration: Design Principles and Computational Practice

International initiatives such as the Molecular Taxonomy of Breast Cancer International Consortium are collecting multiple data sets at different genomic scales with the aim of identifying novel cancer biomarkers and predicting patient survival. To analyze such data, several machine learning, bioinformatics, and statistical methods have been applied, among them neural networks such as autoencoders. Although these models provide a good statistical learning framework for analyzing multi-omic and/or clinical data, there is a distinct lack of work on how to integrate diverse patient data and how to identify the design best suited to the available data. In this paper, we investigate several autoencoder architectures that integrate a variety of cancer patient data types (e.g., multi-omics and clinical data). We perform extensive analyses of these approaches and provide a clear methodological and computational framework for designing systems that enable clinicians to investigate cancer traits and translate the results into clinical applications. We demonstrate how these networks can be designed, built, and, in particular, applied to integrative analyses of heterogeneous breast cancer data. The results show that these approaches yield relevant data representations that, in turn, lead to accurate and stable diagnosis.
