Gene Expression Imputation with Generative Adversarial Imputation Nets

A question of fundamental biological significance is to what extent the expression of a subset of genes can be used to recover the full transcriptome, a question with important implications for biological discovery and clinical application. To address this challenge, we present GAIN-GTEx, a method for gene expression imputation based on Generative Adversarial Imputation Nets (GAIN). To increase the applicability of our approach, we leverage data from GTEx v8, a reference resource that has generated a comprehensive collection of transcriptomes from a diverse set of human tissues. We compare our model with several standard and state-of-the-art imputation methods and show that GAIN-GTEx is significantly superior in both predictive performance and runtime. Furthermore, our results indicate strong generalisation on RNA-Seq data from three cancer types across varying levels of missingness. Our work can facilitate cost-effective integration of large-scale RNA biorepositories into genomic studies of disease, with high applicability across diverse tissue types.
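The imputation setting described in the abstract can be illustrated with a minimal sketch. It is not the GAIN-GTEx implementation: the data are synthetic, the helper names (`make_mask`, `combine`, `mean_impute`, `masked_rmse`) are hypothetical, and a per-gene mean stands in for the trained generator. The only part taken from the GAIN formulation is the combination step, in which observed values are kept and only the masked entries are filled from the model output.

```python
import numpy as np

def make_mask(shape, missing_rate, rng):
    """Binary mask: 1 = observed, 0 = missing (dropped completely at random)."""
    return (rng.random(shape) >= missing_rate).astype(np.float64)

def combine(x, mask, x_hat):
    """Keep observed entries of x and fill the masked ones from x_hat."""
    return mask * x + (1.0 - mask) * x_hat

def mean_impute(x, mask):
    """Placeholder for a generator: per-gene mean over the observed entries."""
    observed_count = np.maximum(mask.sum(axis=0), 1.0)
    gene_means = (x * mask).sum(axis=0) / observed_count
    return np.broadcast_to(gene_means, x.shape)

def masked_rmse(x, x_imputed, mask):
    """Root mean squared error restricted to the entries that were hidden."""
    hidden = 1.0 - mask
    n_hidden = max(hidden.sum(), 1.0)
    return float(np.sqrt((hidden * (x - x_imputed) ** 2).sum() / n_hidden))

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 20))        # synthetic stand-in for a samples x genes matrix
mask = make_mask(x.shape, 0.5, rng)   # assumed 50% missingness, for illustration only
x_imputed = combine(x, mask, mean_impute(x, mask))
rmse = masked_rmse(x, x_imputed, mask)
```

Scoring only the hidden entries keeps the error from being diluted by values the imputer was given. Varying `missing_rate` is one simple way to probe behaviour across levels of missingness.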
