GAN-based data augmentation for transcriptomics: survey and comparative assessment

Abstract

Motivation: Transcriptomics data are becoming more accessible thanks to high-throughput and less costly sequencing methods. However, data scarcity prevents deep learning models from reaching their full predictive power in phenotype prediction. Artificially enlarging the training set, known as data augmentation, has been suggested as a regularization strategy. Data augmentation relies on label-invariant transformations of the training set (e.g. geometric transformations on images and syntax parsing on text data). Unfortunately, no such transformations are known in the transcriptomic field. Deep generative models such as generative adversarial networks (GANs) have therefore been proposed to generate additional samples. In this article, we analyze GAN-based data augmentation strategies with respect to performance indicators and the classification of cancer phenotypes.

Results: This work highlights a significant boost in binary and multiclass classification performance due to augmentation strategies. Without augmentation, a classifier trained on only 50 RNA-seq samples reaches an accuracy of 94% for binary classification and 70% for tissue classification. Adding 1000 augmented samples raises these accuracies to 98% and 94%, respectively. Richer GAN architectures and more expensive training yield better augmentation performance and higher overall quality of the generated data. Further analysis of the generated data shows that several performance indicators are needed to assess its quality correctly.

Availability and implementation: All data used for this research are publicly available and come from The Cancer Genome Atlas. Reproducible code is available on the GitLab repository: https://forge.ibisc.univ-evry.fr/alacan/GANs-for-transcriptomics
