Creating Artificial Human Genomes Using Generative Models

Generative models have achieved breakthroughs across a wide spectrum of domains, driven by recent advances in machine learning algorithms and increased computational power. Despite these achievements, their ability to create realistic synthetic data remains under-exploited in genetics and is absent from population genetics. Yet a known limitation of this field is restricted access to many genetic databases because of concerns about violations of individual privacy, even though these databases would provide a rich resource for data mining and integration towards advancing genetic studies. In this study, we demonstrated that deep generative adversarial networks (GANs) and restricted Boltzmann machines (RBMs) can be trained to learn the high-dimensional distributions of real genomic datasets and create high-quality artificial genomes (AGs) with little to no privacy loss. To illustrate the promise of our method, we showed that (i) imputation quality for low-frequency alleles can be improved by augmenting reference panels with AGs, (ii) scores obtained from selection tests on AGs and real genomes are highly correlated, and (iii) AGs can inherit genotype-phenotype associations. AGs have the potential to become valuable assets in genetic studies by providing high-quality anonymous substitutes for private databases.
