ImaGene: a convolutional neural network to quantify natural selection from genomic data

The genetic bases of many complex phenotypes are still largely unknown, mostly due to the polygenic nature of the traits and the small effect of each associated mutation. An alternative to classic association studies for determining such genetic bases is an evolutionary framework. As sites targeted by natural selection are likely to harbor important functionalities for the carrier, identifying selection signatures in the genome has the potential to unveil the genetic mechanisms underpinning human phenotypes. Popular methods for detecting such signals rely on compressing genomic information into summary statistics, which results in a loss of information. Furthermore, few methods are able to quantify the strength of selection. Here we explored the use of deep learning in evolutionary biology and implemented a program, called ImaGene, that applies convolutional neural networks to population genomic data for the detection and quantification of natural selection. ImaGene enables genomic information from multiple individuals to be represented as abstract images. Each image is created by stacking aligned genomic data and encoding distinct alleles as separate colors. To detect and quantify signatures of positive selection, ImaGene implements a convolutional neural network that is trained using simulations. We show how the method implemented in ImaGene can be affected by data manipulation and learning strategies. In particular, we show how sorting images by row and column leads to accurate predictions. We also demonstrate how misspecification of the demographic model used to produce training data can influence the quantification of positive selection. Finally, we illustrate an approach to estimating the selection coefficient, a continuous variable, using multiclass classification techniques.
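The image representation and the multiclass estimate described above can be made concrete with a minimal Python sketch. It is not ImaGene's code or API: the function names, the choice to sort rows by haplotype frequency and columns by derived-allele count, and the probability-weighted mean used as a point estimate are all assumptions made for illustration, since the abstract does not specify these details.

```python
from collections import Counter


def sort_rows_by_frequency(matrix):
    """Order rows (haplotypes) so the most common haplotype comes first.

    Assumed sorting criterion; ties are broken by row content for determinism.
    """
    rows = [tuple(row) for row in matrix]
    counts = Counter(rows)
    return [list(r) for r in sorted(rows, key=lambda r: (-counts[r], r))]


def sort_columns_by_frequency(matrix):
    """Order columns (sites) by descending derived-allele count (assumed criterion)."""
    n_sites = len(matrix[0])
    derived = [sum(row[j] for row in matrix) for j in range(n_sites)]
    # sorted() is stable, so sites with equal counts keep their original order.
    order = sorted(range(n_sites), key=lambda j: -derived[j])
    return [[row[j] for j in order] for row in matrix]


def to_image(matrix):
    """Encode alleles as pixel intensities: ancestral (0) -> 0, derived (1) -> 255."""
    return [[255 * allele for allele in row] for row in matrix]


def expected_coefficient(class_probs, class_values):
    """Point estimate of a continuous parameter from class probabilities.

    Hypothetical post-processing step: the probability-weighted mean of the
    parameter value assigned to each class.
    """
    return sum(p * v for p, v in zip(class_probs, class_values))


# Toy alignment: 4 haplotypes (rows) x 4 biallelic sites (columns).
haplotypes = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
]
sorted_rows = sort_rows_by_frequency(haplotypes)
sorted_matrix = sort_columns_by_frequency(sorted_rows)
image = to_image(sorted_matrix)

# Made-up class probabilities over three assumed selection-coefficient bins.
estimate = expected_coefficient([0.25, 0.5, 0.25], [0, 200, 400])
```

Sorting gives the network an input whose layout does not depend on the arbitrary order of sampled individuals, which is one plausible reason for the benefit reported above. Treating the coefficient bins as classes also yields a full distribution over values, from which a point estimate such as the weighted mean can be taken.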
While the use of deep learning in evolutionary genomics is in its infancy, here we demonstrated its potential to detect informative patterns from large-scale genomic data. We implemented methods to process genomic data for deep learning in a user-friendly program called ImaGene. The joint inference of the evolutionary history of mutations and their functional impact will facilitate mapping studies and provide novel insights into the molecular mechanisms associated with human phenotypes.
