A Likelihood-Free Inference Framework for Population Genetic Data using Exchangeable Neural Networks

An explosion of high-throughput DNA sequencing in the past decade has led to a surge of interest in population-scale inference with whole-genome data. Recent work in population genetics has centered on designing inference methods for relatively simple model classes, and few scalable general-purpose inference techniques exist for more realistic, complex models. Developing such techniques requires addressing two inferential challenges: (1) population data are exchangeable, calling for methods that efficiently exploit the symmetries of the data, and (2) computing likelihoods is intractable because it requires integrating over a set of correlated, extremely high-dimensional latent variables. These challenges are traditionally tackled by likelihood-free methods that use scientific simulators to generate datasets and reduce them to hand-designed, permutation-invariant summary statistics, often leading to inaccurate inference. In this work, we develop an exchangeable neural network that performs summary-statistic-free, likelihood-free inference. Our framework can be applied in a black-box fashion across a variety of simulation-based tasks, both within and outside biology. We demonstrate the power of our approach on the recombination hotspot testing problem, outperforming the state of the art.
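To make the notion of an exchangeable network concrete, the snippet below is a minimal sketch of a permutation-invariant classifier in the spirit of Deep Sets: each haplotype (row of the genotype matrix) is embedded by a shared feature extractor, the row embeddings are combined by a symmetric pooling operation, and a small head maps the pooled representation to class logits (e.g., hotspot vs. no hotspot). The layer sizes, the use of 1-D convolutions over SNP positions, and all module names are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical sketch of an exchangeable (permutation-invariant) classifier
# for a genotype matrix of shape (n_individuals, n_sites).  Architecture
# details are illustrative assumptions, not the paper's exact model.
import torch
import torch.nn as nn


class ExchangeableClassifier(nn.Module):
    def __init__(self, n_sites: int, embed_dim: int = 64, n_classes: int = 2):
        super().__init__()
        # Shared per-individual feature extractor: applied identically to every
        # row, so relabeling individuals cannot change its per-row outputs.
        self.row_encoder = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(8),
            nn.Flatten(),
            nn.Linear(32 * 8, embed_dim),
            nn.ReLU(),
        )
        # Classifier acting on the pooled, order-free representation.
        self.head = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_individuals, n_sites) binary haplotype matrices
        b, n, s = x.shape
        rows = x.reshape(b * n, 1, s)               # encode each row independently
        feats = self.row_encoder(rows).reshape(b, n, -1)
        pooled = feats.mean(dim=1)                  # symmetric pooling => invariance
        return self.head(pooled)                    # class logits


if __name__ == "__main__":
    model = ExchangeableClassifier(n_sites=40)
    x = torch.randint(0, 2, (8, 20, 40)).float()    # simulated haplotype matrices
    perm = x[:, torch.randperm(20), :]              # shuffle the individuals
    # Outputs agree up to floating-point error, illustrating exchangeability.
    print(torch.allclose(model(x), model(perm), atol=1e-5))
```

Because the only interaction across individuals happens through the mean pooling step, any permutation of the rows leaves the prediction unchanged by construction, which is the symmetry the abstract refers to.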
