Deep learning based methods for estimating distribution of coalescence rates from genome-wide data

Demographic and population structure inference is one of the most important problems in genomics. Population parameters such as effective population sizes, population split times and migration rates are of high interest both themselves and for many applications, e.g. for genome-wide association studies. Hidden Markov Model (HMM) based methods, such as PSMC, MSMC, coalHMM etc., proved to be powerful and useful for estimation of these parameters in many population genetics studies. At the same time, machine and deep learning have began to be used in natural science widely. In particular, deep learning based approaches have already substituted hidden Markov models in many areas, such as speech recognition or user input prediction. We develop a deep learning (DL) approach for local coalescent time estimation from one whole diploid genome. Our DL models are trained on simulated datasets. Importantly, demographic and population parameters can be inferred based on the distribution of coalescent times. We expect that our approach will be useful under complex population scenarios, which cannot be studied with existing HMM based methods. Our work is also a crucial step in developing a deep learning framework which would allow to create population genomics methods for different genomic data representations.

[1]  Ryan D. Hernandez,et al.  Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data , 2009, PLoS genetics.

[2]  Yun S. Song,et al.  Efficiently Inferring the Demographic History of Many Populations With Allele Count Data , 2020, Journal of the American Statistical Association.

[3]  R. Durbin,et al.  Inference of human population history from individual whole-genome sequences. , 2011, Nature.

[4]  Paul Marjoram,et al.  Fast "coalescent" simulation , 2006, BMC Genetics.

[5]  Quoc V. Le,et al.  Sequence to Sequence Learning with Neural Networks , 2014, NIPS.

[6]  Jared O'Connell,et al.  Tracking human population structure through time from whole genome sequences , 2019, bioRxiv.

[7]  Yun S. Song,et al.  Efficiently inferring the demographic history of many populations with allele count data , 2018, bioRxiv.

[8]  R. Durbin,et al.  Inferring human population size and separation history from multiple genome sequences , 2014, Nature Genetics.

[9]  G. McVean,et al.  Approximating the coalescent with recombination , 2005, Philosophical Transactions of the Royal Society B: Biological Sciences.

[10]  Phil Blunsom,et al.  A Convolutional Neural Network for Modelling Sentences , 2014, ACL.

[11]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[12]  Fabian J Theis,et al.  Deep learning: new computational modelling techniques for genomics , 2019, Nature Reviews Genetics.

[13]  Jerome Kelleher,et al.  Efficient Coalescent Simulation and Genealogical Analysis for Large Sample Sizes , 2015, bioRxiv.