Efficient Computation of the Joint Sample Frequency Spectra for Multiple Populations

ABSTRACT A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic that describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences and provides a highly efficient dimensional reduction of large-scale population genomic variation data. Recently, there has been much interest in analyzing joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, and admixture proportions. SFS-based inference methods require accurate computation of the expected SFS under a given demographic model. Although much methodological progress has been made, existing methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this article, we present new analytic formulas and algorithms that enable accurate, efficient computation of the expected joint SFS for thousands of individuals sampled from hundreds of populations related by a complex demographic model with arbitrary population size histories (including piecewise-exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study, we demonstrate our improvements to numerical stability and computational complexity.
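To make the summary statistic concrete, the following is a minimal sketch of how an observed joint SFS could be tabulated from sampled sequences. It is not the method of this article (which concerns the expected joint SFS under a demographic model) and does not use the momi package; the function name `joint_sfs`, the 0/1 encoding of ancestral and derived alleles, and the toy data are all assumptions made for illustration.

```python
from collections import Counter

def joint_sfs(sites, pop_of_sample):
    """Tabulate an observed joint SFS (illustrative helper, not part of momi).

    sites: list of sites; each site is a list of 0/1 alleles
           (1 = derived/mutant), one entry per sampled sequence.
    pop_of_sample: population label of each sampled sequence.
    Returns a Counter mapping a tuple of per-population derived-allele
    counts (populations in sorted label order) to the number of sites
    showing that configuration.
    """
    pops = sorted(set(pop_of_sample))
    n = len(pop_of_sample)
    sfs = Counter()
    for site in sites:
        total = sum(site)
        if total == 0 or total == n:
            continue  # skip sites that are not polymorphic in the sample
        counts = tuple(
            sum(a for a, p in zip(site, pop_of_sample) if p == pop)
            for pop in pops
        )
        sfs[counts] += 1
    return sfs

# Toy data: 5 sampled sequences (2 from population "A", 3 from "B"), 5 sites.
sites = [
    [1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],  # monomorphic, ignored
    [1, 0, 0, 0, 0],
]
labels = ["A", "A", "B", "B", "B"]
spectrum = joint_sfs(sites, labels)
```

Each key of `spectrum` is one entry of the joint SFS: for example, the key `(2, 1)` counts sites where the mutant allele appears twice in population A and once in population B. Storing only these counts, rather than the full sequences, is the dimensional reduction the abstract refers to.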
