Inference in molecular population genetics

Full likelihood‐based inference for modern population genetics data presents methodological and computational challenges. The problem is of considerable practical importance and has attracted recent attention, with the development of algorithms based on importance sampling (IS) and Markov chain Monte Carlo (MCMC) sampling. Here we introduce a new IS algorithm. The optimal proposal distribution for these problems can be characterized, and we exploit a detailed analysis of genealogical processes to develop a practicable approximation to it. We compare the new method with existing algorithms on a variety of genetic examples. Our approach substantially outperforms existing IS algorithms, with efficiency typically improved by several orders of magnitude. The new method also compares favourably with existing MCMC methods in some problems, and less favourably in others, suggesting that both IS and MCMC methods have a continuing role to play in this area. We offer insights into the relative advantages of each approach, and we discuss diagnostics in the IS framework.
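As a purely illustrative aside, the basic mechanics of importance sampling and a simple weight-based diagnostic can be sketched on a toy problem. This is not the genealogical algorithm introduced here: the target (a standard normal tail probability), the function names, and the use of effective sample size as the diagnostic are all assumptions chosen for illustration. The sketch only shows how strongly the choice of proposal distribution can affect efficiency.

```python
import math
import random


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of a normal distribution with the given mean and sd."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def importance_sampling_tail(threshold, n_samples, proposal_mean, rng):
    """Estimate P(X > threshold) for X ~ N(0, 1) by sampling from the
    proposal N(proposal_mean, 1) and reweighting by target/proposal.

    Returns the estimate and the effective sample size (ESS) of the weights.
    Hypothetical helper for illustration only.
    """
    weights = []
    for _ in range(n_samples):
        x = rng.gauss(proposal_mean, 1.0)
        indicator = 1.0 if x > threshold else 0.0
        # Importance weight: integrand times target density over proposal density.
        weights.append(indicator * normal_pdf(x) / normal_pdf(x, proposal_mean))
    total = sum(weights)
    sum_sq = sum(w * w for w in weights)
    estimate = total / n_samples
    # ESS = (sum w)^2 / sum w^2; small values signal a few dominant weights.
    ess = (total * total / sum_sq) if sum_sq > 0.0 else 0.0
    return estimate, ess


rng = random.Random(0)
# Poor proposal: the target itself, which almost never reaches the tail.
naive_est, naive_ess = importance_sampling_tail(4.0, 20000, 0.0, rng)
# Better proposal: centred on the region that contributes to the integral.
tuned_est, tuned_ess = importance_sampling_tail(4.0, 20000, 4.0, rng)
# Exact tail probability, for comparison.
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))

print(f"exact={exact:.3e} naive={naive_est:.3e} tuned={tuned_est:.3e}")
print(f"ESS naive={naive_ess:.1f} tuned={tuned_ess:.1f}")
```

With the same number of samples, the proposal centred on the tail gives a far more stable estimate than sampling from the target itself. A low effective sample size is one warning sign of an unreliable estimate, although a high value does not by itself guarantee accuracy.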
