Importance Sampling for the Infinite Sites Model

Importance sampling or Markov chain Monte Carlo sampling is required for state-of-the-art statistical analysis of population genetics data. The applicability of these sampling-based inference techniques depends crucially on the proposal distribution. In this paper, we discuss importance sampling for the infinite sites model. The infinite sites assumption is attractive because it constrains the number of possible genealogies, thereby allowing for the analysis of larger data sets. We recall the Griffiths-Tavaré and Stephens-Donnelly proposals and emphasize the relation between the latter proposal and exact sampling from the infinite alleles model. We also introduce a new proposal that takes knowledge of the ancestral state into account. The new proposal is derived from a new result on exact sampling from a single site. The methods are illustrated on simulated data sets and on the data considered in Griffiths and Tavaré (1994).

[1]  M. The sampling theory of neutral alleles and an urn model in population genetics. 2003.

[2]  Peter Beerli, et al. Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population samples of molecular data. 1999.

[3]  R. Griffiths, et al. Ewens' sampling formula and related formulae: combinatorial proofs, extensions to variable population size and applications to ages of alleles. Theoretical Population Biology, 2005.

[4]  R. Hudson, et al. Statistical properties of the number of recombination events in the history of a sample of DNA sequences. Genetics, 1985.

[5]  Yun S. Song, et al. Counting All Possible Ancestral Configurations of Sample Sequences in Population Genetics. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2006.

[6]  J. Kingman. Random partitions in population genetics. Proceedings of the Royal Society of London. Series B. Biological Sciences, 1978.

[7]  Albert D. Shieh, et al. Statistical Applications in Genetics and Molecular Biology. 2010.

[8]  A. Raftery, et al. Local Adaptive Importance Sampling for Multivariate Densities with Strong Nonlinear Relationships. 1996.

[9]  P. Donnelly, et al. Conditional genealogies and the age of a neutral mutant. Theoretical Population Biology, 1999.

[10]  W. Ewens. The sampling theory of selectively neutral alleles. Theoretical Population Biology, 1972.

[11]  M. De Iorio, et al. Importance sampling on coalescent histories. I. Advances in Applied Probability, 2004.

[12]  S. Tavaré, et al. Ancestral Inference in Population Genetics. 1994.

[13]  J. McGregor, et al. Addendum to a paper of W. Ewens. Theoretical Population Biology, 1972.

[14]  Richard R. Hudson, et al. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinform., 2002.

[15]  M. Stephens. Times on trees, and the age of an allele. Theoretical Population Biology, 2000.

[16]  P. Donnelly, et al. Partition structures, Polya urns, the Ewens sampling formula, and the ages of alleles. Theoretical Population Biology, 1986.

[17]  P. Donnelly, et al. Inference in molecular population genetics. 2000.

[18]  Jun S. Liu, et al. Monte Carlo strategies in scientific computing. 2001.

[19]  S. Ethier, et al. The Infinitely-Many-Sites Model as a Measure-Valued Diffusion. 1987.

[20]  Jon A Yamato, et al. Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics, 1995.

[21]  Y. Fu, et al. Statistical properties of segregating sites. Theoretical Population Biology, 1995.

[22]  Dan Gusfield, et al. Efficient algorithms for inferring evolutionary trees. Networks, 1991.