Approximate Sampling Formulae for General Finite-Alleles Models of Mutation

Many genetic analyses rely on sampling distributions, which give the probability of observing a particular sample of DNA sequences drawn at random from a population. In the one-locus case with special models of mutation, such as the infinite-alleles model or the finite-alleles parent-independent mutation model, closed-form sampling distributions under the coalescent have been known for many decades. However, no exact formula is currently known for the more general models of mutation that are of biological interest. In this paper, models with finitely many alleles are considered, and an urn construction related to the coalescent is used to derive approximate closed-form sampling formulae for an arbitrary irreducible recurrent mutation model or for a reversible recurrent mutation model, depending on whether the number of distinct observed allele types is at most three or four, respectively. It is demonstrated empirically that the formulae derived here are highly accurate when the per-base mutation rate is low, which holds for many biological organisms.
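
For orientation, the classical closed form for the finite-alleles parent-independent mutation (PIM) model mentioned above can be sketched in a few lines. This is the well-known Dirichlet-multinomial formula, not one of the approximate formulae derived in the paper. The sketch assumes the common convention that each lineage mutates at rate θ/2 and that the new allele is drawn from a distribution π regardless of the parent allele; the function names and parameter names are illustrative only.

```python
from math import factorial, prod


def rising(x, k):
    """Rising factorial x (x+1) ... (x+k-1); equals 1 when k == 0."""
    result = 1.0
    for j in range(k):
        result *= x + j
    return result


def pim_sampling_probability(counts, theta, pi):
    """Probability of an unordered sample with allele counts `counts`.

    Assumed convention (not taken from the paper): mutation rate theta/2
    per lineage, new allele type i chosen with probability pi[i].
    """
    n = sum(counts)
    # Number of orderings of the sample that give the same allele counts.
    multinomial = factorial(n) // prod(factorial(c) for c in counts)
    # Product of rising factorials of theta * pi_i, one per allele type.
    numerator = prod(rising(theta * p, c) for p, c in zip(pi, counts))
    return multinomial * numerator / rising(theta, n)
```

For example, `pim_sampling_probability((2, 1), 1.0, (0.5, 0.5))` evaluates the probability of seeing two copies of the first allele and one of the second in a sample of size three. No comparable exact expression is known for general mutation models, which is the gap the approximate formulae address.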