Inference of Population Structure Under a Dirichlet Process Model

Inferring population structure from genetic data sampled from a set of individuals is a formidable statistical problem. One widely used approach considers the number of populations to be fixed and calculates the posterior probability of assigning individuals to each population. More recently, the assignment of individuals to populations and the number of populations have both been considered random variables that follow a Dirichlet process prior. We examined the statistical behavior of assignment of individuals to populations under a Dirichlet process prior. First, we examined a best-case scenario, in which all of the assumptions of the Dirichlet process prior were satisfied, by generating data under a Dirichlet process prior. Second, we examined the performance of the method when the genetic data were generated under a population genetics model with symmetric migration between populations. We measured the accuracy of population assignment using a distance on partitions. The method can be quite accurate with a moderate number of loci. As expected, inferences on the number of populations are more accurate when θ = 4N_e u is large and when the migration rate (4N_e m) is low. We also examined the sensitivity of inferences of population structure to the choice of the parameter of the Dirichlet process model. Although inferences could be sensitive to the choice of the prior on the number of populations, this sensitivity occurred when the number of loci sampled was small; inferences are more robust to the prior on the number of populations when the number of sampled loci is large. Finally, we discuss several methods for summarizing the results of a Bayesian Markov chain Monte Carlo (MCMC) analysis of population structure. We develop the notion of the mean population partition, which is the partition of individuals to populations that minimizes the squared partition distance to the partitions sampled by the MCMC algorithm.
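The partition distance and the mean partition described above can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the abstract: partitions are encoded as lists of arbitrary population labels (one per individual), the partition distance is taken to be the minimum number of individuals that must be removed for two partitions to become identical, and the search for the mean partition is restricted to the sampled partitions themselves, which is a simplification rather than a search over all possible partitions. The function names `partition_distance` and `mean_partition_among_samples` are hypothetical, and the brute-force matching is only practical for a small number of populations.

```python
from itertools import permutations


def partition_distance(a, b):
    """Minimum number of individuals to remove so that partitions a and b agree.

    a and b are equal-length sequences of population labels; the labels
    themselves are arbitrary, only the grouping matters.
    """
    if len(a) != len(b):
        raise ValueError("partitions must cover the same individuals")
    labels_a = sorted(set(a))
    labels_b = sorted(set(b))

    # overlap[(x, y)] = number of individuals in population x of a and y of b
    overlap = {(x, y): 0 for x in labels_a for y in labels_b}
    for x, y in zip(a, b):
        overlap[(x, y)] += 1

    # Match each population of the smaller partition to a distinct population
    # of the larger one, maximizing the number of individuals kept.
    if len(labels_a) <= len(labels_b):
        small, large = labels_a, labels_b
        shared = lambda s, l: overlap[(s, l)]
    else:
        small, large = labels_b, labels_a
        shared = lambda s, l: overlap[(l, s)]

    best = 0
    for perm in permutations(large, len(small)):  # brute force, small K only
        kept = sum(shared(s, l) for s, l in zip(small, perm))
        best = max(best, kept)
    return len(a) - best


def mean_partition_among_samples(samples):
    """Sampled partition with the smallest summed squared distance to all samples.

    Simplification: candidates are limited to the sampled partitions.
    """
    def cost(candidate):
        return sum(partition_distance(candidate, s) ** 2 for s in samples)
    return min(samples, key=cost)


if __name__ == "__main__":
    # Toy stand-in for partitions sampled by an MCMC run (four individuals).
    sampled = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 1, 1]]
    print(partition_distance([0, 0, 1, 1, 2], [1, 1, 0, 0, 0]))
    print(mean_partition_among_samples(sampled))
```

Because the distance depends only on which individuals are grouped together, relabeling the populations leaves it unchanged, so this kind of summary is unaffected by label switching across MCMC samples.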
