Characterization of a Bayesian genetic clustering algorithm based on a Dirichlet process prior and comparison among Bayesian clustering methods

Background: A Bayesian approach based on a Dirichlet process (DP) prior is useful for inferring genetic population structures because it can infer the number of populations and the assignment of individuals simultaneously. However, the properties of the DP prior method are not well understood, and therefore, the use of this method is relatively uncommon. We characterized the DP prior method to increase its practical use.

Results: First, we evaluated the usefulness of the sequentially-allocated merge-split (SAMS) sampler, which is a technique for improving the mixing of Markov chain Monte Carlo algorithms. Although this sampler has been implemented in a preceding program, HWLER, its effectiveness has not been investigated. We showed that this sampler was effective for population structure analysis. Implementation of this sampler was useful with regard to the accuracy of inference and computational time. Second, we examined the effect of a hyperparameter for the prior distribution of allele frequencies and showed that the specification of this parameter was important and could be resolved by considering the parameter as a variable. Third, we compared the DP prior method with other Bayesian clustering methods and showed that the DP prior method was suitable for data sets with unbalanced sample sizes among populations. In contrast, although current popular algorithms for population structure analysis, such as those implemented in STRUCTURE, were suitable for data sets with uniform sample sizes, inferences with these algorithms for unbalanced sample sizes tended to be less accurate than those with the DP prior method.

Conclusions: The clustering method based on the DP prior was found to be useful because it can infer the number of populations and simultaneously assign individuals into populations, and it is suitable for data sets with unbalanced sample sizes among populations.
We present a novel program, DPART, which implements the SAMS sampler and can treat the hyperparameter for the prior distribution of allele frequencies as a variable.
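To illustrate why a DP prior lets the number of populations be inferred rather than fixed in advance, the following minimal sketch draws partitions of individuals from the Chinese restaurant process, the standard sequential representation of the partition distribution induced by a DP. It covers the prior only, with no genotype likelihood and no SAMS moves, and it is not the DPART or HWLER implementation. The function names and the concentration parameter `alpha` are assumptions introduced here for illustration.

```python
import random
from collections import Counter


def sample_crp_partition(n_individuals, alpha, rng):
    """Draw one partition of individuals from a Chinese restaurant process.

    alpha (> 0) is the DP concentration parameter; this name is an assumption
    of the sketch, not taken from the paper.
    """
    assignments = []    # cluster label of each individual, in order of arrival
    cluster_sizes = []  # current number of individuals in each cluster
    for _ in range(n_individuals):
        # An existing cluster is chosen with weight equal to its size;
        # a new cluster is opened with weight alpha.
        weights = cluster_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(cluster_sizes):
            cluster_sizes.append(1)
        else:
            cluster_sizes[k] += 1
        assignments.append(k)
    return assignments


def prior_on_number_of_clusters(n_individuals, alpha, n_draws=2000, seed=0):
    """Monte Carlo estimate of the prior distribution of the cluster count."""
    rng = random.Random(seed)
    counts = Counter(
        len(set(sample_crp_partition(n_individuals, alpha, rng)))
        for _ in range(n_draws)
    )
    return {k: c / n_draws for k, c in sorted(counts.items())}


if __name__ == "__main__":
    for alpha in (0.1, 1.0, 5.0):
        print(alpha, prior_on_number_of_clusters(100, alpha, n_draws=500))
```

Running the script for several values of `alpha` shows the prior mass shifting toward more clusters as `alpha` grows. In a full clustering method this prior would be combined with a likelihood for the multilocus genotypes.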
