StructHDP: automatic inference of number of clusters and population structure from admixed genotype data

Motivation: Clustering of genotype data is an important way of understanding similarities and differences between populations. A summary of populations through clustering allows us to make inferences about the evolutionary history of the populations. Many methods have been proposed to perform clustering on multilocus genotype data. However, most of these methods do not directly address the question of how many clusters the data should be divided into and leave that choice to the user. Methods: We present StructHDP, which is a method for automatically inferring the number of clusters from genotype data in the presence of admixture. Our method is an extension of two existing methods, Structure and Structurama. Using a Hierarchical Dirichlet Process (HDP), we model the presence of admixture of an unknown number of ancestral populations in a given sample of genotype data. We use a Gibbs sampler to perform inference on the resulting model and infer the ancestry proportions and the number of clusters that best explain the data. Results: To demonstrate our method, we simulated data from an island model using the neutral coalescent. Comparing the results of StructHDP with Structurama shows the utility of combining HDPs with the Structure model. We used StructHDP to analyze a dataset of 155 Taita thrush, Turdus helleri, which has been previously analyzed using Structure and Structurama. StructHDP correctly picks the optimal number of populations to cluster the data. The clustering based on the inferred ancestry proportions also agrees with that inferred using Structure for the optimal number of populations. We also analyzed data from 1048 individuals from the Human Genome Diversity project from 53 world populations. We found that the clusters obtained correspond with major geographical divisions of the world, which is in agreement with previous analyses of the dataset. Availability: StructHDP is written in C++. The code will be available for download at http://www.sailing.cs.cmu.edu/structhdp. Contact: suyash@cs.cmu.edu; epxing@cs.cmu.edu

[1]  J. Pella,et al.  The Gibbs and splitmerge sampler for population mixture analysis from genetic data with incomplete baselines , 2006 .

[2]  J. Huelsenbeck,et al.  Inference of Population Structure Under a Dirichlet Process Model , 2007, Genetics.

[3]  H. Akaike A new look at the statistical model identification , 1974 .

[4]  T. Ferguson A Bayesian Analysis of Some Nonparametric Problems , 1973 .

[5]  Richard R. Hudson,et al.  Generating samples under a Wright-Fisher neutral model of genetic variation , 2002, Bioinform..

[6]  D. White,et al.  Constructive combinatorics , 1986 .

[7]  N. Risch,et al.  Estimation of individual admixture: Analytical and study design considerations , 2005, Genetic epidemiology.

[8]  Sohini Ramachandran,et al.  Support from the relationship of genetic and geographic distance in human populations for a serial founder effect originating in Africa. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[9]  D. Reich,et al.  Population Structure and Eigenanalysis , 2006, PLoS genetics.

[10]  David H. Alexander,et al.  Fast model-based estimation of ancestry in unrelated individuals. , 2009, Genome research.

[11]  R. Cann The history and geography of human genes , 1995, The Journal of Asian Studies.

[12]  M. Escobar,et al.  Bayesian Density Estimation and Inference Using Mixtures , 1995 .

[13]  D. F. Roberts,et al.  The History and Geography of Human Genes , 1996 .

[14]  M. Stephens,et al.  Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. , 2003, Genetics.

[15]  P. Donnelly,et al.  Inference of population structure using multilocus genotype data. , 2000, Genetics.

[16]  Amit R. Indap,et al.  Genes mirror geography within Europe , 2008, Nature.

[17]  L. Lens,et al.  Genetic variability and gene flow in the globally, critically-endangered Taita thrush , 2000, Conservation Genetics.

[18]  Michael I. Jordan,et al.  Hierarchical Dirichlet Processes , 2006 .

[19]  E. Xing,et al.  mStruct: Inference of Population Structure in Light of Both Genetic Admixing and Allele Mutations , 2009, Genetics.

[20]  Yee Whye Teh,et al.  Collapsed Variational Inference for HDP , 2007, NIPS.

[21]  M. Feldman,et al.  Genetic Structure of Human Populations , 2002, Science.

[22]  G. Schwarz Estimating the Dimension of a Model , 1978 .