Inference of Population Structure using Dense Haplotype Data

The advent of genome-wide dense variation data provides an opportunity to investigate ancestry in unprecedented detail, but presents new statistical challenges. We propose a novel inference framework that aims to efficiently capture information on population structure provided by patterns of haplotype similarity. Each individual in a sample is considered in turn as a recipient, whose chromosomes are reconstructed using chunks of DNA donated by the other individuals. Results of this “chromosome painting” can be summarized as a “coancestry matrix,” which directly reveals key information about ancestral relationships among individuals. If markers are viewed as independent, we show that this matrix almost completely captures the information used by both standard Principal Components Analysis (PCA) and model-based approaches such as STRUCTURE in a unified manner. Furthermore, when markers are in linkage disequilibrium, the matrix combines information across successive markers to increase the ability to discern fine-scale population structure using PCA. In parallel, we have developed an efficient model-based approach to identify discrete populations using this matrix, which offers advantages over PCA in terms of interpretability and over existing clustering algorithms in terms of speed, number of separable populations, and sensitivity to subtle population structure. We analyse Human Genome Diversity Panel data for 938 individuals and 641,000 markers, and we identify 226 populations reflecting differences on continental, regional, local, and family scales. We present multiple lines of evidence that, while many methods capture similar information among strongly differentiated groups, more subtle population structure in human populations is consistently present at a much finer level than currently available geographic labels and is only captured by the haplotype-based approach. 
The software used for this article, ChromoPainter and fineSTRUCTURE, is available from http://www.paintmychromosomes.com/.
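
As a rough illustration of the independent-marker case described above, the following toy sketch builds a coancestry-style matrix from a small set of binary haplotypes. It is not ChromoPainter: the function name `unlinked_coancestry` and the copying rule (each site is copied uniformly at random from the other haplotypes carrying the same allele, and sites with no matching donor are skipped) are simplifying assumptions made for this example only.

```python
import numpy as np

def unlinked_coancestry(haps):
    """Toy coancestry matrix for unlinked markers (illustrative assumption).

    haps: (n, L) array of 0/1 alleles, one haplotype per individual.
    Entry [i, j] is the expected number of sites that recipient i copies
    from donor j when each site is copied uniformly at random from the
    other haplotypes that carry the same allele at that site.
    """
    haps = np.asarray(haps)
    n, _ = haps.shape
    co = np.zeros((n, n))
    for i in range(n):
        # Donors matching the recipient's allele at each site.
        match = haps == haps[i]
        match[i] = False  # an individual cannot donate to itself
        counts = match.sum(axis=0)
        # Assumption: sites with no matching donor contribute nothing.
        informative = counts > 0
        co[i] = (match[:, informative] / counts[informative]).sum(axis=1)
    return co

# Two pairs of similar haplotypes: (0, 1) and (2, 3).
example_haps = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
])
example_co = unlinked_coancestry(example_haps)
print(example_co)
```

In this toy input, each recipient draws most of its sites from the haplotype it most resembles, which is the kind of block structure a clustering method would then look for. The linked model in the paper instead combines information across successive markers, which this sketch does not attempt.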
