Fast model-based estimation of ancestry in unrelated individuals.

Population stratification has long been recognized as a confounding factor in genetic association studies. Estimated ancestries, derived from multi-locus genotype data, can be used to perform a statistical correction for population stratification. One popular technique for estimation of ancestry is the model-based approach embodied by the widely applied program structure. Another approach, implemented in the program EIGENSTRAT, relies on Principal Component Analysis rather than model-based estimation and does not directly deliver admixture fractions. EIGENSTRAT has gained in popularity in part owing to its remarkable speed in comparison to structure. We present a new algorithm and a program, ADMIXTURE, for model-based estimation of ancestry in unrelated individuals. ADMIXTURE adopts the likelihood model embedded in structure. However, ADMIXTURE runs considerably faster, solving problems in minutes that take structure hours. In many of our experiments, we have found that ADMIXTURE is almost as fast as EIGENSTRAT. The runtime improvements of ADMIXTURE rely on a fast block relaxation scheme using sequential quadratic programming for block updates, coupled with a novel quasi-Newton acceleration of convergence. Our algorithm also runs faster and with greater accuracy than the implementation of an Expectation-Maximization (EM) algorithm incorporated in the program FRAPPE. Our simulations show that ADMIXTURE's maximum likelihood estimates of the underlying admixture coefficients and ancestral allele frequencies are as accurate as structure's Bayesian estimates. On real-world data sets, ADMIXTURE's estimates are directly comparable to those from structure and EIGENSTRAT. Taken together, our results show that ADMIXTURE's computational speed opens up the possibility of using a much larger set of markers in model-based ancestry estimation and that its estimates are suitable for use in correcting for population stratification in association studies.
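To make the model-based approach concrete, the following minimal sketch evaluates a binomial admixture log-likelihood and applies a plain EM update of the kind the abstract attributes to FRAPPE. It is not ADMIXTURE's block relaxation scheme with sequential quadratic programming and quasi-Newton acceleration. The likelihood form, the names `g` (allele counts 0, 1, or 2), `q` (admixture fractions), `f` (ancestral allele frequencies), the clamp `eps`, and the toy data are all assumptions introduced for illustration; none of them comes from the abstract.

```python
import math

def log_likelihood(g, q, f):
    """Assumed binomial log-likelihood: each of the two allele copies of
    individual i at marker j is allele 1 with probability sum_k q[i][k]*f[k][j]."""
    total = 0.0
    for i, row in enumerate(g):
        for j, g_ij in enumerate(row):
            p = sum(q[i][k] * f[k][j] for k in range(len(f)))
            total += g_ij * math.log(p) + (2 - g_ij) * math.log(1.0 - p)
    return total

def em_step(g, q, f, eps=1e-9):
    """One EM update of q and f (illustrative; eps is a hypothetical clamp
    that keeps frequencies strictly inside (0, 1))."""
    n, m, K = len(g), len(g[0]), len(f)
    q_new = [[0.0] * K for _ in range(n)]
    num = [[0.0] * m for _ in range(K)]
    den = [[0.0] * m for _ in range(K)]
    for i in range(n):
        for j in range(m):
            p1 = sum(q[i][k] * f[k][j] for k in range(K))
            p0 = 1.0 - p1
            for k in range(K):
                # Expected number of allele-1 and allele-0 copies drawn from population k.
                w1 = g[i][j] * q[i][k] * f[k][j] / p1
                w0 = (2 - g[i][j]) * q[i][k] * (1.0 - f[k][j]) / p0
                q_new[i][k] += (w1 + w0) / (2.0 * m)
                num[k][j] += w1
                den[k][j] += w1 + w0
    f_new = [[min(1.0 - eps, max(eps, num[k][j] / den[k][j])) for j in range(m)]
             for k in range(K)]
    return q_new, f_new

# Toy data (hypothetical): 4 individuals, 3 markers, K = 2 ancestral populations.
g = [[2, 2, 0], [2, 1, 0], [0, 1, 2], [0, 0, 2]]
q = [[0.6, 0.4], [0.55, 0.45], [0.45, 0.55], [0.4, 0.6]]
f = [[0.7, 0.6, 0.3], [0.3, 0.4, 0.7]]

history = [log_likelihood(g, q, f)]
for _ in range(20):
    q, f = em_step(g, q, f)
    history.append(log_likelihood(g, q, f))
```

In this sketch each row of `q` stays on the simplex by construction, and the log-likelihood recorded in `history` should not decrease from one EM iteration to the next, apart from the negligible effect of the clamp. Plain EM of this kind typically needs many iterations to converge, which is the motivation for faster update schemes and convergence acceleration.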
