Ancestry inference using principal component analysis and spatial analysis: a distance-based analysis to account for population substructure

Background: Accurate inference of genetic ancestry is of fundamental interest to many biomedical, forensic, and anthropological research areas. Genetic ancestry may be related to genetic disease risk. In a genetic association study, failing to account for differences in genetic ancestry between cases and controls may also lead to false-positive results. Although a number of strategies for inferring genetic ancestry and accounting for its confounding effects are available, applying them to large studies (tens of thousands of samples) is challenging. The goal of this study is to develop an approach for inferring the genetic ancestry of samples with unknown ancestry among closely related populations and to provide accurate estimates of ancestry for application to large-scale studies.

Methods: In this study we developed a novel distance-based approach, Ancestry Inference using Principal component analysis and Spatial analysis (AIPS), that incorporates an Inverse Distance Weighted (IDW) interpolation method from spatial analysis to assign individuals to population memberships.

Results: We demonstrate the benefits of AIPS in analyzing population substructure, specifically in comparison with the four most commonly used tools (EIGENSTRAT, STRUCTURE, fastSTRUCTURE, and ADMIXTURE), using genotype data from various intra-European panels and European-Americans. While these commonly used tools performed poorly in inferring ancestry from a large number of subpopulations, AIPS accurately distinguished variation between and within subpopulations.

Conclusions: Our results show that AIPS can be applied to large-scale data sets to discriminate the modest variability among intra-continental populations as well as to characterize inter-continental variation. The method we developed will protect against spurious associations when mapping the genetic basis of a disease. Our approach is a more accurate and computationally efficient method for inferring genetic ancestry in large-scale genetic studies.
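
The distance-based idea described in the Methods can be illustrated with a minimal sketch. This is not the published AIPS implementation: the function names (`pca_scores`, `idw_membership`), the use of two principal components, the IDW power of 2, the joint PCA over reference and query samples, and the simulated genotypes are all assumptions made for illustration. Each reference sample contributes to its own population with weight 1/d^power, where d is its Euclidean distance to the query sample in principal component space, and the weights are normalized into membership scores.

```python
import numpy as np

def pca_scores(genotypes, n_components=2):
    """Project samples onto the top principal components (SVD of column-centered data)."""
    centered = genotypes - genotypes.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def idw_membership(ref_scores, ref_labels, query_scores, power=2.0, eps=1e-12):
    """Inverse-distance-weighted membership of each query sample in each population.

    `power` and `eps` are illustrative choices, not values taken from the paper.
    """
    ref_labels = np.asarray(ref_labels)
    populations = np.unique(ref_labels)  # sorted unique population names
    # Pairwise Euclidean distances in PC space, shape (n_query, n_ref).
    diff = query_scores[:, None, :] - ref_scores[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Closer reference samples get larger weights; eps avoids division by zero.
    weights = 1.0 / np.maximum(dist, eps) ** power
    # Sum the weights per population, then normalize each row to sum to 1.
    onehot = (ref_labels[:, None] == populations[None, :]).astype(float)
    membership = weights @ onehot
    membership /= membership.sum(axis=1, keepdims=True)
    return populations, membership

# Toy data (simulated, hypothetical): two populations with slightly different
# allele frequencies at 200 SNPs, genotypes coded as 0/1/2 allele counts.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.1, size=200), 0.05, 0.95)
geno_a = rng.binomial(2, freq_a, size=(30, 200))
geno_b = rng.binomial(2, freq_b, size=(30, 200))
genotypes = np.vstack([geno_a, geno_b]).astype(float)

# PCA is run jointly on all samples; the last 5 of each group act as "unknowns".
scores = pca_scores(genotypes, n_components=2)
ref_idx = np.r_[0:25, 30:55]
query_idx = np.r_[25:30, 55:60]
ref_labels = ["A"] * 25 + ["B"] * 25

populations, membership = idw_membership(scores[ref_idx], ref_labels, scores[query_idx])
assigned = populations[membership.argmax(axis=1)]
print(assigned)
print(membership.round(3))
```

Because every reference sample contributes a weight, the membership scores are continuous rather than hard assignments; taking the `argmax` per row, as in the last lines, is only one possible decision rule.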
