Clustering by genetic ancestry using genome-wide SNP data

Background: Population stratification can cause spurious associations in a genome-wide association study (GWAS). It occurs when differences in allele frequencies of single nucleotide polymorphisms (SNPs) between cases and controls are due to differences in ancestry rather than to the trait of interest. Principal components analysis (PCA) is the established approach for detecting population substructure in genome-wide data and for adjusting genetic association tests for stratification by including the top principal components (PCs) in the analysis. An alternative is genetic matching of cases and controls, which, however, requires well-defined population strata so that cases and controls can be selected appropriately.

Results: We developed a novel algorithm to cluster individuals into groups with similar ancestral backgrounds based on the principal components computed by PCA. We demonstrate the effectiveness of our algorithm in real and simulated data, and show that matching cases and controls using the clusters assigned by the algorithm substantially reduces population stratification bias. Through simulation, we show that the power of our method is higher than that of adjustment for PCs in certain situations.

Conclusions: In addition to reducing population stratification bias and improving power, matching creates a clean dataset free of population stratification, which can then be used to build prediction models without including variables to adjust for ancestry. The cluster assignments also allow genetic heterogeneity to be estimated by examining cluster-specific effects.
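The general workflow described in the abstract (compute principal components from genotypes, cluster individuals on those components, then match cases and controls within clusters) can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not specify: plain k-means with a deterministic farthest-first initialization stands in for the clustering step, 1:1 random matching within each cluster stands in for the matching step, and the number of clusters is assumed known. All function names (`simulate_two_populations`, `top_pcs`, `kmeans`, `match_within_clusters`) and the simulated data are hypothetical and serve only as illustration.

```python
import numpy as np


def simulate_two_populations(n_per_pop=40, n_snps=200, seed=1):
    """Hypothetical toy data: two populations with unrelated allele frequencies.

    Population A has 75% cases and population B has 25% cases, so ancestry
    is confounded with case status (a stratified design).
    """
    rng = np.random.default_rng(seed)
    freq_a = rng.uniform(0.1, 0.9, n_snps)
    freq_b = rng.uniform(0.1, 0.9, n_snps)
    geno_a = rng.binomial(2, freq_a, size=(n_per_pop, n_snps))
    geno_b = rng.binomial(2, freq_b, size=(n_per_pop, n_snps))
    genotypes = np.vstack([geno_a, geno_b])  # samples x SNPs, coded 0/1/2
    is_case = np.zeros(2 * n_per_pop, dtype=bool)
    is_case[: (3 * n_per_pop) // 4] = True                    # cases in A
    is_case[n_per_pop : n_per_pop + n_per_pop // 4] = True    # cases in B
    return genotypes, is_case


def top_pcs(genotypes, n_pcs=2):
    """Principal component scores of a standardized samples x SNPs matrix."""
    g = np.asarray(genotypes, dtype=float)
    g = g - g.mean(axis=0)               # center each SNP
    sd = g.std(axis=0)
    g = g[:, sd > 0] / sd[sd > 0]        # drop monomorphic SNPs, scale the rest
    u, s, _ = np.linalg.svd(g, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


def kmeans(points, k, n_iter=100):
    """Plain k-means (a stand-in clustering method, not the paper's)."""
    points = np.asarray(points, dtype=float)
    # Deterministic farthest-first initialization of the k centers.
    first = np.argmax(((points - points.mean(axis=0)) ** 2).sum(axis=1))
    centers = [points[first]]
    while len(centers) < k:
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def match_within_clusters(labels, is_case, seed=0):
    """Keep equal numbers of cases and controls in every cluster (1:1)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    is_case = np.asarray(is_case, dtype=bool)
    keep = []
    for c in np.unique(labels):
        cases = np.flatnonzero((labels == c) & is_case)
        controls = np.flatnonzero((labels == c) & ~is_case)
        n = min(len(cases), len(controls))
        keep.extend(rng.permutation(cases)[:n])
        keep.extend(rng.permutation(controls)[:n])
    return np.sort(np.array(keep, dtype=int))


if __name__ == "__main__":
    genotypes, is_case = simulate_two_populations()
    labels = kmeans(top_pcs(genotypes, n_pcs=2), k=2)
    matched = match_within_clusters(labels, is_case)
    print("samples kept after matching:", len(matched))
```

In this sketch the matched subset has the same case-to-control ratio in every cluster, so cluster membership no longer predicts case status. Choosing the number of clusters and the number of PCs is left open here; both are assumptions of the example rather than choices taken from the paper.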
