GRAF-pop: A Fast Distance-Based Method To Infer Subject Ancestry from Multiple Genotype Datasets Without Principal Components Analysis

Inferring subject ancestry using genetic data is an important step in genetic association studies, required for dealing with population stratification. It has become more challenging to infer subject ancestry quickly and accurately now that large amounts of genotype data, collected from millions of subjects by thousands of studies using different methods, are accessible to researchers from repositories such as the database of Genotypes and Phenotypes (dbGaP) at the National Center for Biotechnology Information (NCBI). Study-reported populations submitted to dbGaP are often not harmonized across studies or may be missing. Widely used methods for ancestry prediction assume that most markers are genotyped in all subjects, but this assumption is unrealistic if one wants to combine studies that used different genotyping platforms. To provide ancestry inference and visualization across studies, we developed GRAF-pop, a new ancestry prediction method that is robust to missing genotypes and allows researchers to visualize predicted population structure in color and in three dimensions. When genotypes are dense, GRAF-pop is comparable in quality and running time to the existing ancestry inference methods EIGENSTRAT, FastPCA, and FlashPCA2, all of which rely on principal components analysis (PCA). When genotypes are not dense, GRAF-pop gives much better ancestry predictions than the PCA-based methods. GRAF-pop employs basic geometric and probabilistic methods; the visualized ancestry predictions have a natural geometric interpretation, which is lacking in PCA-based methods. Since February 2018, GRAF-pop has been incorporated into the dbGaP quality control process to identify inconsistencies between study-reported and computationally predicted populations and to provide harmonized population values, based on marker genotypes, in all new dbGaP submissions amenable to population prediction. Plots of summary population predictions produced by GRAF-pop are available on dbGaP study pages, and the software is available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/Software.cgi.
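
To illustrate why a distance-based approach can tolerate missing genotypes, the following minimal sketch averages a per-marker distance over only the markers that were actually genotyped for a subject. It is not GRAF-pop's actual statistic, which the abstract does not specify; the function names, the squared-difference distance, and the toy reference allele frequencies for "pop_A" and "pop_B" are all assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence


def mean_sq_distance(genotypes: Sequence[Optional[int]],
                     ref_freqs: Sequence[float]) -> float:
    """Average squared gap between a subject's allele fraction (dosage / 2)
    and a reference allele frequency, over genotyped markers only.

    Missing genotypes are encoded as None and are skipped, so subjects typed
    on different platforms are still scored on a comparable per-marker scale.
    (Hypothetical distance, chosen for illustration.)
    """
    total = 0.0
    used = 0
    for dosage, freq in zip(genotypes, ref_freqs):
        if dosage is None:  # marker not genotyped for this subject
            continue
        total += (dosage / 2.0 - freq) ** 2
        used += 1
    if used == 0:
        raise ValueError("subject has no genotyped markers")
    return total / used


def nearest_population(genotypes: Sequence[Optional[int]],
                       panels: Dict[str, Sequence[float]]) -> str:
    """Return the name of the reference panel with the smallest distance."""
    return min(panels, key=lambda name: mean_sq_distance(genotypes, panels[name]))


# Toy reference allele frequencies at four markers (made-up values).
panels = {
    "pop_A": [0.9, 0.1, 0.8, 0.2],
    "pop_B": [0.1, 0.9, 0.2, 0.8],
}

# Alternate-allele counts (0, 1, or 2); the second marker is missing.
subject = [2, None, 2, 0]

print(nearest_population(subject, panels))
```

Because the distance is normalized by the number of markers used, a subject genotyped on a sparse platform is not penalized for the markers that platform lacks, whereas PCA needs a mostly complete genotype matrix.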
