Robust Genome-Wide Ancestry Inference for Heterogeneous Datasets and Ancestry Facial Imaging based on the 1000 Genomes Project

Accurate inference of genomic ancestry is critically important in human genetics, epidemiology, and related fields. Geneticists today have access to multiple heterogeneous population-based datasets from studies collected under different protocols. Joint analyses of these datasets therefore require robust and consistent inference of ancestry, for which a common strategy is to use an ancestry space generated from a reference dataset. However, such a strategy is sensitive to batch artifacts introduced by different protocols. In this work, we propose a novel robust genome-wide ancestry inference method, referred to as SUGIBS, based on an unnormalized genomic (UG) relationship matrix whose spectral (S) decomposition is generalized by an Identity-by-State (IBS) similarity degree matrix. SUGIBS robustly constructs an ancestry space from a single reference dataset and provides a robust projection of new samples from different studies. In experiments and simulations, we show that SUGIBS is robust against individual outliers and batch artifacts introduced by different genotyping protocols. The performance of SUGIBS is equivalent to that of the widely used principal component analysis (PCA) on normalized genotype data in revealing the underlying structure of an admixed population and in adjusting for false positive findings in a case-control admixed GWAS. We applied SUGIBS to the 1000 Genomes Project as a reference, in combination with a large heterogeneous dataset containing auxiliary 3D facial images, to predict population-stratified average faces, or ancestry faces. In addition, we projected eight ancient DNA profiles into the 1000 Genomes ancestry space and reconstructed their ancestry faces. Based on the visually strong and recognizable human facial phenotype, comprehensive facial illustrations of the populations embedded in the 1000 Genomes Project are provided.
Furthermore, ancestry facial imaging has important applications in personalized and precision medicine along with forensic and archeological DNA phenotyping.

Author Summary

Estimates of individual-level genomic ancestry are routinely used in human genetics, epidemiology, and related fields. The analysis of population structure and genomic ancestry can yield significant insights into modern and ancient population dynamics, allowing us to address questions regarding the timing of admixture events and the numbers and identities of the parental source populations. Unrecognized or cryptic population structure is also an important confounder to correct for in genome-wide association studies (GWAS). However, it remains challenging to work with heterogeneous datasets from multiple studies collected by different laboratories with diverse genotyping and imputation protocols. This work presents a new approach and an accompanying open-source software toolbox that facilitate a robust integrative analysis of population structure and genomic ancestry estimates for heterogeneous datasets. Given that visually evident and easily recognizable patterns of human facial characteristics covary with genomic ancestry, we can generate predicted ancestry faces at both the population and individual levels, as we illustrate for the 26 populations of the 1000 Genomes Project and for eight eminent ancient-DNA profiles, respectively.
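The abstract describes SUGIBS as a spectral decomposition of an unnormalized genomic relationship matrix, generalized by an IBS similarity degree matrix. One way to read that description is as the generalized eigenproblem G u = λ D u, where G = X Xᵀ is built from raw 0/1/2 genotypes and D is the diagonal matrix of row sums of the pairwise IBS similarity matrix. The following is a minimal sketch under that assumption, not the authors' toolbox: the function names, the IBS scoring (2 minus the absolute genotype difference, summed over SNPs), and the choice of which components to keep are all illustrative.

```python
import numpy as np

def ibs_similarity(a, b):
    """Pairwise IBS similarity: shared alleles (2 - |g1 - g2|) summed over SNPs."""
    diff = np.abs(a[:, None, :].astype(float) - b[None, :, :].astype(float))
    return (2.0 - diff).sum(axis=2)

def sugibs_like_fit(X, k):
    """Solve G u = lambda D u with G = X X^T and D = diag(row sums of IBS).

    Hypothetical helper: uses the SVD of D^{-1/2} X, whose left singular
    vectors w give generalized eigenvectors u = D^{-1/2} w.
    """
    X = np.asarray(X, dtype=float)
    d = ibs_similarity(X, X).sum(axis=1)      # diagonal of the degree matrix D
    U, s, _ = np.linalg.svd(X / np.sqrt(d)[:, None], full_matrices=False)
    scores = U[:, :k] / np.sqrt(d)[:, None]   # generalized eigenvectors
    eigvals = s[:k] ** 2                      # generalized eigenvalues
    return scores, eigvals, d

# Toy data: 6 samples, 40 SNPs, genotypes coded 0/1/2 (assumed coding).
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 40))
scores, eigvals, d = sugibs_like_fit(X, k=3)

# Check that the generalized eigen-equation holds for the toy data.
G = X @ X.T
residual = G @ scores - (d[:, None] * scores) * eigvals
print(np.abs(residual).max())
```

Going through the SVD of the degree-scaled genotype matrix avoids forming the sample-by-sample matrix G explicitly during the fit; G is built here only to check the result.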
