Large-scale inference of population structure in presence of missingness using PCA

MOTIVATION Principal component analysis (PCA) is a commonly used tool in genetics to capture and visualize population structure. Due to technological advances in sequencing, such as the widely used non-invasive prenatal test, massive datasets of ultra-low coverage sequencing are being generated. These datasets are characterized by a large amount of missing genotype information. RESULTS We present EMU, a method for inferring population structure in the presence of rampant non-random missingness. We show through simulations that several commonly used PCA methods cannot handle missing data arising from various sources, which leads to biased results because individuals are projected into the PC space based on their amount of missingness. EMU is more accurate than an existing method that also accommodates missingness, while remaining competitive in speed. We further tested EMU on around 100K individuals from the Phase 1 dataset of the Chinese Millionome Project, who were shallowly sequenced to around 0.08x. From these data we are able to capture the population structure of the Han Chinese and to reproduce previous analysis in a matter of CPU hours instead of CPU years. EMU's capability to accurately infer population structure in the presence of missingness will be of increasing importance with the rising number of large-scale genetic datasets. AVAILABILITY EMU is written in Python and is freely available at https://github.com/rosemeis/emu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
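
To make the idea of PCA under missingness concrete, the following is a minimal sketch of a generic EM-style iterative SVD imputation, in which missing genotypes are repeatedly replaced by their low-rank reconstruction instead of a fixed fill value. It is an illustration under stated assumptions, not EMU's actual implementation: the function name `em_pca_impute`, its parameters, and the use of a full NumPy SVD are all hypothetical choices made for this example.

```python
import numpy as np

def em_pca_impute(G, mask, k=2, n_iter=50):
    """Illustrative EM-style PCA with missing data (hypothetical helper, not EMU's API).

    G    : (n_individuals, n_sites) genotype matrix; values at missing entries are ignored
    mask : boolean array of the same shape, True where a genotype was observed
    k    : number of principal components used for the low-rank reconstruction
    """
    # Per-site mean computed from observed entries only.
    n_obs = np.maximum(mask.sum(axis=0), 1)
    f = np.where(mask, G, 0.0).sum(axis=0) / n_obs

    # Initialise missing entries with the site mean.
    X = np.where(mask, G, f)

    for _ in range(n_iter):
        # Rank-k reconstruction of the centred, currently imputed matrix.
        U, s, Vt = np.linalg.svd(X - f, full_matrices=False)
        low_rank = (U[:, :k] * s[:k]) @ Vt[:k] + f
        # Keep observed genotypes fixed; update only the missing entries.
        X = np.where(mask, G, low_rank)

    # Final principal component scores from the imputed matrix.
    U, s, Vt = np.linalg.svd(X - f, full_matrices=False)
    return U[:, :k] * s[:k], X
```

Calling `em_pca_impute(G, mask, k=2)` returns the top-`k` principal component scores and the imputed matrix. A full SVD per iteration would not scale to datasets of the size described above; a truncated or randomized SVD would be the natural substitute in practice.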
