Detecting selection in low-coverage high-throughput sequencing data using principal component analysis

Background: Identifying signatures of selection between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger samples from populations with similar ancestries have become increasingly common. This has created a need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis to infer selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data.

Materials and methods: We have extended two principal component analysis-based selection statistics to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry to detect signals of selection in samples with continuous population structure.

Results: Here, we present two selection statistics, which we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, making it possible to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries and show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high-quality called genotypes.

Conclusion: We show that selection scans of low-coverage sequencing data from populations with similar ancestry perform on par with those obtained from high-quality genotype data. Moreover, we demonstrate that PCAngsd outperforms selection statistics obtained from called genotypes from low-coverage sequencing data without the need for ad-hoc filtering.
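
To make the idea of a principal component analysis-based selection statistic on genotype likelihood data more concrete, the following is a minimal sketch and not the PCAngsd implementation. It assumes that genotype likelihoods are converted to posterior expected dosages under a Hardy-Weinberg prior with a single known allele frequency per site, and that the statistic is the squared SNP loading on each principal component scaled by the number of sites, in the spirit of PCA-based scans that compare such a value to a chi-square distribution with one degree of freedom. The function names `expected_genotypes` and `pc_selection_stats`, the array layout, and the simulated likelihood values are all hypothetical choices made for illustration.

```python
import numpy as np


def expected_genotypes(gl, freq):
    """Posterior expected allele dosage per site and individual.

    gl   : array (M, N, 3), likelihoods P(reads | G = g) for g = 0, 1, 2
    freq : array (M,), allele frequency per site (assumed known here)
    Assumption: a Hardy-Weinberg prior from one population frequency,
    a simplification chosen for this sketch.
    """
    f = freq[:, None]
    prior = np.stack([(1 - f) ** 2, 2 * f * (1 - f), f ** 2], axis=-1)  # (M, 1, 3)
    post = gl * prior
    post = post / post.sum(axis=-1, keepdims=True)
    return post @ np.array([0.0, 1.0, 2.0])  # (M, N)


def pc_selection_stats(dosage, freq, k):
    """Squared SNP loadings on the top k PCs, scaled by the number of sites.

    Assumption: sites are polymorphic (0 < freq < 1), so the
    standardization below never divides by zero.
    """
    m = dosage.shape[0]
    scale = np.sqrt(2 * freq * (1 - freq))
    x = (dosage - 2 * freq[:, None]) / scale[:, None]  # standardized (M, N)
    u, _, _ = np.linalg.svd(x, full_matrices=False)    # u holds SNP loadings
    return m * u[:, :k] ** 2                           # (M, k)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    m, n = 200, 30
    freq = rng.uniform(0.1, 0.9, m)
    geno = rng.binomial(2, freq[:, None], size=(m, n))
    # Hypothetical noisy likelihoods: 0.8 for the true genotype, 0.1 otherwise.
    gl = np.full((m, n, 3), 0.1)
    np.put_along_axis(gl, geno[..., None], 0.8, axis=-1)
    stats = pc_selection_stats(expected_genotypes(gl, freq), freq, k=2)
    print(stats.shape)
```

Working with expected dosages instead of hard genotype calls lets uncertain sites contribute in proportion to their evidence, which is the motivation stated above for basing the scan on genotype likelihoods.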
