Genome Scans for Selection and Introgression Based on k-Nearest Neighbor Techniques

In recent years, genome-scan methods have been used extensively to detect local signatures of selection and introgression. Here, we introduce a series of versatile genome-scan methods based on non-parametric k-nearest neighbor (kNN) techniques that use pairwise fixation index (FST) estimates and pairwise nucleotide differences (dxy) as features. We simulated both positive directional selection and introgression under varying parameters, including recombination rate, population background history, the proportion of introgression, and the time of gene flow. We find that kNN-based methods perform remarkably well and yield stable results over almost the entire range of k. We provide a GitHub repository (pievos101/kNN-Genome-Scans) containing R source code that demonstrates how to apply the proposed methods to real-world genomic data using the population genomics R package PopGenome.
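To illustrate the general idea of scoring genomic windows with a kNN technique, the following is a minimal sketch in Python. It is not the authors' implementation (their R code lives in the repository named above). It assumes one simple kNN outlier score, the mean Euclidean distance of each window to its k nearest neighbors in a standardized (FST, dxy) feature space. The function name `knn_scores` and the toy window values are hypothetical and chosen only for illustration.

```python
import math


def knn_scores(features, k):
    """Mean Euclidean distance of each window to its k nearest neighbors.

    `features` is a list of equal-length numeric tuples, one per genomic
    window, e.g. (FST, dxy). Larger scores mark windows that sit far from
    the bulk of the genome in feature space.
    """
    n = len(features)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of windows")

    # Standardize each feature column (z-scores) so that FST and dxy,
    # which live on different scales, contribute comparably to distances.
    columns = list(zip(*features))
    means = [sum(col) / n for col in columns]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / n) or 1.0  # guard constant columns
        for col, m in zip(columns, means)
    ]
    scaled = [
        [(x - m) / s for x, m, s in zip(row, means, sds)] for row in features
    ]

    scores = []
    for i, point in enumerate(scaled):
        # Distances from this window to every other window, ascending.
        dists = sorted(
            math.dist(point, other) for j, other in enumerate(scaled) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical per-window (FST, dxy) values; window 6 is an invented outlier
# with elevated FST and reduced dxy. These are not data from the paper.
windows = [
    (0.10, 0.012), (0.12, 0.011), (0.09, 0.013), (0.11, 0.012),
    (0.10, 0.010), (0.13, 0.012), (0.55, 0.003), (0.11, 0.013),
]

scores = knn_scores(windows, k=3)
top_window = scores.index(max(scores))
print("highest-scoring window index:", top_window)
```

In practice the per-window FST and dxy values would come from a population genomics tool such as PopGenome. The brute-force distance computation here is quadratic in the number of windows, which is acceptable for a sketch but would likely need a spatial index for genome-wide scans.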
