pixy: Unbiased estimation of nucleotide diversity and divergence in the presence of missing data

Population genetic analyses often use summary statistics to describe patterns of genetic variation and provide insight into evolutionary processes. Among the most fundamental of these summary statistics are π and dXY, which are used to describe genetic diversity within and between populations, respectively. Here, we address a widespread issue in π and dXY calculation: systematic bias generated by missing data of various types. Many popular methods for calculating π and dXY operate on data encoded in the Variant Call Format (VCF), which condenses genetic data by omitting invariant sites. When calculating π and dXY using a VCF, it is often implicitly assumed that missing genotypes (including those at sites not represented in the VCF) are homozygous for the reference allele. Here, we show how this assumption can result in substantial downward bias in estimates of π and dXY that is directly proportional to the amount of missing data. We discuss the pervasive nature and importance of this problem in population genetics, and introduce a user-friendly UNIX command line utility, pixy, that solves this problem via an algorithm that generates unbiased estimates of π and dXY in the face of missing data. We compare pixy to existing methods using both simulated and empirical data, and show that pixy alone produces unbiased estimates of π and dXY regardless of the form or amount of missing data. In sum, our software solves a long-standing problem in applied population genetics and highlights the importance of properly accounting for missing data in population genetic analyses.
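The downward bias described above can be illustrated with a toy calculation. The sketch below is not pixy's implementation. It assumes haploid genotypes coded as 0 (reference), 1 (alternate), or None (missing), and the function names and example window are hypothetical. It compares two ways of computing π: counting pairwise differences only among genotypes that were actually called, versus first treating every missing genotype as the reference allele.

```python
from itertools import combinations

def pi_called_only(sites):
    """Pairwise differences divided by pairwise comparisons,
    counting only genotypes that were actually called (not None)."""
    diffs = 0
    comps = 0
    for genotypes in sites:
        called = [g for g in genotypes if g is not None]
        for a, b in combinations(called, 2):
            comps += 1
            diffs += (a != b)
    return diffs / comps if comps else float("nan")

def pi_missing_as_reference(sites, ref=0):
    """Same calculation after filling every missing genotype with the
    reference allele (the implicit assumption discussed in the abstract)."""
    filled = [[ref if g is None else g for g in genotypes] for genotypes in sites]
    return pi_called_only(filled)

# Hypothetical window: 4 sites x 4 haploid samples, fully observed.
true_window = [
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

# The same window with half of the sites entirely missing
# (for example, sites with no coverage that are absent from the VCF).
observed_window = [
    [0, 1, 0, 1],
    [None, None, None, None],
    [0, 0, 0, 0],
    [None, None, None, None],
]

pi_true = pi_called_only(true_window)
pi_aware = pi_called_only(observed_window)
pi_naive = pi_missing_as_reference(observed_window)

print(f"true pi:              {pi_true:.4f}")
print(f"called-only estimate: {pi_aware:.4f}")
print(f"missing-as-reference: {pi_naive:.4f}")
```

In this toy window, half of the sites are missing and the missing-as-reference estimate is half of the true value, while the called-only estimate matches it. The match is exact only because the missing sites were chosen to be representative of the window; with real data the called-only estimate would vary around the true value rather than equal it.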
