Characterizing bias in population genetic inferences from low-coverage sequencing data.

The site frequency spectrum (SFS) is of primary interest in population genetic studies because it compresses variation data into a simple summary from which many population genetic inferences can proceed. However, inferring the SFS from sequencing data is challenging: genotype calls from sequencing data are often inaccurate because of high error rates, and if this genotype uncertainty is not accounted for, it can lead to serious bias in downstream analyses based on the inferred SFS. Here, we compare two approaches to estimating the SFS from sequencing data. The first infers individual genotypes from aligned sequencing reads and then estimates the SFS from the inferred genotypes (call-based approach). The second estimates the SFS directly from aligned sequencing reads by maximum likelihood (direct estimation approach). We find that the SFS estimated by the direct estimation approach is unbiased even at low coverage, whereas the SFS estimated by the call-based approach becomes biased as coverage decreases. The direction of the bias in the call-based approach depends on the pipeline used to infer genotypes. Estimating genotypes by pooling individuals in a sample (multisample calling) results in underestimation of the number of rare variants, whereas estimating genotypes in each individual separately and merging them later (single-sample calling) leads to overestimation of rare variants. We characterize the impact of these biases on downstream analyses, such as demographic parameter estimation and genome-wide selection scans. Our work highlights that, depending on the pipeline used to infer the SFS, one can reach different conclusions in population genetic inference from the same data set. Thus, careful attention to the analysis pipeline and SFS estimation procedures is vital for population genetic inferences.
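
The contrast between the two approaches can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the pipeline used in this study. The function names `sfs_from_calls` and `sfs_by_em` are hypothetical. The call-based function tabulates derived allele counts from hard genotype calls. The direct function assumes that per-site likelihoods of the read data given each possible derived allele count have already been computed, and fits the SFS proportions with an expectation-maximization (EM) update, which is one common way to obtain a maximum likelihood estimate and may differ from the procedure used here.

```python
def sfs_from_calls(genotypes, n_individuals):
    """Call-based SFS: count sites by derived allele count.

    genotypes: list of sites; each site is a list of hard genotype calls
    (0, 1, or 2 derived alleles) for n_individuals diploid individuals.
    """
    sfs = [0] * (2 * n_individuals + 1)
    for site in genotypes:
        sfs[sum(site)] += 1
    return sfs


def sfs_by_em(site_likelihoods, n_iter=200, tol=1e-10):
    """Direct estimation sketch: ML estimate of SFS proportions by EM.

    site_likelihoods: list of sites; each site is a list whose k-th entry
    is an (assumed precomputed) likelihood of the reads at that site given
    k derived alleles in the sample. Every site is assumed to have positive
    likelihood under the current estimate.
    """
    n_cat = len(site_likelihoods[0])
    n_sites = len(site_likelihoods)
    p = [1.0 / n_cat] * n_cat  # start from a uniform spectrum
    for _ in range(n_iter):
        new_p = [0.0] * n_cat
        for lik in site_likelihoods:
            # E-step: posterior over allele counts at this site
            denom = sum(pk * lk for pk, lk in zip(p, lik))
            for k in range(n_cat):
                new_p[k] += p[k] * lik[k] / denom
        # M-step: average the posteriors across sites
        new_p = [x / n_sites for x in new_p]
        converged = max(abs(a - b) for a, b in zip(new_p, p)) < tol
        p = new_p
        if converged:
            break
    return p


# Toy data (invented for illustration): 4 sites, 2 diploid individuals
calls = [[0, 0], [0, 1], [1, 1], [2, 0]]
print(sfs_from_calls(calls, 2))

# Toy likelihoods with no uncertainty, so EM recovers the exact proportions
liks = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(sfs_by_em(liks))
```

In the call-based function, any error in a hard call propagates directly into the spectrum. In the EM sketch, each site contributes a weighted posterior over allele counts, so uncertainty at low coverage is carried through to the estimate rather than discarded.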
