Accounting for bias from sequencing error in population genetic estimates.

Sequencing error presents a significant challenge to population genetic analyses using low-coverage sequence in general and single-pass reads in particular. Bias in parameter estimates becomes severe when the level of polymorphism (signal) is low relative to the amount of error (noise). Choosing an arbitrary quality score cutoff yields biased estimates, particularly with newer, non-Sanger sequencing technologies that have different quality score distributions. We propose a rule of thumb to judge when a given threshold will lead to significant bias and suggest alternative approaches that reduce bias.
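The signal-to-noise comparison can be made concrete with a back-of-envelope calculation. The sketch below is an illustration under stated assumptions, not the rule of thumb proposed in the paper: it assumes single-pass reads (one read per individual per site), a uniform per-base error rate obtained from a Phred-style cutoff via p = 10^(-Q/10), and that every error creates a spurious singleton. Under those assumptions, errors inflate Watterson's per-site estimate of theta by roughly n * eps / a_n, where a_n is the sum of 1/i for i from 1 to n-1. All function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score to a per-base error probability."""
    return 10 ** (-q / 10)


def harmonic(n):
    """a_n = sum_{i=1}^{n-1} 1/i, the denominator of Watterson's estimator."""
    return sum(1.0 / i for i in range(1, n))


def watterson_inflation(n, eps):
    """Approximate additive inflation of per-site Watterson's theta.

    Assumption: each of the n reads covering a site is wrong with
    probability eps, and each error appears as a false segregating site.
    """
    return n * eps / harmonic(n)


def noise_to_signal(theta, n, q):
    """Ratio of error-induced inflation to the true per-site theta.

    Values well below 1 suggest the cutoff q is tolerable; values near
    or above 1 suggest the estimate is dominated by error.
    """
    return watterson_inflation(n, phred_to_error(q)) / theta


if __name__ == "__main__":
    # Hypothetical inputs: theta = 0.001 per site, 10 sampled sequences.
    for q in (20, 30, 40):
        print(q, noise_to_signal(theta=0.001, n=10, q=q))
```

With these hypothetical inputs, a cutoff of Q20 gives an inflation many times larger than theta itself, which is the low-polymorphism regime the abstract warns about; raising the cutoff shrinks the ratio by a factor of ten per ten Phred units.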
