Accounting for uncertainty in DNA sequencing data.

Science is defined in part by an honest exposition of the uncertainties that arise in measurements and propagate through calculations and inferences, so that the reliability of its conclusions is made apparent. The recent rapid development of high-throughput DNA sequencing technologies has dramatically increased the number of measurements made at the biochemical and molecular level. These data come from many different DNA sequencing technologies, each with its own platform-specific errors and biases, which vary widely. Several statistical studies have tried to measure error rates for basic determinations, but there are no general schemes for projecting these uncertainties forward to assess the reliability of conclusions drawn about genetic, epigenetic, and more general biological questions. We review here the state of uncertainty quantification in DNA sequencing applications, describe sources of error, and propose methods that can be used to account for these errors and propagate them, together with their uncertainties, through subsequent calculations.
