DUDE-Seq: Fast Universal Denoising of Nucleotide Sequences

We consider the correction of errors in nucleotide sequences produced by next-generation sequencing. Read error rates have been rising as mainstream sequencers shift their focus from accuracy to throughput. Denoising is thus becoming a crucial step in high-throughput sequencing for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from the general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel, and it provides an effective means of correcting substitution and homopolymer indel errors, the two major types of sequencing errors on most high-throughput sequencing platforms. Our experimental studies with real and simulated data sets suggest that DUDE-Seq outperforms existing alternatives in error-correction capability and time efficiency, as well as in boosting the reliability of downstream analyses. Further, DUDE-Seq is universally applicable across different sequencing platforms and analysis pipelines by a simple update of the noise model. [availability: http://data.snu.ac.kr/pub/dude-seq]
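
The discrete-memoryless-channel setting named in the abstract can be illustrated with a minimal sketch of the generic two-pass, context-counting rule known as the discrete universal denoiser (DUDE). This is not the DUDE-Seq implementation. It assumes a four-letter alphabet, a known and invertible channel matrix, Hamming loss, and substitution errors only; homopolymer indels are not handled. The names `dude_denoise` and `symmetric_channel`, the context length `k`, and the error rate used below are hypothetical choices made for illustration.

```python
from collections import defaultdict

import numpy as np

ALPHABET = "ACGT"
IDX = {c: i for i, c in enumerate(ALPHABET)}


def symmetric_channel(delta):
    """Hypothetical noise model: channel[x, z] = P(observe z | true x).

    Each base is kept with probability 1 - delta and replaced by each of
    the other three bases with probability delta / 3.
    """
    n = len(ALPHABET)
    return (1.0 - delta) * np.eye(n) + (delta / 3.0) * (1.0 - np.eye(n))


def dude_denoise(noisy, channel, k=2, loss=None):
    """Sketch of the two-pass DUDE rule for substitution errors.

    noisy   : observed sequence over ALPHABET
    channel : invertible matrix, rows = true symbol, cols = observed symbol
    k       : number of context symbols taken on each side (assumed value)
    loss    : loss[x, x_hat]; defaults to Hamming loss
    """
    n_sym = len(ALPHABET)
    if loss is None:
        loss = 1.0 - np.eye(n_sym)
    z = [IDX[c] for c in noisy]
    n = len(z)

    def context(i):
        return (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))

    # Pass 1: for every two-sided context, count the observed centre symbols.
    counts = defaultdict(lambda: np.zeros(n_sym))
    for i in range(k, n - k):
        counts[context(i)][z[i]] += 1

    # Pass 2: per position, pick the reconstruction with the lowest
    # estimated expected loss. The first and last k symbols are left as-is.
    channel_inv = np.linalg.inv(channel)
    out = list(z)
    for i in range(k, n - k):
        m = counts[context(i)]
        # Unnormalised estimate of the clean-symbol distribution in this context.
        q = m @ channel_inv
        # scores[x_hat] = sum_x q[x] * channel[x, z_i] * loss[x, x_hat]
        scores = (q * channel[:, z[i]]) @ loss
        out[i] = int(np.argmin(scores))
    return "".join(ALPHABET[s] for s in out)


if __name__ == "__main__":
    clean = "ACGT" * 50
    noisy = clean[:101] + "G" + clean[102:]  # one substitution: C -> G
    restored = dude_denoise(noisy, symmetric_channel(0.05), k=2)
    print(restored == clean)
```

In this sketch the only platform-specific input is the `channel` matrix, which is one way to read the abstract's remark that a different platform needs only an updated noise model.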
