DUDE-Seq: Fast, flexible, and robust denoising of nucleotide sequences

We consider the correction of errors from nucleotide sequences produced by next-generation targeted amplicon sequencing. The next-generation sequencing (NGS) platforms can provide a great deal of sequencing data thanks to their high throughput, but the associated error rates often tend to be high. Denoising in high-throughput sequencing has thus become a crucial process for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel and effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that the proposed DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines by simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.

[1]  Russell J. Davenport,et al.  Removing Noise From Pyrosequenced Amplicons , 2011, BMC Bioinformatics.

[2]  William Bateson,et al.  Materials for the study of variation , 1894 .

[3]  Benedict Paten,et al.  Improved data analysis for the MinION nanopore sequencer , 2015, Nature Methods.

[4]  Sang Joon Kim,et al.  A Mathematical Theory of Communication , 2006 .

[5]  Alejandro A. Schäffer,et al.  A Fast and Symmetric DUST Implementation to Mask Low-Complexity DNA Sequences , 2006, J. Comput. Biol..

[6]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[7]  Heng Li,et al.  Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly , 2012, Bioinform..

[8]  W. T. ASTBURY,et al.  Molecular Biology or Ultrastructural Biology ? , 1961, Nature.

[9]  Marcel H. Schulz,et al.  Fiona: a parallel and automatic strategy for read error correction , 2014, Bioinform..

[10]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[11]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[12]  Robert C. Edgar,et al.  BIOINFORMATICS APPLICATIONS NOTE , 2001 .

[13]  J. Handelsman,et al.  Introducing DOTUR, a Computer Program for Defining Operational Taxonomic Units and Estimating Species Richness , 2005, Applied and Environmental Microbiology.

[14]  Byunghan Lee,et al.  CASPER: context-aware scheme for paired-end reads from high-throughput amplicon sequencing , 2014, BMC Bioinformatics.

[15]  Paul Greenfield,et al.  Blue: correcting sequencing errors using consensus and context , 2014, Bioinform..

[16]  David R. Kelley,et al.  Quake: quality-aware detection and correction of sequencing errors , 2010, Genome Biology.

[17]  Tsachy Weissman,et al.  Universal discrete denoising: known channel , 2003, IEEE Transactions on Information Theory.

[18]  Tin Wee Tan,et al.  Coverage analysis in a targeted amplicon-based next-generation sequencing panel for myeloid neoplasms , 2016, Journal of Clinical Pathology.

[19]  Xin Yin,et al.  PREMIER — PRobabilistic error-correction using Markov inference in errored reads , 2013, 2013 IEEE International Symposium on Information Theory.

[20]  Srinivas Aluru,et al.  A survey of error-correction methods for next-generation sequencing , 2013, Briefings Bioinform..

[21]  C. Quince,et al.  Insight into biases and sequencing errors for amplicon sequencing with the Illumina MiSeq platform , 2015, Nucleic acids research.

[22]  J. McPherson,et al.  Coming of age: ten years of next-generation sequencing technologies , 2016, Nature Reviews Genetics.

[23]  P. Green,et al.  Base-calling of automated sequencer traces using phred. I. Accuracy assessment. , 1998, Genome research.

[24]  Eun-Cheon Lim,et al.  Trowel: a fast and accurate error correction module for Illumina sequencing reads , 2014, Bioinform..

[25]  T. Thomas,et al.  GemSIM: general, error-model based simulator of next-generation sequencing data , 2012, BMC Genomics.

[26]  S. Morishita,et al.  Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing. , 2009, Genome research.

[27]  Forest Rohwer,et al.  TagCleaner: Identification and removal of tag sequences from genomic and metagenomic datasets , 2010, BMC Bioinformatics.

[28]  Daniel G. Brown,et al.  Pollux: platform independent error correction of single and mixed genomes , 2015, BMC Bioinformatics.

[29]  Tsachy Weissman,et al.  Universal denoising for the finite-input general-output channel , 2005, IEEE Transactions on Information Theory.

[30]  Xiaolong Wu,et al.  BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads , 2014, Bioinform..

[31]  Leena Salmela,et al.  Correction of sequencing errors in a mixed set of reads , 2010, Bioinform..

[32]  Daniel G. Brown,et al.  PANDAseq: paired-end assembler for illumina sequences , 2012, BMC Bioinformatics.

[33]  David M. W. Powers,et al.  Characterization and evaluation of similarity measures for pairs of clusterings , 2009, Knowledge and Information Systems.

[34]  Paul Medvedev,et al.  Error correction of high-throughput sequencing datasets with non-uniform coverage , 2011, Bioinform..

[35]  Florent E. Angly,et al.  Grinder: a versatile amplicon and shotgun sequence simulator , 2012, Nucleic acids research.

[36]  Srinivas Aluru,et al.  Reptile: representative tiling for short read error correction , 2010, Bioinform..

[37]  Tsachy Weissman,et al.  Algorithms for discrete denoising under channel uncertainty , 2005, IEEE Transactions on Signal Processing.

[38]  Mikel Hernaez,et al.  Effect of lossy compression of quality scores on variant calling , 2015, bioRxiv.

[39]  David Laehnemann,et al.  Denoising DNA deep sequencing data—high-throughput sequencing errors and their correction , 2015, Briefings Bioinform..

[40]  Jan Schröder,et al.  BIOINFORMATICS ORIGINAL PAPER , 2022 .

[41]  R. Norman,et al.  Microbial phylogenetic profiling with the Pacific Biosciences sequencing platform , 2013, Microbiome.

[42]  Philip Hugenholtz,et al.  Shining a Light on Dark Sequencing: Characterising Errors in Ion Torrent PGM Data , 2013, PLoS Comput. Biol..

[43]  S. Salzberg,et al.  Bioinformatics challenges of new sequencing technology. , 2008, Trends in genetics : TIG.

[44]  Andrea K. Bartram,et al.  Generation of Multimillion-Sequence 16S rRNA Gene Libraries from Complex Microbial Communities by Assembling Paired-End Illumina Reads , 2011, Applied and Environmental Microbiology.

[45]  Jan Schröder,et al.  Genome analysis SHREC : a short-read error correction method , 2009 .

[46]  Andrew H. Chan,et al.  ECHO: a reference-free short-read error correction algorithm. , 2011, Genome research.

[47]  Hanlee P. Ji,et al.  Next-generation DNA sequencing , 2008, Nature Biotechnology.

[48]  Sergey I. Nikolenko,et al.  BayesHammer: Bayesian clustering for error correction in single-cell sequencing , 2012, BMC Genomics.

[49]  Saumya Shekhar Jamuar,et al.  Clinical application of next-generation sequencing for Mendelian diseases , 2015, Human Genomics.

[50]  J. Handelsman,et al.  Metagenomics: genomic analysis of microbial communities. , 2004, Annual review of genetics.

[51]  Robert A. Edwards,et al.  Quality control and preprocessing of metagenomic datasets , 2011, Bioinform..

[52]  Khader Shameer,et al.  HORI: a web server to compute Higher Order Residue Interactions in protein structures , 2010, BMC Bioinformatics.

[53]  Lauren M. Bragg,et al.  Fast, accurate error-correction of amplicon pyrosequences using Acacia , 2012, Nature Methods.

[54]  Lucian Ilie,et al.  HiTEC: accurate error correction in high-throughput sequencing data , 2011, Bioinform..

[55]  Steven Salzberg,et al.  BIOINFORMATICS ORIGINAL PAPER , 2004 .

[56]  Lior Pachter,et al.  Identification and correction of systematic error in high-throughput sequence data , 2011, BMC Bioinformatics.

[57]  William Bateson,et al.  Materials for the Study of Variation: Treated with Especial Regard to Discontinuity in the Origin of Species , 1894 .

[58]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[59]  J. Shendure,et al.  Exome sequencing as a tool for Mendelian disease gene discovery , 2011, Nature Reviews Genetics.

[60]  Yutaka Suzuki,et al.  Recount: expectation maximization based error correction tool for next generation sequencing data. , 2009, Genome informatics. International Conference on Genome Informatics.

[61]  Siu-Ming Yiu,et al.  COPE: an accurate k-mer-based pair-end reads connection tool to facilitate genome assembly , 2012, Bioinform..

[62]  R. Knight,et al.  Rapid denoising of pyrosequencing amplicon data: exploiting the rank-abundance distribution , 2010, Nature Methods.

[63]  Byunghan Lee,et al.  Neural Universal Discrete Denoiser , 2016, NIPS.

[64]  Srinivas Aluru,et al.  Repeat-aware modeling and correction of short read errors , 2011, BMC Bioinformatics.