Paper information - DUDE-Seq: Fast Universal Denoising of Nucleotide Sequences

DUDE-Seq: Fast Universal Denoising of Nucleotide Sequences

We consider the correction of errors in nucleotide sequences produced by next-generation sequencing. The error rate in reads has been increasing as the focus of mainstream sequencers has shifted from accuracy to throughput. Denoising in high-throughput sequencing is thus becoming a crucial component for boosting the reliability of downstream analyses. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel. It provides an effective means of correcting substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput sequencing platforms. Our experimental studies with real and simulated data sets suggest that DUDE-Seq outperforms existing alternatives in terms of error-correction capability and time efficiency, and that it boosts the reliability of downstream analyses. Further, DUDE-Seq is universally applicable across different sequencing platforms and analysis pipelines by a simple update of the noise model. [availability: http://data.snu.ac.kr/pub/dude-seq]
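To make the "discrete memoryless channel" setting concrete, the following is a minimal sketch of a two-pass discrete universal denoiser for substitution errors only. It is an illustration under stated assumptions, not the DUDE-Seq implementation: the function names `symmetric_channel` and `dude_denoise`, the symmetric noise model, the context size `k`, and the Hamming loss are all choices made here for the example, and homopolymer indel correction is not covered.

```python
from collections import defaultdict

import numpy as np

ALPHABET = "ACGT"
IDX = {base: i for i, base in enumerate(ALPHABET)}


def symmetric_channel(delta):
    """Assumed noise model: each base is kept with probability 1 - delta
    and replaced by each of the other three bases with probability delta / 3."""
    channel = np.full((4, 4), delta / 3.0)
    np.fill_diagonal(channel, 1.0 - delta)
    return channel


def dude_denoise(noisy, channel, k=2):
    """Hypothetical helper: two-pass discrete denoising with a two-sided
    context of k symbols on each side and Hamming loss."""
    n = len(noisy)
    z = [IDX[base] for base in noisy]
    loss = 1.0 - np.eye(4)               # Hamming loss: 0 if correct, 1 otherwise
    channel_inv = np.linalg.inv(channel)  # requires an invertible channel matrix

    # Pass 1: for every (left, right) context, count which noisy symbol
    # appears in the middle.
    counts = defaultdict(lambda: np.zeros(4))
    for i in range(k, n - k):
        context = (noisy[i - k:i], noisy[i + 1:i + k + 1])
        counts[context][z[i]] += 1

    # Pass 2: for every position, pick the reconstruction that minimizes
    # the estimated expected loss given its context counts.
    out = list(noisy)  # the first and last k symbols are left unchanged
    for i in range(k, n - k):
        context = (noisy[i - k:i], noisy[i + 1:i + k + 1])
        clean_est = counts[context] @ channel_inv  # estimated clean-symbol counts
        weights = clean_est * channel[:, z[i]]     # weight by P(observed | clean)
        scores = [float(weights @ loss[:, x_hat]) for x_hat in range(4)]
        out[i] = ALPHABET[int(np.argmin(scores))]
    return "".join(out)


if __name__ == "__main__":
    clean = "ACGT" * 20
    noisy = clean[:42] + "T" + clean[43:]  # one substitution error (G -> T)
    print(dude_denoise(noisy, symmetric_channel(0.1), k=2))
```

In this toy input, the context around the corrupted position occurs many times with the correct base, so the context counts outweigh the single erroneous observation. Swapping in a different channel matrix is the only change needed to model another platform's substitution profile, which mirrors the "simple update of the noise model" described in the abstract.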
