DUDE-Seq: Fast, flexible, and robust denoising of nucleotide sequences

We consider the correction of errors in nucleotide sequences produced by next-generation targeted amplicon sequencing. Next-generation sequencing (NGS) platforms provide large volumes of sequencing data thanks to their high throughput, but the associated error rates are often high. Denoising has thus become a crucial step for improving the reliability of downstream analyses in high-throughput sequencing. Our methodology, named DUDE-Seq, is derived from a general setting of reconstructing finite-valued source data corrupted by a discrete memoryless channel, and it effectively corrects substitution and homopolymer indel errors, the two major types of sequencing errors in most high-throughput targeted amplicon sequencing platforms. Our experimental studies with real and simulated datasets suggest that DUDE-Seq not only outperforms existing alternatives in terms of error-correction capability and time efficiency, but also boosts the reliability of downstream analyses. Further, the flexibility of DUDE-Seq enables its robust application to different sequencing platforms and analysis pipelines through simple updates of the noise model. DUDE-Seq is available at http://data.snu.ac.kr/pub/dude-seq.
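The "general setting" mentioned in the abstract is that of universal discrete denoising (DUDE): a first pass counts how often each symbol appears inside each two-sided context, and a second pass picks, for every position, the symbol that minimizes an estimated loss computed from those counts and the channel matrix. The following is a minimal sketch of that generic rule for substitution errors only, assuming Hamming loss and a known, invertible channel matrix. It is not the DUDE-Seq implementation, and it does not handle homopolymer indels. The names `solve`, `substitution_channel`, and `dude_denoise` are hypothetical and introduced here for illustration.

```python
from collections import defaultdict

ALPHABET = "ACGT"


def solve(A, b):
    """Solve A y = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]


def substitution_channel(delta):
    """Assumed toy channel: channel[x][y] = P(observed y | true x),
    where each base is replaced by one of the other three with total
    probability delta."""
    size = len(ALPHABET)
    return [[1.0 - delta if x == y else delta / (size - 1)
             for y in range(size)] for x in range(size)]


def dude_denoise(z, channel, k=2):
    """Two-pass DUDE-style denoiser with Hamming loss and context size k.
    The first and last k symbols are left unchanged."""
    size = len(ALPHABET)
    idx = {a: i for i, a in enumerate(ALPHABET)}
    n = len(z)

    # Pass 1: count the symbols observed inside each (left, right) context.
    counts = defaultdict(lambda: [0] * size)
    for i in range(k, n - k):
        ctx = (z[i - k:i], z[i + 1:i + k + 1])
        counts[ctx][idx[z[i]]] += 1

    # Transposed channel matrix, used to compute q = Pi^{-T} m.
    channel_t = [[channel[x][y] for x in range(size)] for y in range(size)]

    # Pass 2: apply the denoising rule at every interior position.
    out = list(z)
    for i in range(k, n - k):
        ctx = (z[i - k:i], z[i + 1:i + k + 1])
        # q estimates the clean-symbol counts in this context.
        q = solve(channel_t, counts[ctx])
        zi = idx[z[i]]
        # Estimated loss of outputting xh: sum over clean symbols x != xh
        # of q[x] * P(observing z_i | x).
        scores = [sum(q[x] * channel[x][zi] for x in range(size) if x != xh)
                  for xh in range(size)]
        out[i] = ALPHABET[min(range(size), key=scores.__getitem__)]
    return "".join(out)
```

On a toy periodic read such as `"ACGTTGCA" * 50` with one substituted base, `dude_denoise(noisy, substitution_channel(0.05), k=2)` restores the original symbol, because the surrounding context is seen many times with the correct base and only once with the erroneous one. Changing the sequencing platform in this sketch would amount to passing a different `channel` matrix, which mirrors the "simple updates of the noise model" described in the abstract.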

Tsachy Weissman | Byunghan Lee | Sungroh Yoon | Taesup Moon
