A comparative study of k-spectrum-based error correction methods for next-generation sequencing data analysis

BackgroundInnumerable opportunities for new genomic research have been stimulated by advancement in high-throughput next-generation sequencing (NGS). However, the pitfall of NGS data abundance is the complication of distinction between true biological variants and sequence error alterations during downstream analysis. Many error correction methods have been developed to correct erroneous NGS reads before further analysis, but independent evaluation of the impact of such dataset features as read length, genome size, and coverage depth on their performance is lacking. This comparative study aims to investigate the strength and weakness as well as limitations of some newest k-spectrum-based methods and to provide recommendations for users in selecting suitable methods with respect to specific NGS datasets.MethodsSix k-spectrum-based methods, i.e., Reptile, Musket, Bless, Bloocoo, Lighter, and Trowel, were compared using six simulated sets of paired-end Illumina sequencing data. These NGS datasets varied in coverage depth (10× to 120×), read length (36 to 100 bp), and genome size (4.6 to 143 MB). Error Correction Evaluation Toolkit (ECET) was employed to derive a suite of metrics (i.e., true positives, false positive, false negative, recall, precision, gain, and F-score) for assessing the correction quality of each method.ResultsResults from computational experiments indicate that Musket had the best overall performance across the spectra of examined variants reflected in the six datasets. The lowest accuracy of Musket (F-score = 0.81) occurred to a dataset with a medium read length (56 bp), a medium coverage (50×), and a small-sized genome (5.4 MB). The other five methods underperformed (F-score < 0.80) and/or failed to process one or more datasets.ConclusionsThis study demonstrates that various factors such as coverage depth, read length, and genome size may influence performance of individual k-spectrum-based error correction methods. Thus, efforts have to be paid in choosing appropriate methods for error correction of specific NGS datasets. Based on our comparative study, we recommend Musket as the top choice because of its consistently superior performance across all six testing datasets. Further extensive studies are warranted to assess these methods using experimental datasets generated by NGS platforms (e.g., 454, SOLiD, and Ion Torrent) under more diversified parameter settings (k-mer values and edit distances) and to compare them against other non-k-spectrum-based classes of error correction methods.

[1]  Mark J. P. Chaisson,et al.  De novo fragment assembly with short mate-paired reads: Does the read length matter? , 2009, Genome research.

[2]  Sebastian Deorowicz,et al.  KMC 2: Fast and resource-frugal k-mer counting , 2014, Bioinform..

[3]  C. Nusbaum,et al.  Quality scores and SNP detection in sequencing-by-synthesis systems. , 2008, Genome research.

[4]  Joshua S. Paul,et al.  Genotype and SNP calling from next-generation sequencing data , 2011, Nature Reviews Genetics.

[5]  Srinivas Aluru,et al.  Reptile: representative tiling for short read error correction , 2010, Bioinform..

[6]  B. Langmead,et al.  Lighter: fast and memory-efficient sequencing error correction without counting , 2014, Genome Biology.

[7]  Eun-Cheon Lim,et al.  Trowel: a fast and accurate error correction module for Illumina sequencing reads , 2014, Bioinform..

[8]  Xin Yin,et al.  PREMIER — PRobabilistic error-correction using Markov inference in errored reads , 2013, 2013 IEEE International Symposium on Information Theory.

[9]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[10]  Gayle M. Wittenberg,et al.  EDAR: An Efficient Error Detection and Removal Algorithm for Next Generation Sequencing Data , 2010, J. Comput. Biol..

[11]  David R. Kelley,et al.  Quake: quality-aware detection and correction of sequencing errors , 2010, Genome Biology.

[12]  Euan A Ashley,et al.  Performance comparison of whole-genome sequencing platforms , 2011, Nature Biotechnology.

[13]  Srinivas Aluru,et al.  A survey of error-correction methods for next-generation sequencing , 2013, Briefings Bioinform..

[14]  Peter Sanders,et al.  Cache-, hash-, and space-efficient bloom filters , 2009, JEAL.

[15]  Andrew H. Chan,et al.  ECHO: a reference-free short-read error correction algorithm. , 2011, Genome research.

[16]  Marcel H. Schulz,et al.  Probabilistic error correction for RNA sequencing , 2013, Nucleic acids research.

[17]  Xin Yin,et al.  PREMIER Turbo: Probabilistic error-correction using Markov inference in errored reads using the turbo principle , 2013, 2013 IEEE Global Conference on Signal and Information Processing.

[18]  A. Gnirke,et al.  ALLPATHS 2: small genomes assembled accurately and with high continuity from short paired reads , 2009, Genome Biology.

[19]  Paul Medvedev,et al.  Error correction of high-throughput sequencing datasets with non-uniform coverage , 2011, Bioinform..

[20]  Burton H. Bloom,et al.  Space/time trade-offs in hash coding with allowable errors , 1970, CACM.

[21]  Yongchao Liu,et al.  Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data , 2013, Bioinform..

[22]  Leena Salmela,et al.  Correction of sequencing errors in a mixed set of reads , 2010, Bioinform..

[23]  Dominique Lavenier,et al.  DSK: k-mer counting with very low memory usage , 2013, Bioinform..

[24]  Lucian Ilie,et al.  HiTEC: accurate error correction in high-throughput sequencing data , 2011, Bioinform..

[25]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[26]  Lu Wang,et al.  The NIH Human Microbiome Project. , 2009, Genome research.

[27]  Huanming Yang,et al.  De novo assembly of human genomes with massively parallel short read sequencing. , 2010, Genome research.

[28]  T. Glenn Field guide to next‐generation DNA sequencers , 2011, Molecular ecology resources.

[29]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[30]  W. Pearson Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. , 1991, Genomics.

[31]  Paul Medvedev,et al.  Informed and automated k-mer size selection for genome assembly , 2013, Bioinform..

[32]  Gonçalo R. Abecasis,et al.  The Sequence Alignment/Map format and SAMtools , 2009, Bioinform..

[33]  Leping Li,et al.  ART: a next-generation sequencing read simulator , 2012, Bioinform..

[34]  Yongchao Liu,et al.  CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform , 2012, Bioinform..

[35]  Jan Schröder,et al.  BIOINFORMATICS ORIGINAL PAPER , 2022 .

[36]  Dominique Lavenier,et al.  GATB: Genome Assembly & Analysis Tool Box , 2014, Bioinform..

[37]  Xiaolong Wu,et al.  BLESS: Bloom filter-based error correction solution for high-throughput sequencing reads , 2014, Bioinform..

[38]  C. Furusawa,et al.  Comparison of Sequence Reads Obtained from Three Next-Generation Sequencing Platforms , 2011, PloS one.

[39]  P. Stankiewicz,et al.  Whole-genome sequencing in a patient with Charcot-Marie-Tooth neuropathy. , 2010, The New England journal of medicine.

[40]  M. Schatz,et al.  Algorithms Gage: a Critical Evaluation of Genome Assemblies and Assembly Material Supplemental , 2008 .

[41]  Jan Schröder,et al.  Genome analysis SHREC : a short-read error correction method , 2009 .