Background: Continued advances in next-generation short-read sequencing technologies are increasing throughput and read lengths while driving down error rates. Taking advantage of the high-coverage sampling used in many applications, several error correction algorithms have been developed to improve data quality further. However, correcting errors in high-coverage sequence data requires significant computing resources.

Methods: We propose a different approach to handling erroneous sequence data. Presently, error rates of high-throughput platforms such as the Illumina HiSeq are within 1%. Moreover, the errors are not uniformly distributed across reads, and a large percentage of reads are error-free. The ability to predict such perfect reads can significantly reduce the run-time complexity of applications. We present a simple and fast method, based on k-spectrum analysis, to identify error-free reads. The filtration process that identifies and weeds out erroneous reads can be customized at several levels of stringency, depending on the needs of the downstream application.

Results: Our experiments show that if around 80% of the reads in a dataset are perfect, our method retains almost 99.9% of them with a precision of more than 90%. Although filtering out the reads that our method identifies as erroneous reduces the average coverage by about 7%, we found that the remaining reads provide coverage as uniform as that of the original dataset. We demonstrate the effectiveness of our approach on an example downstream application. Reptile is an error correction algorithm that relies on collectively analyzing the reads in a dataset to identify and correct erroneous bases. When Reptile instead uses the reads predicted to be perfect by our method to correct the other reads, its overall accuracy improves further by up to 10%.

Conclusions: Thanks to continuous technological improvements, the coverage and accuracy of reads from the dominant sequencing platforms have reached a point where we can envision simply filtering out reads with errors, making error correction less important. Our algorithm is a first attempt to propose and demonstrate this new paradigm. Moreover, our demonstration is applicable to any error correction algorithm used as a downstream application, which in turn yields a new class of error correction algorithms as a by-product.
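The idea of filtering reads by k-spectrum analysis can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact decision rule, so the code assumes that a read is predicted error-free when every one of its k-mers occurs at least `min_count` times in the dataset. The function names (`kmers`, `build_spectrum`, `split_reads`) and the parameters `k` and `min_count` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kmers(read, k):
    """Yield every length-k substring of a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_spectrum(reads, k):
    """Count how often each k-mer occurs across the whole dataset."""
    spectrum = Counter()
    for read in reads:
        spectrum.update(kmers(read, k))
    return spectrum


def split_reads(reads, k, min_count):
    """Partition reads into (predicted_perfect, predicted_erroneous).

    Assumption (not taken from the paper): a read is predicted
    error-free when each of its k-mers occurs at least min_count
    times in the dataset. Reads shorter than k are treated as
    erroneous because they contain no k-mer to check.
    """
    spectrum = build_spectrum(reads, k)
    perfect, erroneous = [], []
    for read in reads:
        long_enough = len(read) >= k
        if long_enough and all(spectrum[km] >= min_count for km in kmers(read, k)):
            perfect.append(read)
        else:
            erroneous.append(read)
    return perfect, erroneous


if __name__ == "__main__":
    # Three identical reads plus one read carrying a single substitution.
    reads = ["ACGTACGT"] * 3 + ["ACGTTCGT"]
    perfect, erroneous = split_reads(reads, k=4, min_count=2)
    print(len(perfect), "reads kept;", len(erroneous), "filtered out")
```

In this sketch, raising `min_count` makes the filter stricter, which is one simple way to picture the customizable stringency levels mentioned in the Methods. A practical tool would likely also count reverse-complement k-mers and choose the threshold from the observed k-mer frequency distribution.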