REFECT: a novel paradigm for correcting short reads

Sequencing technology has advanced rapidly. By parallelizing the whole procedure, a single run now produces millions to billions of short reads from a DNA molecule. Because sequencing is very cost-effective and can be performed in a laboratory within a brief period of time, biological sequencing data have grown explosively. There is, however, a tradeoff between the abundance and the accuracy of sequencing reads: limitations of the technology introduce errors into the reads. These errors may be substitutions, insertions, and/or deletions affecting one or more bases. Although error rates have fallen considerably as the technology has improved, errors remain a serious concern. Sequence assemblers often fail to reconstruct the entire genome because of errors in the reads. Identifying and correcting the erroneous bases not only yields high-quality data but can also greatly reduce the computational complexity of many biological applications. Traditional approaches exploit overlaps among the reads to correct them. Meanwhile, biologists have successfully sequenced thousands of species, and the list of species for which reference genomes are available is growing rapidly. Motivated by this, we have developed a novel hybrid error correcting algorithm called HECTOR (Hybrid Error CorrecTOR), which employs both referential and de novo error correction techniques to correct errors in reads. Our extensive experiments indicate that HECTOR is an effective error correction algorithm.
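To make the hybrid idea concrete, the following is a minimal illustrative sketch in Python. It is not the algorithm described in this paper, whose details the abstract does not give. The sketch assumes substitution errors only (insertions and deletions are ignored), a brute-force scan of the reference, and a simple k-mer frequency fallback for reads that do not match the reference. All function names, the parameters `k`, `solid`, and `max_mismatches`, and the toy sequences are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def correct_with_reference(read, reference, max_mismatches=1):
    """Referential step (assumed): return the unique reference window within
    max_mismatches substitutions of the read, or None if there is none."""
    n = len(read)
    best, best_dist, ties = None, max_mismatches + 1, 0
    for i in range(len(reference) - n + 1):
        window = reference[i:i + n]
        d = hamming(read, window)
        if d < best_dist:
            best, best_dist, ties = window, d, 1
        elif d == best_dist and window != best:
            ties += 1  # ambiguous placement: do not trust it
    return best if best is not None and ties == 1 else None

def kmer_counts(reads, k):
    """Count every k-mer occurring in the read set."""
    counts = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            counts[r[i:i + k]] += 1
    return counts

def correct_de_novo(read, counts, k, solid=2):
    """De novo step (assumed): try every single-base substitution and keep
    the candidate with the fewest rare (count < solid) k-mers."""
    def weak(r):
        return sum(counts[r[i:i + k]] < solid for i in range(len(r) - k + 1))
    best, best_weak = read, weak(read)
    if best_weak == 0:
        return read
    for i in range(len(read)):
        for base in "ACGT":
            if base != read[i]:
                cand = read[:i] + base + read[i + 1:]
                w = weak(cand)
                if w < best_weak:
                    best, best_weak = cand, w
    return best

def hybrid_correct(reads, reference, k=4, max_mismatches=1):
    """Use the reference when a read places uniquely; otherwise fall back
    to the k-mer based de novo correction."""
    counts = kmer_counts(reads, k)
    corrected = []
    for r in reads:
        fixed = correct_with_reference(r, reference, max_mismatches)
        corrected.append(fixed if fixed is not None
                         else correct_de_novo(r, counts, k))
    return corrected

# Toy data: the first two reads come from the reference region, the rest
# from a region the (hypothetical) reference does not cover.
reference = "ACGTTGCAAGCTTAGGCTA"
reads = ["ACGTTGCA", "ACGTAGCA",
         "GATTACAG", "GATTACAG", "GATTACAG", "GATTGCAG"]
print(hybrid_correct(reads, reference))
```

In this toy run, the read `ACGTAGCA` is repaired from the reference, while `GATTGCAG` has no close reference match and is repaired from the k-mer counts of the other reads. A practical tool would use an indexed aligner instead of a linear scan and would also need to handle insertions and deletions.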
