A* fast and scalable high-throughput sequencing data error correction via oligomers

Next-generation sequencing (NGS) technologies have superseded the traditional Sanger sequencing approach in many experimental settings, given their tremendous yield and affordable cost. It is now possible to sequence any microbial organism or metagenomic sample within hours, and to obtain a whole human genome in weeks. Nonetheless, NGS technologies are error-prone. Correcting errors is challenging for several reasons, including the data sizes, the machine-specific and non-random characteristics of errors, and the error distributions. Errors in NGS experiments can hamper subsequent data analysis and inference. This work proposes an error correction method based on the de Bruijn graph that can run on gigabyte-sized data sets using ordinary desktop or laptop computers, making it well suited to genomes in the megabase range, e.g. bacteria. The implementation makes extensive use of hashing techniques and applies an A* algorithm for optimal error correction, minimizing the distance, measured by the Needleman-Wunsch score, between an erroneous read and its candidate replacement. Our approach outperforms other popular methods in terms of both random access memory usage and computing time.
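The combination of hashed k-mers, a de Bruijn graph, and A* search described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several parts are assumptions made for brevity: the function names (`solid_kmers`, `correct_read`), the `min_count` threshold used to decide which k-mers are trusted, and the cost model, which counts substitutions only (Hamming distance) rather than a full Needleman-Wunsch score with gaps. The heuristic is likewise an illustrative choice: it counts non-overlapping untrusted k-mers remaining in the read, each of which needs at least one substitution, so it never overestimates the remaining cost.

```python
import heapq
from collections import Counter


def solid_kmers(reads, k, min_count=2):
    """Hash every k-mer of every read and keep those seen at least min_count times.

    min_count is an assumed trust threshold, not a value from the paper.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return {kmer for kmer, c in counts.items() if c >= min_count}


def correct_read(read, solid, k):
    """A* search for the same-length sequence whose k-mers are all solid and
    whose Hamming distance to `read` is minimal.

    Nodes of the (node-centric) de Bruijn graph are solid k-mers; an edge
    u -> v exists when v == u[1:] + base. Returns (sequence, cost) or None.
    """
    n = len(read)
    if n < k:
        return None

    # h[i]: lower bound on substitutions still needed in read[i:], computed as
    # the maximum number of pairwise disjoint non-solid k-mer windows there.
    h = [0] * (n + 1)
    for j in range(n - k, -1, -1):
        h[j] = h[j + 1]
        if read[j:j + k] not in solid:
            h[j] = max(h[j], 1 + h[j + k])

    inf = float("inf")
    best = {}   # (position, last k-mer) -> lowest cost found so far
    heap = []   # entries: (f = g + h, g, position, last k-mer, sequence so far)
    for u in solid:
        g = sum(a != b for a, b in zip(u, read))
        best[(k, u)] = g
        heapq.heappush(heap, (g + h[k], g, k, u, u))

    while heap:
        f, g, i, u, seq = heapq.heappop(heap)
        if g > best.get((i, u), inf):
            continue  # stale entry: a cheaper route to this state was found
        if i == n:
            return seq, g  # first goal popped is optimal (h is admissible)
        for base in "ACGT":
            v = u[1:] + base
            if v not in solid:
                continue
            g2 = g + (base != read[i])
            if g2 < best.get((i + 1, v), inf):
                best[(i + 1, v)] = g2
                heapq.heappush(heap, (g2 + h[i + 1], g2, i + 1, v, seq + base))
    return None  # no walk of the required length exists in the graph


if __name__ == "__main__":
    truth = "ACGTTGCATGAC"
    noisy = "ACGTTACATGAC"  # one substitution at position 5 (G -> A)
    solid = solid_kmers([truth] * 3 + [noisy], k=4)
    print(correct_read(noisy, solid, k=4))
```

Because states are re-opened whenever a cheaper route is found, the search stays optimal even though the heuristic is only admissible, not consistent. Supporting insertions and deletions, as a Needleman-Wunsch score would require, means adding gap transitions to the state space; that is omitted here.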
