HErCoOl: High-Throughput Error Correction by Oligomers

Next-generation sequencing (NGS) technologies are marking the foundations for a new paradigm in genomics and transcriptomics. Nowadays is possible to sequence any microbial organism or meta-genomic sample within hours, and to obtain a whole human genome in less than a month. The sequencing prices are decreasing dramatically, opening to actual personalised medicine. NGS technologies however are error-prone, and correcting errors is a challenge due to multiple factors, including the data sizes (gigabyte scale) and the machine-specific, non-at-random, characteristics of errors and error distributions. Several approaches have been proposed, but yet the problem is a challenge, especially when analysing mixtures of (closely related) species, e.g., highly variable viruses infecting in a host as a swarm, like hepatitis C or human immunodeficiency virus. This work presents a novel error correction algorithm based on k-mer strings with their associated overlap graph, along with an open-source, multi-threaded, implementation. The algorithm, named Her Cool (High-throughput Error Correction by Oligomers), needs minimal tuning, only an overall error rate and -optionally- information about the genome sizes. Her Cool was compared against other state-of-the art methods, using empirical NGS data obtained with Roche 454 technology, focusing the benchmarks on mixtures of related species. Results show that Her Cool improves significantly over the current methods, and the parallelisation scales well with the size of input NGS genome producing long sequence reads, such as Roche 454 or Ion Torrent. Her Cool provides a fast and efficient error correction of NGS data, especially for mixed samples. Its platform-independent, open-source, multi-threaded implementation assures flexibility for being employed and integrated in any NGS data analysis software.

[1]  P. Pevzner,et al.  An Eulerian path approach to DNA fragment assembly , 2001, Proceedings of the National Academy of Sciences of the United States of America.

[2]  M. Metzker Sequencing technologies — the next generation , 2010, Nature Reviews Genetics.

[3]  P. Sellers On the Theory and Computation of Evolutionary Distances , 1974 .

[4]  Yongchao Liu,et al.  Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data , 2013, Bioinform..

[5]  Haixu Tang,et al.  Fragment assembly with short reads , 2004, Bioinform..

[6]  Nicholas Eriksson,et al.  ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data , 2011, BMC Bioinformatics.

[7]  Yuk Yee Leung,et al.  CoRAL: predicting non-coding RNAs from small RNA-sequencing data , 2013, Nucleic acids research.

[8]  Mattia C. F. Prosperi,et al.  QuRe: software for viral quasispecies reconstruction from next-generation sequencing data , 2012, Bioinform..

[9]  Srinivas Aluru,et al.  A survey of error-correction methods for next-generation sequencing , 2013, Briefings Bioinform..

[10]  Michael C. Zody,et al.  Highly Sensitive and Specific Detection of Rare Variants in Mixed Viral Populations from Massively Parallel Sequence Data , 2012, PLoS Comput. Biol..

[11]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[12]  Matthew B. Kerby,et al.  Landscape of next-generation sequencing technologies. , 2011, Analytical chemistry.

[13]  S. B. Needleman,et al.  A general method applicable to the search for similarities in the amino acid sequence of two proteins. , 1970, Journal of molecular biology.

[14]  F. Sanger,et al.  A Rapid Method for Determining Sequences in DNA by Primed Synthesis with DNA Polymerase , 1989 .

[15]  Gayle M. Wittenberg,et al.  EDAR: An Efficient Error Detection and Removal Algorithm for Next Generation Sequencing Data , 2010, J. Comput. Biol..

[16]  Pavel Skums,et al.  Efficient error correction for next-generation sequencing of viral amplicons , 2012, BMC Bioinformatics.

[17]  David R. Kelley,et al.  Quake: quality-aware detection and correction of sequencing errors , 2010, Genome Biology.

[18]  Richard Durbin,et al.  Sequence analysis Fast and accurate short read alignment with Burrows – Wheeler transform , 2009 .

[19]  Juliane C. Dohm,et al.  Substantial biases in ultra-short read data sets from high-throughput DNA sequencing , 2008, Nucleic acids research.

[20]  Srinivas Aluru,et al.  Reptile: representative tiling for short read error correction , 2010, Bioinform..

[21]  E. Birney,et al.  Velvet: algorithms for de novo short read assembly using de Bruijn graphs. , 2008, Genome research.