EDAR: An Efficient Error Detection and Removal Algorithm for Next Generation Sequencing Data

Genomic sequencing techniques introduce experimental errors into reads, which can mislead sequence assembly efforts and complicate the diagnostic process. Here we present EDAR, a method for detecting and removing sequencing errors from reads generated in genomic shotgun sequencing projects prior to sequence assembly. For each input read, the set of all length-k substrings (k-mers) it contains is computed, and the read is evaluated based on the frequency with which each k-mer occurs in the complete data set (its k-count). Within each read, k-mers are clustered using the variable-bandwidth mean-shift algorithm, and each cluster is classified as an error region or a non-error region based on the k-count of its cluster center. For the 23 real and simulated data sets tested (454 and Solexa), our algorithm detected error regions that cover 99% of all errors. A heuristic algorithm is then applied to locate the errors within each putative error region. A read is corrected by removing the errors, thereby creating two or more smaller, error-free read fragments. After error removal, the error rate decreased for all data sets tested (∼35-fold reduction, on average). EDAR has accuracy comparable to methods that correct rather than remove errors, and it performs better on simulated data sets when the error rate is greater than 3%. The performance of the Velvet assembler is generally better with error-removed data. However, for short reads, splitting at the location of errors can be problematic; following error detection with error correction, rather than removal, may improve the assembly results.
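The k-count idea described above can be illustrated with a minimal sketch. This is not EDAR's actual procedure: a fixed threshold (the hypothetical parameter `min_count`) stands in for the variable-bandwidth mean-shift clustering and the error-locating heuristic, and the function names `kmer_counts` and `split_read` are assumptions introduced only for this example. It shows how k-mers that are rare in the full data set can mark a putative error region, and how removing that region splits a read into smaller fragments.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every length-k substring (k-mer) across the whole read set."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_read(read, counts, k, min_count):
    """Return the maximal fragments of `read` whose k-mers all have a
    k-count of at least `min_count` (a simplification: EDAR clusters
    k-counts with mean shift instead of using a fixed threshold)."""
    fragments = []
    start = None  # index of the first trusted k-mer in the current run
    for i in range(len(read) - k + 1):
        trusted = counts[read[i:i + k]] >= min_count
        if trusted and start is None:
            start = i
        elif not trusted and start is not None:
            # The trusted run is k-mers start..i-1, i.e. bases start..i+k-2.
            fragments.append(read[start:i + k - 1])
            start = None
    if start is not None:
        fragments.append(read[start:])
    return fragments


# Toy data: five error-free copies of a read plus one copy with a
# single substitution (A -> C at 0-based position 7).
good = "ACGTTGCAAGGCTA"
bad = "ACGTTGCCAGGCTA"
counts = kmer_counts([good] * 5 + [bad], k=4)

print(split_read(good, counts, k=4, min_count=2))  # ['ACGTTGCAAGGCTA']
print(split_read(bad, counts, k=4, min_count=2))   # ['ACGTTGC', 'AGGCTA']
```

In the toy example the four k-mers overlapping the substituted base each occur only once, so they fall below the threshold and the read is split into two fragments that exclude that base. The second fragment is only six bases long, which illustrates why splitting short reads at error locations can be problematic.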
