Evaluation of the impact of Illumina error correction tools on de novo genome assembly

Background: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods.

Results: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy.

Conclusions: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
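The distinction between errors removed and errors newly introduced can be made concrete with per-base counting. The following is a minimal sketch, not the evaluation procedure of this study: it assumes simulated reads whose true sequence is known, handles substitution errors only (all three strings must have equal length), and uses the "gain" measure (TP - FP) / (TP + FN) that is common in the error correction literature. The names `CorrectionCounts` and `count_corrections` are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class CorrectionCounts:
    tp: int = 0  # erroneous base restored to the true base
    fp: int = 0  # correct base changed into a wrong base (new error)
    fn: int = 0  # erroneous base still wrong after correction

    @property
    def gain(self) -> float:
        """(TP - FP) / (TP + FN); 0.0 if the raw read had no errors."""
        errors = self.tp + self.fn
        return (self.tp - self.fp) / errors if errors else 0.0


def count_corrections(true_read: str, raw_read: str, corrected_read: str) -> CorrectionCounts:
    """Classify every position of one read (substitution errors only)."""
    if not (len(true_read) == len(raw_read) == len(corrected_read)):
        raise ValueError("reads must have equal length (substitutions only)")
    counts = CorrectionCounts()
    for t, r, c in zip(true_read, raw_read, corrected_read):
        if r != t:            # position was a sequencing error
            if c == t:
                counts.tp += 1
            else:             # left unchanged or miscorrected
                counts.fn += 1
        elif c != t:          # position was correct, tool broke it
            counts.fp += 1
    return counts


# Toy example: two raw errors, one fixed, one missed, one new error introduced.
counts = count_corrections("ACGTACGT", "ACGAACCT", "ACGTACCA")
print(counts.tp, counts.fp, counts.fn, counts.gain)
```

A gain close to 1 means nearly all errors were removed with few new ones introduced, while a negative gain means the tool introduced more errors than it fixed. A per-base measure of this kind does not capture the inconsistent corrections in low-coverage or repetitive regions described above, which is why assembly-level evaluation is also needed.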
