Illumina error correction near highly repetitive DNA regions improves de novo genome assembly

Background: Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly.

Results: We propose BrownieCorrector, an error correction tool for Illumina sequencing data that corrects only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region, so each cluster can be corrected individually, which provides a consistent correction for all reads within that cluster.

Conclusions: BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases, even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly, where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under the GPL license. It relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector.
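To make the extract-and-cluster idea from the Results paragraph concrete, the following is a minimal Python sketch and not BrownieCorrector's actual implementation. It makes several simplifying assumptions: reads are plain strings, similarity is the number of shared k-mers outside the repetitive pattern, paired-end information is ignored, and connected components (via union-find) stand in for the community detection algorithm. The function names and the parameters `k` and `min_shared` are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_reads(reads, pattern, k=5, min_shared=2):
    """Group reads containing `pattern` by shared flanking k-mers.

    Assumption: connected components replace the community
    detection step; paired-end reads are not used here.
    """
    # Step 1: keep only reads that overlap the repetitive pattern.
    idx = [i for i, r in enumerate(reads) if pattern in r]

    # k-mers lying entirely inside the pattern carry no positional
    # information, so they are excluded from the comparison.
    pattern_kmers = kmers(pattern, k)
    ksets = {i: kmers(reads[i], k) - pattern_kmers for i in idx}

    # Step 2: union-find over reads that share enough k-mers.
    parent = {i: i for i in idx}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pos, a in enumerate(idx):
        for b in idx[pos + 1:]:
            if len(ksets[a] & ksets[b]) >= min_shared:
                parent[find(a)] = find(b)

    # Step 3: collect the clusters as sorted lists of read indices.
    clusters = defaultdict(list)
    for i in idx:
        clusters[find(i)].append(i)
    return sorted(sorted(c) for c in clusters.values())


# Toy example: two genomic regions that both contain an A homopolymer.
homopolymer = "A" * 8
reads = [
    "TTACAGC" + homopolymer + "TCGGA",    # region 1
    "ACAGC" + homopolymer + "TCGGATC",    # region 1
    "GTTGACT" + homopolymer + "GGCAT",    # region 2
    "TGACT" + homopolymer + "GGCATGC",    # region 2
    "ACGTACGTACGTACGTACGT",               # no pattern, not extracted
]
clusters = cluster_reads(reads, homopolymer)
print(clusters)  # [[0, 1], [2, 3]]
```

Each resulting cluster could then be corrected on its own, for example by a per-cluster consensus, which is what gives all reads from one genomic region a consistent correction. The pairwise comparison is quadratic in the number of extracted reads, so this sketch is only suitable for illustration.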
