MapReduce for accurate error correction of next-generation sequencing data

Motivation: Next-generation sequencing platforms have produced huge amounts of sequence data, which is revolutionizing every aspect of genetic and genomic research. However, these datasets contain a substantial number of machine-induced errors; for example, the substitution error rate can be as high as 2.5%. Existing error-correction methods are still far from perfect. In fact, they sometimes introduce more errors than they correct, especially the prevalent k-mer based methods. Existing methods have also made limited use of on-demand cloud computing.

Results: We introduce an error-correction method named MEC, which uses a two-layered MapReduce technique to achieve high correction performance. In the first layer, all the input sequences are mapped to groups so that candidate erroneous bases can be identified in parallel. In the second layer, the candidate erroneous bases at the same position are linked together across all the groups to make statistically reliable corrections. Experiments on real and simulated datasets show that our method remarkably outperforms existing methods: its per-position error rate is consistently the lowest, and its correction gain is always the highest.

Availability and Implementation: The source code is available at bioinformatics.gxu.edu.cn/ngs/mec.

Contacts: wongls@comp.nus.edu.sg or jinyan.li@uts.edu.au.

Supplementary information: Supplementary data are available at Bioinformatics online.
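The two-layer idea described under Results can be illustrated with a toy, single-process sketch. It is not the MEC implementation: the abstract does not say how groups are formed or which statistical test is used, so the shared-k-mer grouping rule, the column-majority check, and the names `map_reads_to_groups`, `find_candidates`, `correct_reads`, `min_depth` and `min_votes` are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def map_reads_to_groups(reads, k):
    """Layer 1 (map): key every read by each k-mer it contains.

    Assumption: sharing a k-mer is the grouping rule; the abstract does not specify one.
    """
    groups = defaultdict(list)
    for read_id, seq in enumerate(reads):
        for offset in range(len(seq) - k + 1):
            groups[seq[offset:offset + k]].append((read_id, offset))
    return groups


def find_candidates(reads, members, min_depth=3):
    """Layer 1 (reduce): align group members on the shared k-mer and flag
    bases that disagree with the majority base of their column."""
    columns = defaultdict(Counter)
    for read_id, offset in members:
        for pos, base in enumerate(reads[read_id]):
            columns[pos - offset][base] += 1
    candidates = []
    for read_id, offset in members:
        for pos, base in enumerate(reads[read_id]):
            counts = columns[pos - offset]
            consensus, _ = counts.most_common(1)[0]
            if base != consensus and sum(counts.values()) >= min_depth:
                candidates.append((read_id, pos, consensus))
    return candidates


def correct_reads(reads, k=4, min_votes=2):
    """Layer 2: link candidates for the same (read, position) across all
    groups and correct only when enough groups agree (a simple vote stands
    in here for a real statistical test)."""
    votes = defaultdict(Counter)
    for members in map_reads_to_groups(reads, k).values():
        for read_id, pos, suggestion in find_candidates(reads, members):
            votes[(read_id, pos)][suggestion] += 1
    corrected = [list(seq) for seq in reads]
    for (read_id, pos), counts in votes.items():
        suggestion, n = counts.most_common(1)[0]
        if n >= min_votes and n > sum(counts.values()) / 2:
            corrected[read_id][pos] = suggestion
    return ["".join(seq) for seq in corrected]


# Toy input: four error-free copies of a made-up sequence and one copy
# with a single substitution (C -> T at position 6).
genome = "ACGTTGCAAGCT"
reads = [genome] * 4 + ["ACGTTGTAAGCT"]
corrected = correct_reads(reads)
```

In a real MapReduce setting, each group would be processed by a separate worker in the first layer, and the second layer would shuffle candidates by their (read, position) key before voting; the loops above only mimic that data flow on one machine.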
