Paper information - MapReduce for accurate error correction of next-generation sequencing data

Motivation: Next-generation sequencing platforms have produced huge amounts of sequence data, which is revolutionizing every aspect of genetic and genomic research. However, these datasets contain a considerable number of machine-induced errors; for example, the substitution error rate can be as high as 2.5%. Existing error-correction methods are still far from perfect. In fact, they sometimes introduce more errors than they correct, especially the prevalent k-mer based methods. Existing methods have also made limited use of on-demand cloud computing.

Results: We introduce an error-correction method named MEC, which uses a two-layered MapReduce technique to achieve high correction performance. In the first layer, all input sequences are mapped to groups so that candidate erroneous bases can be identified in parallel. In the second layer, the erroneous bases at the same position are linked together across all groups to make statistically reliable corrections. Experiments on real and simulated datasets show that our method outperforms existing methods remarkably: its per-position error rate is consistently the lowest, and its correction gain is always the highest.

Availability and Implementation: The source code is available at bioinformatics.gxu.edu.cn/ngs/mec.

Contacts: wongls@comp.nus.edu.sg or jinyan.li@uts.edu.au.

Supplementary information: Supplementary data are available at Bioinformatics online.
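The two-layered idea described in the Results can be illustrated with a toy sketch. This is not the MEC implementation: it assumes each read comes with a known start offset (a simplification made only for illustration), groups reads by offset window, and uses simple majority voting. The names `map_layer`, `reduce_layer`, `group_size`, and `min_support` are hypothetical. In the first layer each group is processed independently (so groups could run in parallel) and flags bases at positions where the group's reads disagree; in the second layer the per-position counts from all groups are merged before a correction is made.

```python
from collections import Counter, defaultdict


def map_layer(reads, group_size):
    """Layer 1 (toy): group reads by offset window and flag candidate bases.

    reads: list of (offset, sequence) pairs; offsets are assumed known here.
    Returns one (per-position base counts, candidates) pair per group, where a
    candidate is (position, read_id, index_in_read).
    """
    groups = defaultdict(list)
    for read_id, (offset, seq) in enumerate(reads):
        groups[offset // group_size].append((read_id, offset, seq))

    results = []
    for key in sorted(groups):  # each group is independent of the others
        counts = defaultdict(Counter)
        for _, offset, seq in groups[key]:
            for i, base in enumerate(seq):
                counts[offset + i][base] += 1
        # A base is a candidate if the group's reads disagree at its position.
        candidates = [
            (offset + i, read_id, i)
            for read_id, offset, seq in groups[key]
            for i in range(len(seq))
            if len(counts[offset + i]) > 1
        ]
        results.append((counts, candidates))
    return results


def reduce_layer(reads, layer1, min_support=0.6):
    """Layer 2 (toy): merge evidence per position across groups, then correct."""
    merged = defaultdict(Counter)
    candidates = []
    for counts, cands in layer1:
        for pos, column in counts.items():
            merged[pos].update(column)  # link the same position across groups
        candidates.extend(cands)

    corrected = [list(seq) for _, seq in reads]
    for pos, read_id, i in candidates:
        base, n = merged[pos].most_common(1)[0]
        # Correct only when the merged consensus is well supported.
        if n / sum(merged[pos].values()) >= min_support:
            corrected[read_id][i] = base
    return ["".join(seq) for seq in corrected]


# Reads sampled from the made-up sequence ACGTACGTAC, with two substitutions.
reads = [
    (0, "ACGTAC"), (0, "ACGTAC"), (0, "ACCTAC"),  # third read: G->C at position 2
    (2, "GTACGT"),
    (4, "ACGTAC"), (4, "ACGAAC"),                 # last read: T->A at position 7
]
fixed = reduce_layer(reads, map_layer(reads, group_size=4))
print(fixed)
```

In this example the second group sees a one-to-one tie at position 7 and cannot decide on its own; only after the counts from the first group are merged in does the correct base reach the support threshold, which is the motivation for linking positions across groups.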
