FLAS: fast and high-throughput algorithm for PacBio long-read self-correction

MOTIVATION Third-generation PacBio sequencing produces very long reads that have greatly facilitated sequencing projects, but the reads contain about 15% sequencing errors and require error correction. For projects with long reads only, it is challenging both to correct the reads quickly and to correct a sufficient number of read bases, i.e. to achieve high-throughput self-correction. MECAT is currently among the fastest self-correction algorithms, but its throughput is relatively small (Xiao et al., 2017).

RESULTS Here we introduce FLAS, a wrapper algorithm of MECAT, which achieves high-throughput long-read self-correction while keeping MECAT's fast speed. FLAS finds additional alignments among the long reads prealigned by MECAT to improve the correction throughput, and removes misalignments to preserve accuracy. FLAS also uses corrected long-read regions to correct the uncorrected ones, further improving the throughput. In our performance tests on E. coli, S. cerevisiae, A. thaliana and human long reads, FLAS achieves 22.0-50.6% larger throughput than MECAT. Compared with self-correction algorithms other than MECAT, FLAS is 2-13x faster and its throughput is 9.8-281.8% larger. The FLAS-corrected long reads can be assembled into contigs with 13.1-29.8% larger N50 sizes than those obtained with MECAT.

AVAILABILITY The FLAS software can be downloaded for free from https://github.com/baoe/flas.
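
The results above are stated in terms of two metrics, correction throughput and contig N50. As a minimal sketch of how such figures are commonly computed (not the FLAS evaluation code), the Python below assumes that throughput means the total number of bases in the corrected reads and uses the standard N50 definition. The function names `throughput`, `n50` and `relative_gain_percent` are hypothetical and chosen only for illustration.

```python
def throughput(corrected_reads):
    """Total number of bases across corrected reads.

    Assumption: throughput is taken to mean the amount of corrected bases.
    """
    return sum(len(read) for read in corrected_reads)


def n50(lengths):
    """Standard N50: the largest length L such that contigs of length >= L
    together cover at least half of the total assembled bases."""
    if not lengths:
        return 0
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


def relative_gain_percent(new_value, baseline):
    """Percentage by which new_value exceeds baseline."""
    return 100.0 * (new_value - baseline) / baseline
```

For example, `relative_gain_percent(122, 100)` gives 22.0, which is how a statement such as "22.0% larger throughput" would be read under these assumptions.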

[1]  M. Schatz,et al.  Hybrid error correction and de novo assembly of single-molecule sequencing reads , 2012, Nature Biotechnology.

[2]  Shoshana Marcus,et al.  Error correction and assembly complexity of single molecule sequencing reads , 2014, bioRxiv.

[3]  M. Schatz,et al.  Phased diploid genome assembly with single-molecule real-time sequencing , 2016, Nature Methods.

[4]  Julian Parkhill,et al.  The extant World War 1 dysentery bacillus NCTC1: a genomic analysis , 2014, The Lancet.

[5]  Leena Salmela,et al.  LoRDEC: accurate and efficient long read error correction , 2014, Bioinform..

[6]  Piet Demeester,et al.  Jabba: hybrid error correction for long sequencing reads , 2015, Algorithms for Molecular Biology.

[7]  Laura F. Landweber,et al.  The Architecture of a Scrambled Genome Reveals Massive Levels of Genomic Rearrangement during Development , 2014, Cell.

[8]  Michael C. Schatz,et al.  Third-generation sequencing and the future of genomics , 2016, bioRxiv.

[9]  Alexey A. Gurevich,et al.  QUAST: quality assessment tool for genome assemblies , 2013, Bioinform..

[10]  S. Salzberg,et al.  Versatile and open software for comparing large genomes , 2004, Genome Biology.

[11]  Wing Hung Wong,et al.  Characterization of the human ESC transcriptome by hybrid sequencing , 2013, Proceedings of the National Academy of Sciences.

[12]  Richard Durbin,et al.  Fast and accurate short read alignment with Burrows-Wheeler transform , 2009, Bioinform..

[13]  Kin-Fan Au,et al.  PacBio Sequencing and Its Applications , 2015, Genom. Proteom. Bioinform..

[14]  Ilan Shomorony,et al.  HINGE: Long-Read Assembly Achieves Optimal Repeat Resolution , 2016, bioRxiv.

[15]  Roberto Grossi,et al.  Circular sequence comparison: algorithms and applications , 2016, Algorithms for Molecular Biology.

[16]  Thomas Hackl,et al.  proovread: large-scale high-accuracy PacBio correction through iterative short read consensus , 2014, Bioinform..

[17]  Feng Luo,et al.  MECAT: fast mapping, error correction, and de novo assembly for single-molecule sequencing reads , 2017, Nature Methods.

[18]  Glenn Tesler,et al.  Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory , 2012, BMC Bioinformatics.

[19]  J. Landolin,et al.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing , 2014, Nature Biotechnology.

[20]  S. Koren,et al.  Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation , 2016, bioRxiv.

[21]  S. Turner,et al.  Real-Time DNA Sequencing from Single Polymerase Molecules , 2009, Science.

[22]  Ergude Bao,et al.  HALC: High throughput algorithm for long read error correction , 2017, BMC Bioinformatics.

[23]  Aaron A. Klammer,et al.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data , 2013, Nature Methods.

[24]  Mark J. P. Chaisson,et al.  Resolving the complexity of the human genome using single-molecule sequencing , 2014, Nature.

[25]  Esko Ukkonen,et al.  Accurate self-correction of errors in long reads using de Bruijn graphs , 2016, Bioinform..

[26]  Eugene W. Myers,et al.  The fragment assembly string graph , 2005, ECCB/JBI.

[27]  Faraz Hach,et al.  CoLoRMap: Correcting Long Reads by Mapping short reads , 2016, Bioinform..

[28]  Jean-Michel Claverie,et al.  Pandoraviruses: Amoeba Viruses with Genomes Up to 2.5 Mb Reaching That of Parasitic Eukaryotes , 2013, Science.