Jabba: hybrid error correction for long sequencing reads

Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from large runtimes. Recently, a new hybrid error correcting method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned.

Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented.

Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using a very low amount of CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long highly erroneous sequences on a de Bruijn graph.
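To make the notion of a MEM seed concrete, the following is a minimal sketch and not Jabba's implementation. It assumes a single reference string (standing in for the sequence of one de Bruijn graph node) and uses a naive quadratic scan along diagonals; the function name `maximal_exact_matches` and the `min_len` parameter are hypothetical names chosen for illustration. A practical tool would use an index structure instead of this brute-force scan.

```python
def maximal_exact_matches(read, ref, min_len):
    """Return all maximal exact matches of length >= min_len.

    Each match is a tuple (read_pos, ref_pos, length). A match is maximal
    when it cannot be extended to the left or right without hitting a
    mismatch or the end of either sequence.
    Naive O(len(read) * len(ref)) scan; illustrative only.
    """
    mems = []
    n, m = len(read), len(ref)
    # A diagonal d groups all position pairs (i, j) with j - i == d.
    for d in range(-(n - 1), m):
        i = max(0, -d)
        j = i + d
        run_start = None  # read position where the current exact run began
        while i < n and j < m:
            if read[i] == ref[j]:
                if run_start is None:
                    run_start = i
            elif run_start is not None:
                # A mismatch closes the current run.
                length = i - run_start
                if length >= min_len:
                    mems.append((run_start, run_start + d, length))
                run_start = None
            i += 1
            j += 1
        # A run that reaches the end of a sequence is also maximal.
        if run_start is not None:
            length = i - run_start
            if length >= min_len:
                mems.append((run_start, run_start + d, length))
    return sorted(mems)


if __name__ == "__main__":
    ref = "ACGTTGCATGGC"
    read = "ACGTTGAATGGC"  # same sequence with one substitution at index 6
    print(maximal_exact_matches(read, ref, 4))  # [(0, 0, 6), (7, 7, 5)]
```

In this toy example the single substitution splits the read into two seeds, one on each side of the error. This hints at why seed length matters for high-error reads: the more frequent the errors, the shorter the exact matches that remain available as seeds.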
