Improved long read correction for de novo assembly using an FM-index

Long read sequencing is changing the landscape of genomic research, especially de novo assembly. Despite the high error rate inherent to long read technologies, increased read lengths dramatically improve the continuity and accuracy of genome assemblies. However, the cost and throughput of these technologies limits their application to complex genomes. One solution is to decrease the cost and time to assemble novel genomes by leveraging “hybrid” assemblies that use long reads for scaffolding and short reads for accuracy. To this end, we describe a novel application of a multi-string Burrows-Wheeler Transform with auxiliary FM-index to correct errors in long read sequences using a set of complementary short reads. We show that our method efficiently produces significantly higher quality corrected sequence than existing hybrid error-correction methods. We demonstrate the effectiveness of our method compared to state-of-the-art hybrid and long-read only de novo assembly methods.

[1]  Heng Li,et al.  Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences , 2015, Bioinform..

[2]  W. Wong,et al.  Improving PacBio Long Read Accuracy by Short Read Alignment , 2012, PloS one.

[3]  Leonard McMillan,et al.  Merging of multi-string BWTs with applications , 2014, Bioinform..

[4]  Alexey A. Gurevich,et al.  QUAST: quality assessment tool for genome assemblies , 2013, Bioinform..

[5]  Glenn Tesler,et al.  Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory , 2012, BMC Bioinformatics.

[6]  J. Landolin,et al.  Assembling large genomes with single-molecule sequencing and locality-sensitive hashing , 2014, Nature Biotechnology.

[7]  Giovanna Rosone,et al.  Lightweight BWT Construction for Very Large String Collections , 2011, CPM.

[8]  de Ng Dick Bruijn A combinatorial problem , 1946 .

[9]  Leonard McMillan,et al.  Short read error correction using an FM-index , 2015, 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[10]  Giovanni Manzini,et al.  An experimental study of an opportunistic index , 2001, SODA '01.

[11]  Kathleen Marchal,et al.  A network-based approach to identify substrate classes of bacterial glycosyltransferases , 2014, BMC Genomics.

[12]  N. Loman,et al.  A complete bacterial genome assembled de novo using only nanopore sequencing data , 2015, Nature Methods.

[13]  Leena Salmela,et al.  LoRDEC: accurate and efficient long read error correction , 2014, Bioinform..

[14]  Masahiro Kasahara,et al.  Performance comparison of second- and third-generation sequencers using a bacterial genome with two chromosomes , 2014, BMC Genomics.

[15]  M. Schatz,et al.  Hybrid error correction and de novo assembly of single-molecule sequencing reads , 2012, Nature Biotechnology.

[16]  Vineet Bafna,et al.  Cerulean: A Hybrid Assembly Using High Throughput Short and Long Reads , 2013, WABI.

[17]  Dmitry Antipov,et al.  hybridSPAdes: an algorithm for hybrid assembly of short and long reads , 2016, Bioinform..

[18]  Aaron A. Klammer,et al.  Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data , 2013, Nature Methods.

[19]  Heng Li Fast construction of FM-index for long sequence reads , 2014, Bioinform..

[20]  D. J. Wheeler,et al.  A Block-sorting Lossless Data Compression Algorithm , 1994 .

[21]  S. Salzberg,et al.  Repetitive DNA and next-generation sequencing: computational challenges and solutions , 2011, Nature Reviews Genetics.

[22]  Chengxi Ye,et al.  DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies , 2014, Scientific Reports.

[23]  Shoshana Marcus,et al.  Error correction and assembly complexity of single molecule sequencing reads , 2014, bioRxiv.