Computing the BWT and LCP array of a Set of Strings in External Memory

Indexing very large collections of strings, such as those produced by the widespread next-generation sequencing technologies, relies heavily on multi-string generalizations of the Burrows-Wheeler Transform (BWT). Recent developments in this field have resulted in external memory algorithms, motivated by the large memory requirements of in-memory approaches. The related problem of computing the Longest Common Prefix (LCP) array of a set of strings is instrumental in several algorithms: for example, in computing the suffix-prefix overlaps among strings, an essential step in many genome assembly algorithms. In this paper we propose a new external memory method to simultaneously build the BWT and the LCP array of a collection of $m$ strings of length $k$ in $O(mkl)$ time and I/O complexity, using $O(k + m)$ main memory, where $l$ is the maximum value in the LCP array.
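To make the two output arrays concrete, the following is a minimal in-memory sketch that builds them by naively sorting all suffixes. It is not the external memory method proposed in the paper and does not meet its time or memory bounds. The function name `bwt_lcp_naive` is hypothetical, and the sketch assumes one common convention: each string is terminated by its own sentinel, printed as `$`, where sentinels sort before every other character and ties between them are broken by the index of the string in the collection.

```python
def bwt_lcp_naive(strings):
    """Illustrative only: quadratic-space construction by sorting suffixes.

    Assumed convention (not taken from the paper): string i ends with a
    sentinel $_i, with $_i < $_j for i < j and every sentinel smaller than
    any ordinary character. All sentinels are printed as '$'.
    """
    entries = []
    for i, s in enumerate(strings):
        for j in range(len(s) + 1):      # j == len(s) is the sentinel-only suffix
            suffix = s[j:]               # suffix text without its sentinel
            preceding = s[j - 1] if j > 0 else "$"
            # A proper prefix sorts first in Python, which matches the effect
            # of a sentinel smaller than every character; equal suffix texts
            # are ordered by the string index i.
            entries.append((suffix, i, preceding))
    entries.sort(key=lambda e: (e[0], e[1]))

    bwt = "".join(preceding for _, _, preceding in entries)

    lcp = [0] * len(entries)
    for r in range(1, len(entries)):
        a, b = entries[r - 1][0], entries[r][0]
        n = 0
        # Sentinels are pairwise distinct, so they never extend a match.
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[r] = n
    return bwt, lcp


if __name__ == "__main__":
    bwt, lcp = bwt_lcp_naive(["abb", "aab"])
    print(bwt)  # bb$a$baa
    print(lcp)  # [0, 0, 0, 1, 2, 0, 1, 1]
```

For this collection of $m = 2$ strings of length $k = 3$, both arrays have $m(k + 1) = 8$ entries, and the maximum LCP value is $l = 2$.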
