External memory BWT and LCP computation for sequence collections with applications

Background
Sequencing technologies produce ever larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input, it is important to build these data structures in external memory while making the best possible use of the available RAM.

Results
We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections small enough that their BWTs can be computed in RAM using an optimal linear-time algorithm. Next, it merges the partial BWTs in external or semi-external memory, computing the LCP values in the process. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well-known problems in bioinformatics: the computation of maximal repeats, the all-pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs.

Conclusions
We prove that our algorithm performs $\mathcal{O}(n\,\mathsf{maxlcp})$ sequential I/Os, where $n$ is the total length of the collection and $\mathsf{maxlcp}$ is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences, but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
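To make the objects named in the abstract concrete, the following is a minimal in-memory sketch of the BWT and LCP array of a small string collection, together with the maximum LCP value that appears in the I/O bound. It is not the external memory algorithm described above: it simply sorts all suffixes explicitly, which is only feasible for toy inputs. The function name `bwt_lcp`, the use of `$` as the printed terminator, and the convention that the terminator of the i-th string sorts before every letter and before the terminators of later strings are assumptions made for illustration.

```python
def bwt_lcp(collection):
    """Naive BWT and LCP array of a collection of strings (illustration only).

    Assumption: each string gets its own terminator; the terminator of
    string i is encoded as the integer i, and every real character is
    shifted above all terminators so that terminators sort first.
    """
    k = len(collection)
    suffixes = []
    for i, s in enumerate(collection):
        # Encode the string as integers and append its unique terminator.
        encoded = tuple(ord(c) + k for c in s) + (i,)
        for j in range(len(encoded)):
            # BWT symbol: the character preceding the suffix in its own
            # string, or the terminator when the suffix is the whole string.
            prev = s[j - 1] if j > 0 else "$"
            suffixes.append((encoded[j:], prev))

    # Lexicographic order of all suffixes of all strings.
    suffixes.sort(key=lambda item: item[0])

    bwt = "".join(prev for _, prev in suffixes)

    # lcp[r] = length of the longest common prefix of suffixes r-1 and r.
    lcp = [0] * len(suffixes)
    for r in range(1, len(suffixes)):
        a, b = suffixes[r - 1][0], suffixes[r][0]
        h = 0
        while h < len(a) and h < len(b) and a[h] == b[h]:
            h += 1
        lcp[r] = h
    return bwt, lcp


if __name__ == "__main__":
    bwt, lcp = bwt_lcp(["banana"])
    print(bwt, lcp, max(lcp))
```

For the single string `banana` this yields the BWT `annb$aa`, the LCP array `[0, 0, 1, 3, 0, 0, 2]`, and a maximum LCP value of 3. Because the terminators are distinct, no LCP value ever extends across the end of a string.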
