LCP Array Construction in External Memory

The suffix array, one of the most important data structures for string processing, must be augmented with the longest-common-prefix (LCP) array in numerous applications. We describe the first external memory algorithm for constructing the LCP array given the suffix array as input. Previously, the only way to compute the LCP array for data larger than RAM was to use an external memory suffix array construction algorithm (SACA) with complex modifications that produce the LCP array as a by-product. Compared to the best prior method, our algorithm needs much less disk space (by more than a factor of three) and is significantly faster. Furthermore, our algorithm can be combined with any SACA, including better ones developed in the future.
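To make the problem statement concrete, the sketch below computes an LCP array from a text and its suffix array entirely in RAM. It is not the external memory algorithm described here; it uses the well-known linear-time in-memory technique commonly attributed to Kasai et al., and the function names `suffix_array` and `lcp_array` are chosen for illustration only. The naive suffix sorting is likewise just a stand-in for a real SACA.

```python
def suffix_array(text):
    # Naive construction by sorting suffixes; for illustration only,
    # a real SACA would be used on large inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[r] is the length of the longest common prefix of the suffixes
    # at sa[r - 1] and sa[r]; lcp[0] is defined as 0.
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r  # inverse permutation of the suffix array

    lcp = [0] * n
    h = 0  # number of characters already known to match
    for i in range(n):  # visit suffixes in text order
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]  # suffix preceding i in sorted order
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1  # the next suffix shares at least h - 1 characters
    return lcp


text = "banana"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
print(sa)   # [5, 3, 1, 0, 4, 2]
print(lcp)  # [0, 1, 3, 0, 0, 2]
```

The lookups `rank[i]` and `sa[rank[i] - 1]` jump around the arrays in an essentially random order, which is cheap in RAM but costly once the text and arrays live on disk; this is the access pattern an external memory method has to avoid.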
