Lightweight Lempel-Ziv Parsing

We introduce a new approach to LZ77 factorization that uses \(O(n/d)\) words of working space and \(O(dn)\) time for any \(d \ge 1\) (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior, particularly at the lowest memory levels and for highly repetitive data. As part of the algorithm, we describe new methods for computing matching statistics, which may be of independent interest.
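For readers unfamiliar with the term, the following minimal sketch shows what an LZ77 factorization is. It is a naive quadratic-time baseline written for illustration only, not the space-efficient algorithm described above. The function name `lz77_factorize` and the output convention (a `(source, length)` pair for a copied factor, `(char, 0)` for a literal) are assumptions made for this example.

```python
def lz77_factorize(text):
    """Naive LZ77 factorization (illustrative only, O(n^2) time or worse).

    Each factor is either a literal (char, 0) for a character with no
    earlier occurrence, or (source, length) meaning: copy `length`
    characters starting at earlier position `source`. The source may
    overlap the factor itself.
    """
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_src = 0, -1
        # Try every earlier starting position j < i.
        for j in range(i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_src = length, j
        if best_len == 0:
            factors.append((text[i], 0))   # first occurrence of this character
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors


print(lz77_factorize("abababab"))
```

On a highly repetitive input such as `"abababab"`, the whole tail after the first two characters collapses into a single factor, which is why such data yields few factors.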