Range Predecessor and Lempel-Ziv Parsing

The Lempel-Ziv parsing of a string (LZ77 for short) is one of the most important and widely used algorithmic tools in data compression and string processing. We show that the Lempel-Ziv parsing of a string of length $n$ over an alphabet of size $\sigma$ can be computed in $O(n\log\log\sigma)$ time ($O(n)$ time if randomization is allowed) using $O(n\log\sigma)$ bits of working space, that is, space proportional to the size of the input string in bits. The previous fastest algorithm using $O(n\log\sigma)$ space takes $O(n(\log\sigma+\log\log n))$ time. We also consider the important rightmost variant of the problem, where the goal is to associate with each phrase of the parsing its most recent (rightmost) previous occurrence in the input string. We solve this problem in $O(n(1+\log\sigma/\sqrt{\log n}))$ time, using the same working space as above. The previous best solution for rightmost parsing uses $O(n(1+\log\sigma/\log\log n))$ time and $O(n\log n)$ space. As a bonus, our solution for rightmost parsing provides a faster construction method for efficient 2D orthogonal range reporting, which is of independent interest.
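To make the rightmost variant concrete, the following is a minimal sketch of a naive, quadratic-time LZ77 parser. It is not the algorithm of this paper and uses none of its data structures; it only illustrates what the output of a rightmost parsing looks like. The function names `lz77_rightmost` and `decode`, and the phrase encoding (`(source, length)` for a copied phrase, `(None, char)` for a literal), are assumptions made for illustration.

```python
def lz77_rightmost(text):
    """Naive LZ77 parse (illustrative only, quadratic time or worse).

    Each phrase is either (None, char) for a character with no earlier
    occurrence, or (source, length) where `source` is the rightmost
    earlier start position of the longest match. Matches may overlap
    the phrase they are copied into.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        best_len, best_src = 0, None
        for j in range(i):  # candidate sources, scanned left to right
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            # '>=' lets a later source win ties, giving the rightmost one
            if length > 0 and length >= best_len:
                best_len, best_src = length, j
        if best_len == 0:
            phrases.append((None, text[i]))
            i += 1
        else:
            phrases.append((best_src, best_len))
            i += best_len
    return phrases


def decode(phrases):
    """Rebuild the string from the phrases, copying character by character
    so that self-overlapping sources are handled correctly."""
    out = []
    for src, x in phrases:
        if src is None:
            out.append(x)
        else:
            for k in range(x):
                out.append(out[src + k])
    return "".join(out)
```

For the string `abababb`, the last phrase is the single character `b`, which occurs earlier at positions 1, 3 and 5; the rightmost parsing records position 5, whereas a leftmost parsing would record position 1.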
