Lempel-Ziv Parsing in External Memory

In the 35 years since its discovery, the Lempel-Ziv factorization (or LZ77 parsing) has become a fundamental method for data compression and string processing. In many applications, computing the factorization is the bottleneck in both time and space. However, despite the growing need to apply LZ77 to massive data sets (for both storage and indexing), no algorithm to date scales to inputs that exceed the size of RAM. In this paper we describe the first algorithms for computing the LZ77 parsing efficiently in external memory.
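For readers unfamiliar with the factorization itself, the following is a minimal in-memory sketch of what an LZ77 parsing is. It is not the external-memory algorithm of this paper. It assumes the common self-referential definition, in which each phrase is either a single new character or the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps allowed). The function names `lz77_factorize` and `lz77_decode` and the tuple encoding of phrases are illustrative choices, not taken from the paper.

```python
def lz77_factorize(text):
    """Naive quadratic-time LZ77 factorization (illustrative only).

    Returns a list of phrases. A phrase is either
      (char, 0)      -- a literal character not seen before, or
      (pos, length)  -- a copy of `length` characters starting at
                        earlier position `pos` (may overlap itself).
    """
    phrases = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_pos = 0, -1
        # Try every earlier starting position j < i and keep the longest match.
        for j in range(i):
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len == 0:
            phrases.append((text[i], 0))   # literal phrase
            i += 1
        else:
            phrases.append((best_pos, best_len))
            i += best_len
    return phrases


def lz77_decode(phrases):
    """Rebuild the text from the phrase list produced above."""
    out = []
    for first, length in phrases:
        if length == 0:
            out.append(first)
        else:
            # Copy one character at a time so overlapping copies work.
            for k in range(length):
                out.append(out[first + k])
    return "".join(out)
```

For example, `lz77_factorize("abababa")` yields two literal phrases for `a` and `b`, followed by one phrase that copies five characters from position 0. The sketch scans the whole preceding text for each phrase, which is exactly the kind of random access that becomes infeasible once the input no longer fits in RAM.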
