Effective Construction of Relative Lempel-Ziv Dictionaries

Web crawls generate vast quantities of text, retained and archived by the search services that initiate them. To store such data while keeping storage costs low, yet still provide some level of random access to the compressed data, efficient and effective compression techniques are critical. The Relative Lempel-Ziv (RLZ) scheme provides fast decompression and retrieval of documents from within large compressed collections, and even with a relatively small RAM-resident dictionary is competitive with adaptive compression schemes. To date, the dictionaries required by RLZ compression have been formed by concatenating substrings regularly sampled from the underlying document collection, then pruning the result in a manner that seeks to retain only the high-use sections. In this work, we develop new dictionary design heuristics based on effective construction rather than on pruning; we identify dictionary construction as a (string) covering problem. To avoid the complications of string covering algorithms on large collections, we focus on k-mers and their frequencies. First, using a reservoir sampler, we efficiently identify the most common k-mers. Then, since a collection typically comprises regions of local similarity, we select in each "epoch" a segment whose k-mers together achieve, locally, the highest coverage score. The dictionary is formed by concatenating these epoch-derived segments. Our selection process is inspired by the greedy approach to the Set Cover problem. Compared with CARE, the best existing pruning method, our scheme has a similar construction time but achieves better compression effectiveness, with relative gains of up to 27% over several multi-gigabyte document collections.
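As a rough illustration of the two-stage process the abstract describes, the Python sketch below first reservoir-samples k-mer positions to estimate which k-mers are frequent, and then makes a greedy, set-cover-style pass over fixed-size epochs, keeping from each epoch the candidate segment whose not-yet-covered k-mers carry the largest estimated frequency mass. The fixed-size epoch and segment splitting, the parameter names, and the scoring by summed sample counts are illustrative assumptions only, not the paper's actual implementation.

```python
import random
from collections import Counter

def sample_kmers(text, k, sample_size, seed=0):
    """Estimate frequent k-mers by reservoir-sampling k-mer start positions
    (Algorithm R) and counting the sampled k-mers. Stands in for the paper's
    reservoir-sampling pass over the whole collection."""
    rng = random.Random(seed)
    reservoir = []
    for i in range(len(text) - k + 1):
        if i < sample_size:
            reservoir.append(i)
        else:
            j = rng.randint(0, i)
            if j < sample_size:
                reservoir[j] = i
    return Counter(text[i:i + k] for i in reservoir)

def build_dictionary(epochs, frequent_kmers, k, segment_len):
    """Greedy, set-cover-inspired selection: from each epoch keep the segment
    whose k-mers add the most not-yet-covered frequency mass, then concatenate
    the chosen segments to form the dictionary."""
    covered = set()
    parts = []
    for epoch in epochs:
        best_segment, best_new, best_gain = None, set(), -1
        # Candidate segments: non-overlapping windows within the epoch
        # (an illustrative simplification).
        for start in range(0, max(1, len(epoch) - segment_len + 1), segment_len):
            segment = epoch[start:start + segment_len]
            kmers = {segment[i:i + k] for i in range(len(segment) - k + 1)}
            new = kmers - covered
            gain = sum(frequent_kmers.get(km, 0) for km in new)
            if gain > best_gain:
                best_segment, best_new, best_gain = segment, new, gain
        if best_segment:
            parts.append(best_segment)
            covered |= best_new
    return "".join(parts)

# Example usage on a toy collection, split into fixed-size epochs.
collection = "the quick brown fox jumps over the lazy dog " * 500
epochs = [collection[i:i + 1000] for i in range(0, len(collection), 1000)]
freq = sample_kmers(collection, k=8, sample_size=2000)
dictionary = build_dictionary(epochs, freq, k=8, segment_len=200)
```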
