Principled dictionary pruning for low-memory corpus compression

Compression of collections, such as text databases, can both reduce space consumption and increase retrieval efficiency through better caching and better exploitation of the memory hierarchy. A promising technique is relative Lempel-Ziv coding, in which a sample of material from the collection serves as a static dictionary; in previous work, this method demonstrated extremely fast decoding and good compression ratios while allowing random access to individual items. However, there is a trade-off between dictionary size and compression ratio, motivating the search for a compact yet similarly effective dictionary. Earlier work observed that, since the dictionary is generated by sampling, some of it (selected substrings) may be discarded with little loss in compression; unfortunately, simple dictionary pruning approaches are ineffective. We develop a formal model of our approach, based on generating an optimal dictionary for a given collection within a memory bound. We derive measures for identifying low-value substrings in the dictionary, and show on text collections of a variety of sizes that halving the dictionary size leads to only marginal loss in compression ratio. This is a dramatic improvement on previous approaches.
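To make the underlying technique concrete, the following is a minimal sketch of relative Lempel-Ziv coding against a static dictionary, as described above: the text is greedily parsed into (offset, length) phrases that copy from the dictionary, with unmatched characters stored as literals. The function names and the quadratic longest-match scan are illustrative only; practical RLZ implementations locate matches with a suffix array over the dictionary.

```python
# Hedged sketch of relative Lempel-Ziv (RLZ) factorization against a
# static dictionary. Function names are hypothetical; a real system
# would use a suffix array for matching and bit-pack the factors.

def rlz_factorize(text: str, dictionary: str):
    """Greedily parse `text` into (offset, length) phrases that point
    into `dictionary`; characters with no match become literals."""
    factors = []
    i = 0
    while i < len(text):
        best_off, best_len = -1, 0
        # Find the longest dictionary substring matching text[i:].
        # (Linear scan for clarity, not efficiency.)
        for off in range(len(dictionary)):
            length = 0
            while (i + length < len(text)
                   and off + length < len(dictionary)
                   and dictionary[off + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        if best_len > 0:
            factors.append((best_off, best_len))  # copy from dictionary
            i += best_len
        else:
            factors.append(text[i])               # literal character
            i += 1
    return factors


def rlz_decode(factors, dictionary: str) -> str:
    """Reverse the parse: expand each phrase against the dictionary."""
    out = []
    for f in factors:
        if isinstance(f, tuple):
            off, length = f
            out.append(dictionary[off:off + length])
        else:
            out.append(f)
    return "".join(out)
```

Because decoding is just a sequence of dictionary copies, it is fast and any item can be decoded independently of the rest of the collection; pruning the dictionary, as studied in this paper, shrinks the structure that every such copy references.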
