A New Searchable Variable-to-Variable Compressor

Word-based compression of natural language text has been shown to offer a good trade-off between compression ratio and speed, obtaining compression ratios close to 30% and very fast decompression. Additionally, it permits fast searches over the compressed text using Boyer-Moore type algorithms. Such compressors process fixed source symbols (words) and assign them variable-byte-length codewords, thus following a fixed-to-variable approach. We present a new variable-to-variable compressor (v2vdc) that uses both words and phrases as source symbols, which are encoded with a variable-length scheme. The phrases are chosen using the longest common prefix information on the suffix array of the text, so as to favor long and frequent phrases. We obtain compression ratios close to those of p7zip and ppmdi, better than those of bzip2, and 8-10 percentage points lower than those of the equivalent word-based compressor. In addition, v2vdc is among the fastest to decompress and allows efficient direct search of the compressed text, in some cases the fastest to date.
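To illustrate how longest common prefix (LCP) information on a suffix array exposes repeated phrases, the following is a minimal sketch over a sequence of word tokens. It is not the v2vdc implementation: the function names (suffix_array, lcp_array, repeated_phrases) and the min_len parameter are assumptions made for illustration, the suffix array is built by naive sorting rather than by a linear-time algorithm, and no frequency-based selection or encoding of phrases is attempted.

```python
def suffix_array(tokens):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Fine for a small example; real systems use specialized algorithms.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def lcp_array(tokens, sa):
    # Kasai-style computation: lcp[r] is the number of leading tokens shared
    # by the suffixes starting at sa[r - 1] and sa[r]; lcp[0] is 0.
    n = len(tokens)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and tokens[i + h] == tokens[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_phrases(tokens, min_len=2):
    # Hypothetical helper: every LCP value >= min_len marks a phrase of that
    # many words that occurs at least twice in the text.
    sa = suffix_array(tokens)
    lcp = lcp_array(tokens, sa)
    phrases = set()
    for r in range(1, len(sa)):
        if lcp[r] >= min_len:
            phrases.add(tuple(tokens[sa[r]:sa[r] + lcp[r]]))
    return phrases
```

For the token sequence "to be or not to be", the only phrase of two or more words that repeats is ("to", "be"). A compressor following the approach described above would additionally weigh phrase length against frequency before adding a phrase to its vocabulary.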
