Improving Parse Trees for Efficient Variable-to-Fixed Length Codes

We address the problem of improving variable-length-to-fixed-length codes (VF codes). A VF code, as considered here, is an encoding scheme that parses an input text into variable-length substrings and then assigns a fixed-length codeword to each parsed substring. VF codes have favourable properties for fast decoding and fast compressed pattern matching, but their compression ratios are worse than those of the latest compression methods. The compression ratio of a VF code depends on the parse tree used as a dictionary. To obtain a better compression ratio, we present several improved methods for constructing parse trees. All of them are heuristic solutions, since constructing the optimal parse tree is intractable. We compared our methods with previous VF codes and showed experimentally that their compression ratios reach the level of state-of-the-art compression methods.
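To make the notion of a parse tree used as a dictionary concrete, the following is a minimal sketch of a VF code. It uses the classic Tunstall construction for a memoryless source as a stand-in, not the improved methods proposed in this paper. The function names (`build_tunstall`, `encode`, `decode`), the symbol probabilities, and the tail-handling convention are illustrative assumptions.

```python
import heapq

def build_tunstall(probs, bits):
    """Return the leaves of a Tunstall parse tree as a sorted list of strings.

    Assumes a memoryless source and len(probs) <= 2 ** bits.
    """
    symbols = sorted(probs)
    # Max-heap via negated probabilities; each entry is one leaf of the tree.
    heap = [(-probs[s], s) for s in symbols]
    heapq.heapify(heap)
    # Expanding a leaf replaces it with len(symbols) children,
    # so the number of leaves grows by len(symbols) - 1 per step.
    while len(heap) + len(symbols) - 1 <= 2 ** bits:
        neg_p, word = heapq.heappop(heap)  # most probable leaf
        for s in symbols:
            heapq.heappush(heap, (neg_p * probs[s], word + s))
    return sorted(word for _, word in heap)

def encode(text, words):
    """Parse text into dictionary words; each word maps to a fixed-length index."""
    index = {w: i for i, w in enumerate(words)}
    codes, start = [], 0
    for end in range(1, len(text) + 1):
        # The leaves are prefix-free, so the first match is the only match.
        if text[start:end] in index:
            codes.append(index[text[start:end]])
            start = end
    if start < len(text):
        # Assumed convention: a leftover tail is a proper prefix of some leaf;
        # emit that leaf and let the decoder truncate to the original length.
        tail = text[start:]
        codes.append(next(i for i, w in enumerate(words) if w.startswith(tail)))
    return codes, len(text)

def decode(codes, length, words):
    """Decoding is a table lookup per codeword, followed by truncation."""
    return "".join(words[c] for c in codes)[:length]

# Hypothetical source statistics, chosen only for illustration.
probs = {"a": 0.7, "b": 0.2, "c": 0.1}
words = build_tunstall(probs, bits=3)
codes, n = encode("aaabacab", words)
print(words, codes, decode(codes, n, words))
```

Every codeword occupies the same number of bits, so decoding reduces to a table lookup per codeword, which is the property the abstract credits for fast decoding. The compression ratio is determined entirely by which strings end up as leaves of the parse tree.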
