Off-line dictionary-based compression

Dictionary-based modeling is a mechanism used in many practical compression schemes. In most implementations of dictionary-based compression the encoder operates on-line, incrementally inferring its dictionary of available phrases from the parts of the message it has already processed. An alternative approach is to use the full message to infer a complete dictionary in advance, and to include an explicit representation of that dictionary as part of the compressed message. We develop a compression scheme that combines a simple but powerful phrase derivation method with a compact dictionary encoding. The scheme is highly efficient, particularly in decompression, and has characteristics that make it a favorable choice when compressed data is to be searched directly. We describe data structures and algorithms that allow our mechanism to operate in linear time and space.
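To make the off-line idea concrete, the following is a minimal sketch and not the scheme described above. The abstract does not name its phrase derivation method, so the sketch assumes one simple possibility: repeatedly replace the most frequent adjacent pair of symbols with a new symbol, recording each replacement as a dictionary phrase. The names `count_pairs`, `compress`, and `decompress` are hypothetical. The loop is naive and takes quadratic time, so it does not reproduce the linear-time data structures, and the dictionary is left as a plain Python `dict` rather than given a compact encoding.

```python
from collections import Counter


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent symbol pair."""
    counts = Counter()
    last_pos = {}
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        # A pair (x, x) can overlap its previous occurrence; count it once.
        if last_pos.get(pair) == i - 1:
            continue
        counts[pair] += 1
        last_pos[pair] = i
    return counts


def compress(message):
    """Derive phrases from the whole message (assumed pair-replacement rule).

    Returns (sequence, dictionary). Terminals are one-character strings;
    derived phrases are integers mapping to a (left, right) pair.
    """
    seq = list(message)
    dictionary = {}
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats, so no further phrase is worth adding
        new_symbol = len(dictionary)
        dictionary[new_symbol] = pair
        out = []
        i = 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(new_symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, dictionary


def decompress(seq, dictionary):
    """Expand phrase symbols back into the original message."""
    out = []
    stack = list(reversed(seq))
    while stack:
        symbol = stack.pop()
        if isinstance(symbol, int):
            left, right = dictionary[symbol]
            stack.append(right)  # pushed first so that left is expanded first
            stack.append(left)
        else:
            out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    text = "abracadabra abracadabra"
    sequence, phrases = compress(text)
    assert decompress(sequence, phrases) == text
    print(len(text), "symbols reduced to", len(sequence), "with", len(phrases), "phrases")
```

The sketch shows why decompression can be cheap in a scheme of this kind: the decoder only looks up explicit dictionary entries and copies symbols, with no need to re-infer the dictionary as an on-line decoder must.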
