Lempel-Ziv compression of structured text

We describe LZCS, a novel Lempel-Ziv approach for compressing structured documents that takes advantage of redundant information in their structure. The main idea is that frequently repeated subtrees can be replaced by a backward reference to their first occurrence. The main advantage is that compressed documents generated by LZCS are easy to display, access at random, and navigate. In a second stage, processed documents can be further compressed using a semiadaptive technique, so that random access and navigability remain possible. LZCS is especially efficient at compressing collections of highly structured data, such as XML forms, invoices, e-commerce and web-service exchange documents. A comparison against structure-based and standard compressors shows that LZCS is a competitive choice for this type of document, while the others are not well suited to supporting navigation or random access.
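The subtree-replacement idea can be illustrated with a minimal Python sketch. It is not the LZCS format itself: the token layout, the function names (`subtree_key`, `compress`, `expand`), and the choice to ignore mixed-content tail text are all assumptions made for illustration. The sketch walks an XML tree in document order and emits a backward reference whenever a subtree is identical to one already seen.

```python
import xml.etree.ElementTree as ET


def subtree_key(node):
    """Canonical, hashable description of a subtree (hypothetical helper).

    Assumption: tail text (mixed content) is ignored and text is stripped.
    """
    return (
        node.tag,
        tuple(sorted(node.attrib.items())),
        (node.text or "").strip(),
        tuple(subtree_key(child) for child in node),
    )


def compress(xml_text):
    """Return a token list in which repeated subtrees become ("ref", index)."""
    root = ET.fromstring(xml_text)
    first_seen = {}  # subtree key -> index of the subtree's "open" token
    tokens = []

    def visit(node):
        key = subtree_key(node)
        if key in first_seen:
            # Repeated subtree: point back to its first occurrence.
            tokens.append(("ref", first_seen[key]))
            return
        first_seen[key] = len(tokens)
        tokens.append(("open", node.tag, key[1], key[2]))
        for child in node:
            visit(child)
        tokens.append(("close", node.tag))

    visit(root)
    return tokens


def expand(tokens, i=0):
    """Rebuild the element starting at token i; return (element, next index)."""
    token = tokens[i]
    if token[0] == "ref":
        # Follow the backward reference and copy the referenced subtree.
        element, _ = expand(tokens, token[1])
        return element, i + 1
    _, tag, attrib, text = token
    element = ET.Element(tag, dict(attrib))
    element.text = text or None
    i += 1
    while tokens[i][0] != "close":
        child, i = expand(tokens, i)
        element.append(child)
    return element, i + 1


if __name__ == "__main__":
    doc = (
        "<orders>"
        "<item><sku>A1</sku><qty>2</qty></item>"
        "<item><sku>A1</sku><qty>2</qty></item>"
        "<item><sku>B7</sku><qty>2</qty></item>"
        "</orders>"
    )
    for token in compress(doc):
        print(token)
```

Because a reference points at a token position in the already-processed stream, a reader can expand any subtree by following the pointer without decoding the rest of the document, which is the property that keeps navigation and random access simple in this sketch.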
