Multi-stream word-based compression algorithm

In this article, we present a novel word-based lossless compression algorithm for text files that uses a semi-static model. We call it the Multi-stream Word-based Compression Algorithm (MWCA) because it stores the compressed forms of the words in three separate streams, depending on their frequencies in the text. It also stores two dictionaries and a bit vector as side information. In our experiments, MWCA obtains a compression ratio of over 3.23 bpc (bits per character) on average and 2.88 bpc on files larger than 50 MB. If a variable-length encoder such as Huffman coding is applied after MWCA, these ratios drop to 2.63 and 2.44 bpc, respectively. Thanks to its multi-stream structure, MWCA could be a good solution, especially for storing and searching large text data.
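The abstract does not specify how the streams, dictionaries, and bit vector are laid out, so the following is only a minimal sketch of the general idea of a semi-static (two-pass), frequency-tiered, multi-stream word coder, not the MWCA format itself. The tier sizes, the `selector` list that records which stream each word went to, and all function names are assumptions made for illustration. The helper `bits_per_character` shows how a bpc figure such as those quoted above is computed.

```python
from collections import Counter


def encode(text, tier_sizes=(2, 4)):
    """Two-pass sketch: rank words by frequency, then route each word's
    index into one of three streams according to its frequency tier.
    Tier sizes are hypothetical, chosen small for the example."""
    words = text.split()
    # Pass 1 (semi-static model): collect word frequencies for the whole text.
    ranked = [w for w, _ in Counter(words).most_common()]
    a, b = tier_sizes
    tiers = [ranked[:a], ranked[a:a + b], ranked[a + b:]]
    index = {w: (t, i) for t, tier in enumerate(tiers) for i, w in enumerate(tier)}
    # Pass 2: emit each word's position within its tier into that tier's stream.
    streams = ([], [], [])
    selector = []  # assumption: records the stream used for each word
    for w in words:
        t, i = index[w]
        streams[t].append(i)
        selector.append(t)
    return tiers, streams, selector


def decode(tiers, streams, selector):
    """Rebuild the word sequence by reading the streams in selector order."""
    cursors = [0, 0, 0]
    out = []
    for t in selector:
        out.append(tiers[t][streams[t][cursors[t]]])
        cursors[t] += 1
    return " ".join(out)


def bits_per_character(compressed_bytes, original_chars):
    """bpc = 8 * compressed size in bytes / number of original characters."""
    return 8 * compressed_bytes / original_chars
```

In this sketch the most frequent words land in the first stream with small indices, which is what would let a later variable-length coder spend fewer bits on them. The sketch only round-trips text whose words are separated by single spaces; a real coder also has to preserve separators and punctuation.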
