n-Gram-Based Text Compression

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. On the same dataset, it achieves a significantly better compression ratio than state-of-the-art methods. Given a text, the method first splits it into n-grams and then encodes them using the n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. To build the dictionaries, we collected a 2.5 GB text corpus from Vietnamese news agencies and extracted n-grams from unigrams to five-grams, yielding dictionaries with a total size of 12 GB. To evaluate the method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
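The dictionary-based encoding described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the window-selection rule or the byte layout, so the sketch assumes a greedy longest-match strategy (try a five-gram first, then shorter n-grams), represents each code as a hypothetical `(n, index)` pair instead of two to four packed bytes, and stores unknown words as literals. The names `build_dictionaries`, `encode`, and `decode` are illustrative only.

```python
from typing import Dict, List, Tuple, Union

NGram = Tuple[str, ...]
Dictionaries = Dict[int, Dict[NGram, int]]
# A token is (n, index) for a dictionary hit, or (0, word) for a literal.
Token = Tuple[int, Union[int, str]]


def build_dictionaries(corpus: List[str], max_n: int = 5) -> Dictionaries:
    """Assign an integer index to every n-gram (n = 1..max_n) seen in the corpus."""
    dicts: Dictionaries = {n: {} for n in range(1, max_n + 1)}
    for sentence in corpus:
        words = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(words) - n + 1):
                gram = tuple(words[i:i + n])
                # Keep the first index assigned to each distinct n-gram.
                dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(text: str, dicts: Dictionaries, max_n: int = 5) -> List[Token]:
    """Greedy longest-match encoding (an assumption, not the paper's exact rule)."""
    words = text.split()
    tokens: List[Token] = []
    i = 0
    while i < len(words):
        # Try the widest window first, then shrink it down to a unigram.
        for n in range(min(max_n, len(words) - i), 0, -1):
            index = dicts[n].get(tuple(words[i:i + n]))
            if index is not None:
                tokens.append((n, index))
                i += n
                break
        else:
            # No dictionary contains this word: emit it as a literal.
            tokens.append((0, words[i]))
            i += 1
    return tokens


def decode(tokens: List[Token], dicts: Dictionaries) -> str:
    """Invert encode(); assumes words were separated by single spaces."""
    inverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    words: List[str] = []
    for n, value in tokens:
        if n == 0:
            words.append(value)
        else:
            words.extend(inverse[n][value])
    return " ".join(words)


if __name__ == "__main__":
    dictionaries = build_dictionaries(["a b c d e f"])
    stream = encode("a b c d e f", dictionaries)
    print(stream)
    print(decode(stream, dictionaries))
```

In this toy setting a six-word sentence collapses to two codes, one five-gram and one unigram. A real encoder along the lines of the abstract would additionally pack each code into a fixed number of bytes per dictionary.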
