A syllable-based method for Vietnamese text compression

Text compression reduces the size of text files, which increases transfer rates and saves storage space. Many approaches have been proposed for languages such as English, Chinese, Turkish, Japanese, and French. In this paper, we propose a method to compress Vietnamese text using syllables, based on morphology and dictionaries. Our method first splits each morphosyllable into a consonant and a syllable, then encodes each part using dictionaries of consonants and syllables. Because Vietnamese has six tone marks, we build six separate syllable dictionaries. We collected a test set of 20 text files of different sizes to evaluate our system. Experimental results show that our system achieves a compression ratio of around 73%. Compared with WinZIP version 19.5 and WinRAR version 5.21, our method achieves a higher compression ratio when the text file is small. Our method can therefore be applied efficiently to short texts such as SMS messages and text messages on social networks.
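
The splitting and dictionary lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the consonant inventory `INITIALS`, the tone numbering in `TONE_MARKS` (with the unmarked tone treated as the sixth class, index 0), the helper names, and the `(consonant_id, tone, syllable_id)` output triple are all assumptions made for illustration. The abstract does not specify the actual dictionaries or the bit-level code layout.

```python
import unicodedata

# Assumed inventory of initial consonants (illustrative, not taken from the paper).
# Sorted longest-first so that "ngh" is tried before "ng" and "n".
INITIALS = sorted(
    ["ngh", "ng", "nh", "ch", "gh", "kh", "ph", "th", "tr",
     "b", "c", "d", "đ", "g", "h", "k", "l", "m", "n",
     "p", "q", "r", "s", "t", "v", "x"],
    key=len, reverse=True,
)

# Combining tone marks mapped to hypothetical tone indices 1..5;
# index 0 is reserved for syllables without a tone mark.
TONE_MARKS = {
    "\u0300": 1,  # grave
    "\u0301": 2,  # acute
    "\u0309": 3,  # hook above
    "\u0303": 4,  # tilde
    "\u0323": 5,  # dot below
}


def tone_of(syllable):
    """Return the assumed tone index (0..5) of a syllable."""
    # NFD exposes tone marks as separate combining characters.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return 0


def split_morphosyllable(word):
    """Split a morphosyllable into (initial consonant, remaining syllable)."""
    word = unicodedata.normalize("NFC", word.lower())
    for initial in INITIALS:
        if word.startswith(initial) and len(word) > len(initial):
            return initial, word[len(initial):]
    return "", word  # no initial consonant, e.g. "anh"


def build_dictionaries(words):
    """Build one consonant dictionary and six per-tone syllable dictionaries."""
    consonants = {}
    syllables = [dict() for _ in range(6)]
    for w in words:
        initial, rest = split_morphosyllable(w)
        consonants.setdefault(initial, len(consonants))
        table = syllables[tone_of(rest)]
        table.setdefault(rest, len(table))
    return consonants, syllables


def encode(word, consonants, syllables):
    """Encode a word as a hypothetical (consonant_id, tone, syllable_id) triple."""
    initial, rest = split_morphosyllable(word)
    tone = tone_of(rest)
    return consonants[initial], tone, syllables[tone][rest]


consonants, syllables = build_dictionaries(["việt", "nam", "trường"])
code = encode("trường", consonants, syllables)
```

Keeping a separate table per tone means each syllable index only has to be unique within its tone class, so the indices stay small. A real encoder would still need a compact bit layout for the triple and a fallback for tokens missing from the dictionaries, neither of which is shown here.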
