A lossless text compression technique using syllable-based morphology

In this paper, we present a new lossless text compression technique that exploits the syllable-based morphology of multi-syllabic languages. The proposed algorithm partitions words into their syllables and then produces shorter bit representations of those syllables for compression. The method has six main components: source file, filtering unit, syllable unit, compression unit, dictionary file, and target file. The number of bits used to code each syllable depends on the number of entries in the dictionary file. The proposed algorithm was implemented and tested on 20 texts of varying lengths drawn from different fields. The results indicate compression of up to 43%.
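The pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not give the syllabification rules, the behaviour of the filtering unit, or the exact code assignment. The sketch assumes a naive splitting heuristic (a new syllable starts at a consonant followed by a vowel), treats every run of non-letters as a single dictionary entry, and assumes fixed-width codes of ceil(log2(N)) bits for a dictionary of N entries, which is only one possible reading of "depends on the number of entries". All function names are hypothetical.

```python
import re

VOWELS = set("aeiouAEIOU")  # assumption: ASCII vowels only


def syllabify(word):
    """Naive heuristic (assumed, not from the paper): start a new syllable
    at a consonant that is followed by a vowel, provided the current
    syllable already contains a vowel."""
    parts, start = [], 0
    for i in range(1, len(word) - 1):
        starts_syllable = word[i] not in VOWELS and word[i + 1] in VOWELS
        has_vowel = any(c in VOWELS for c in word[start:i])
        if starts_syllable and has_vowel:
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


def tokenize(text):
    """Split text into syllables; non-letter runs are kept as single units."""
    units = []
    for tok in re.findall(r"[A-Za-z]+|[^A-Za-z]+", text):
        is_word = tok.isascii() and tok.isalpha()
        units.extend(syllabify(tok) if is_word else [tok])
    return units


def compress(text):
    """Return (dictionary, code width in bits, bit string)."""
    units = tokenize(text)
    dictionary = sorted(set(units))
    # Fixed-width code: ceil(log2(N)) bits, at least 1 (assumed scheme).
    width = max(1, (len(dictionary) - 1).bit_length())
    index = {u: i for i, u in enumerate(dictionary)}
    bits = "".join(format(index[u], f"0{width}b") for u in units)
    return dictionary, width, bits


def decompress(dictionary, width, bits):
    """Invert compress() by reading the bit string in width-sized chunks."""
    return "".join(
        dictionary[int(bits[i:i + width], 2)]
        for i in range(0, len(bits), width)
    )


if __name__ == "__main__":
    text = "kitap okul kitap"
    dictionary, width, bits = compress(text)
    print(dictionary, width, len(bits))
    print(decompress(dictionary, width, bits) == text)
```

In this sketch the coded payload shrinks as long as the dictionary stays small relative to the text, but the dictionary itself must also be stored or shared, and that cost is not counted here.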
