Using clustering to improve WLZ77 compression

Information retrieval systems (IRS) of many types are being built, and they store ever more documents. A fundamental step in an IRS is building the textual database and compressing the documents stored in it. One option for compressing textual data is word-based compression. Several word-based compression algorithms based on Huffman coding, LZW, or the BWT have been proposed. In this paper, we describe a word-based compression method based on the LZ77 algorithm. An IRS can also perform cluster analysis of the textual database to improve the quality of answers to users' queries. The information obtained from clustering can be very helpful in compression. This paper presents word-based compression that uses information about the cluster hierarchy. The experimental results provided at the end of the paper were obtained not only with the well-known word-based compression algorithms WBW and WLZW but also with the relatively new WLZ77 algorithm.
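To make the idea of word-based LZ77 concrete, the following is a minimal sketch and not the WLZ77 implementation evaluated in the paper. It assumes the text is split into alternating word and nonword tokens, and that classic LZ77 triples (offset, match length, next token) are formed over tokens instead of characters. The function names, the window size, and the brute-force match search are illustrative assumptions; a practical coder would also entropy-code the triples.

```python
import re


def tokenize(text):
    # Split into alternating runs of word and nonword characters, so that
    # joining the tokens reproduces the input exactly (assumed token model).
    return re.findall(r"\w+|\W+", text)


def wlz77_encode(tokens, window=4096, max_len=255):
    # Hypothetical word-level LZ77: emit (offset, length, next_token) triples,
    # where offset and length are counted in tokens, not characters.
    out = []
    i, n = 0, len(tokens)
    while i < n:
        best_off, best_len = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Keep at least one token unmatched to serve as the literal.
            while k < max_len and i + k < n - 1 and tokens[j + k] == tokens[i + k]:
                k += 1
            if k > best_len:
                best_off, best_len = i - j, k
        out.append((best_off, best_len, tokens[i + best_len]))
        i += best_len + 1
    return out


def wlz77_decode(triples):
    # Rebuild the token stream; copying one token at a time keeps
    # overlapping matches (length > offset) correct.
    tokens = []
    for off, length, literal in triples:
        start = len(tokens) - off
        for k in range(length):
            tokens.append(tokens[start + k])
        tokens.append(literal)
    return tokens
```

Because a match is found only among tokens already inside the window, placing similar documents next to each other, for example by following a cluster hierarchy, would plausibly give the encoder more and longer matches. This is the intuition behind combining clustering with compression, stated here as an interpretation of the abstract.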
