An Efficient Text Compression Algorithm - Data Mining Perspective

This paper explores data mining from a novel compression perspective. Frequent Pattern Mining, an important phase of Association Rule Mining, is employed in the process of Huffman Encoding for lossless text compression. The conventional Apriori algorithm is refined with efficient pruning strategies to optimize the number of patterns employed in encoding. Detailed simulations of the proposed algorithms against conventional Huffman Encoding were carried out over benchmark datasets, and the results indicate significant gains in compression ratio.
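The general idea of feeding mined patterns into a Huffman coder can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names, the `min_support` and `max_len` parameters, the use of contiguous substrings as patterns, and the greedy longest-match tokenization are all assumptions made for illustration, and the paper's specific pruning strategies are not reproduced here.

```python
import heapq
from collections import Counter
from itertools import count


def mine_frequent_patterns(text, min_support=2, max_len=4):
    """Level-wise (Apriori-style) mining of frequent substrings (illustrative)."""
    # Every single character is kept so that any text remains encodable.
    frequent = set(text)
    current = {p for p, c in Counter(text).items() if c >= min_support}
    for k in range(2, max_len + 1):
        counts = Counter(
            text[i:i + k]
            for i in range(len(text) - k + 1)
            # Apriori pruning: both (k-1)-length sub-patterns must be frequent.
            if text[i:i + k - 1] in current and text[i + 1:i + k] in current
        )
        current = {p for p, c in counts.items() if c >= min_support}
        if not current:
            break
        frequent |= current
    return frequent


def tokenize(text, patterns):
    """Greedy longest-match split of text into mined patterns (an assumption)."""
    longest = max(map(len, patterns), default=0)
    tokens, i = [], 0
    while i < len(text):
        for k in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + k] in patterns:
                tokens.append(text[i:i + k])
                i += k
                break
    return tokens


def huffman_code(tokens):
    """Build a prefix-free Huffman code over token frequencies."""
    freq = Counter(tokens)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(f, next(tie), {tok: ""}) for tok, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in c1.items()}
        merged.update({t: "1" + c for t, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]


def encode(text, min_support=2, max_len=4):
    patterns = mine_frequent_patterns(text, min_support, max_len)
    tokens = tokenize(text, patterns)
    code = huffman_code(tokens)
    return "".join(code[t] for t in tokens), code


def decode(bits, code):
    inverse = {c: t for t, c in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # prefix-free code: first match is a full codeword
            out.append(inverse[buf])
            buf = ""
    return "".join(out)


if __name__ == "__main__":
    sample = "abracadabra abracadabra"
    bits, code = encode(sample)
    print(len(bits), "bits for", len(sample), "characters")
    print(decode(bits, code) == sample)
```

The sketch counts only the encoded bit string; a real compressor would also have to store or transmit the code table, which is one reason the number of patterns admitted into the encoding matters.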
