Mining Compressing Sequential Patterns

Compression-based pattern mining has been successfully applied to many data mining tasks. We propose an approach based on the minimum description length principle to extract sequential patterns that compress a database of sequences well. We show that mining compressing patterns is NP-hard and belongs to the class of inapproximable problems. We propose two heuristic algorithms for mining compressing patterns. The first uses a two-phase approach similar to Krimp for itemset data. To overcome the performance cost of the required candidate generation, we propose GoKrimp, an effective greedy algorithm that mines compressing patterns directly. We conduct an empirical study on six real-life datasets to compare the proposed algorithms by run time, compressibility, and classification accuracy, using the patterns found as features for SVM classifiers.
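To make the idea of "patterns that compress" concrete, the following is a minimal sketch of greedy selection under a two-part description length. It is not the encoding or the algorithm of the paper: it assumes a toy cost model (one unit per symbol), restricts patterns to adjacent symbol pairs without gaps, and all function names (`description_length`, `replace_pair`, `greedy_compress`, `decode`) are hypothetical names chosen for illustration.

```python
def description_length(sequences, dictionary):
    """Toy two-part cost (an assumption, not the paper's encoding):
    symbols in the encoded data plus symbols needed to store the dictionary."""
    data_cost = sum(len(seq) for seq in sequences)
    # Each entry costs its pattern length plus one symbol for its code word.
    model_cost = sum(len(pattern) + 1 for pattern in dictionary.values())
    return data_cost + model_cost


def replace_pair(seq, pair, code):
    """Replace non-overlapping occurrences of `pair`, scanning left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(code)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def greedy_compress(sequences):
    """Repeatedly add the adjacent pair that lowers the total cost the most;
    stop as soon as no pair gives a strict improvement."""
    encoded = [list(s) for s in sequences]
    dictionary = {}
    while True:
        current = description_length(encoded, dictionary)
        best = None
        pairs = {p for seq in encoded for p in zip(seq, seq[1:])}
        for pair in sorted(pairs):
            code = f"<P{len(dictionary)}>"  # fresh code word for this pattern
            trial_seqs = [replace_pair(s, pair, code) for s in encoded]
            trial_dict = {**dictionary, code: pair}
            cost = description_length(trial_seqs, trial_dict)
            if cost < current and (best is None or cost < best[0]):
                best = (cost, trial_seqs, trial_dict)
        if best is None:
            return encoded, dictionary
        _, encoded, dictionary = best


def decode(seq, dictionary):
    """Expand code words recursively to recover the original symbols."""
    out = []
    for sym in seq:
        if sym in dictionary:
            out.extend(decode(dictionary[sym], dictionary))
        else:
            out.append(sym)
    return out


if __name__ == "__main__":
    data = ["abababababab", "abcabc"]
    encoded, patterns = greedy_compress(data)
    print(patterns)
    print(description_length([list(s) for s in data], {}),
          description_length(encoded, patterns))
```

The stopping rule reflects the trade-off at the heart of the minimum description length principle: a pattern is kept only if the savings in the encoded data outweigh the cost of storing the pattern itself.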
