Mining Compressing Sequential Patterns

Pattern mining based on data compression has been successfully applied in many data mining tasks. For itemset data, the Krimp algorithm, based on the minimum description length (MDL) principle, was shown to be very effective in solving the redundancy issue in descriptive pattern mining. However, for sequence data, the redundancy issue of the set of frequent sequential patterns is not fully addressed in the literature. In this article, we study MDL-based algorithms for mining non-redundant sets of sequential patterns from a sequence database. First, we propose an encoding scheme for compressing sequence data with sequential patterns. Second, we formulate the problem of mining the most compressing sequential patterns from a sequence database. We show that this problem is intractable and belongs to the class of inapproximable problems. Therefore, we propose two heuristic algorithms. The first uses a two-phase approach similar to Krimp for itemset data. To overcome performance issues in candidate generation, we also propose GoKrimp, an algorithm that directly mines compressing patterns by greedily extending a pattern until adding the extension to the dictionary yields no additional compression benefit. Since checking the additional compression benefit of an extension is computationally expensive, we propose a dependency test that chooses only related events for extending a given pattern. This technique improves the efficiency of the GoKrimp algorithm significantly while preserving the quality of the set of patterns. We conduct an empirical study on eight datasets to show the effectiveness of our approach in comparison to state-of-the-art algorithms in terms of interpretability of the extracted patterns, run time, compression ratio, and classification accuracy using the discovered patterns as features for different classifiers. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2013
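
The greedy extension idea described above can be illustrated with a toy sketch. This is not the article's encoding or the GoKrimp implementation: it assumes a simplified cost model that counts symbols instead of bits, matches patterns only as contiguous runs (no gaps), and omits the dependency test. The names `encoded_size` and `greedy_extend` are hypothetical and introduced only for this illustration.

```python
from typing import List, Sequence, Tuple

Pattern = Tuple[str, ...]


def encoded_size(db: List[Sequence[str]], pattern: Pattern) -> int:
    """Toy description length in symbols (an assumption, not the paper's code lengths).

    Cost = symbols needed to store the pattern in the dictionary
         + one symbol per pattern occurrence or uncovered event in the data.
    """
    k = len(pattern)
    if k < 2:
        # A single event gives no compression: the data is stored as-is.
        return sum(len(seq) for seq in db)
    size = k  # the pattern is written once into the dictionary
    for seq in db:
        i = 0
        while i < len(seq):
            if tuple(seq[i:i + k]) == pattern:
                size += 1  # one pointer replaces k events
                i += k
            else:
                size += 1  # event stored literally
                i += 1
    return size


def greedy_extend(db: List[Sequence[str]], seed: str) -> Tuple[Pattern, int]:
    """Extend `seed` one event at a time while the toy encoded size shrinks."""
    alphabet = sorted({event for seq in db for event in seq})
    pattern: Pattern = (seed,)
    best = encoded_size(db, pattern)
    while alphabet:
        # Evaluate every one-event extension and keep the cheapest.
        size, candidate = min(
            (encoded_size(db, pattern + (event,)), pattern + (event,))
            for event in alphabet
        )
        if size >= best:
            break  # no extension adds compression benefit: stop
        pattern, best = candidate, size
    return pattern, best


if __name__ == "__main__":
    database = ["abcabcabc", "abcxabc"]
    print(greedy_extend(database, "a"))
```

In this sketch every event in the alphabet is tried as an extension at each step, which is the expensive check the abstract refers to; a dependency test would restrict the candidate events before the sizes are computed.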
