Prefix-projection global constraint and top-k approach for sequential pattern mining

Sequential pattern mining (SPM) is an important data mining problem with broad applications. SPM is hard because of the huge number of intermediate subsequences that must be considered. State-of-the-art approaches for SPM (e.g., PrefixSpan; Pei et al. 2001) are largely based on the pattern-growth approach: for each frequent prefix subsequence, only its related suffix subsequences need to be considered, and the database is recursively projected into smaller ones. Many authors have promoted the use of constraints to focus on the most promising patterns according to the interests of the end user. The top-k SPM problem is also used to cope with the difficulty of setting a threshold and to control the number of solutions. State-of-the-art methods developed for SPM and top-k SPM, though efficient, are locked into a rather rigid search strategy and lack declarativity and flexibility. Indeed, adding new constraints usually amounts to changing the data structures used in the core of the algorithm, and combining these new constraints often requires new developments. Recent works (e.g., Kemmar et al. 2014; Négrevergne and Guns 2015) have investigated the use of Constraint Programming (CP) for SPM. However, despite their declarative nature, all these models scale poorly because of the huge size of their constraint networks. To address this issue, we propose the Prefix-Projection global constraint, which encapsulates both the subsequence relation and the frequency constraint. Its filtering algorithm relies on the principle of projected databases, which makes it possible to keep in each variable's domain only the values that lead to a frequent pattern in the database. The Prefix-Projection filtering algorithm enforces domain consistency on the variable succeeding the current frequent prefix in polynomial time.
This global constraint also allows for a straightforward implementation of additional constraints such as size, item membership, regular expressions, and any combination of them. Experimental results show that our approach clearly outperforms existing CP approaches and competes well with state-of-the-art methods on large datasets for mining frequent sequential patterns, sequential patterns under various constraints, and top-k sequential patterns. Our approach also scales better than existing CP methods.
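The projected-database principle mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the paper's filtering algorithm, which runs inside a CP solver; it only shows how projecting the database on a prefix yields the items that can still extend that prefix into a frequent pattern. The function names (`project`, `frequent_items`, `mine`) and the toy database `sdb` are hypothetical and chosen for illustration; sequences are assumed to be plain lists of single items.

```python
from collections import Counter

def project(db, item):
    """Keep, for each sequence containing `item`, the suffix that
    follows its first occurrence (the projected database)."""
    return [seq[seq.index(item) + 1:] for seq in db if item in seq]

def frequent_items(db, minsup):
    """Items occurring in at least `minsup` sequences of `db`.
    Each sequence counts an item at most once."""
    counts = Counter()
    for seq in db:
        counts.update(set(seq))
    return {item for item, n in counts.items() if n >= minsup}

def mine(db, minsup, prefix=()):
    """Enumerate frequent sequential patterns by pattern growth:
    extend the prefix only with items frequent in the projection."""
    patterns = []
    for item in sorted(frequent_items(db, minsup)):
        extended = prefix + (item,)
        patterns.append(extended)
        patterns.extend(mine(project(db, item), minsup, extended))
    return patterns

# Toy sequence database (illustrative data, not from the paper).
sdb = [list("ABCBC"), list("BABC"), list("AB"), list("BCD")]

# Items that may follow the prefix <A> with support >= 2.
candidates_after_a = frequent_items(project(sdb, "A"), 2)
all_patterns = mine(sdb, 2)
```

In a CP setting, the set computed by `frequent_items` on the projected database plays the role of the values kept in the domain of the variable that follows the current prefix; every other item would be pruned.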

[1] Jian Pei, et al. Mining Access Patterns Efficiently from Web Logs, 2000, PAKDD.

[2] Guizhen Yang, et al. Computational aspects of mining maximal frequent patterns, 2006, Theor. Comput. Sci.

[3] Anton Dries, et al. Dominance Programming for Itemset Mining, 2013, 2013 IEEE 13th International Conference on Data Mining.

[4] Tias Guns, et al. Constraint-Based Sequence Mining Using Constraint Programming, 2015, CPAIOR.

[5] Patrice Boizumault, et al. A Global Constraint for Mining Sequential Patterns with GAP Constraint, 2016, CPAIOR.

[6] Johannes Gehrke, et al. Sequential PAttern mining using a bitmap representation, 2002, KDD.

[7] Mohammed J. Zaki. Sequence mining in categorical domains: incorporating constraints, 2000, CIKM '00.

[8] Toby Walsh, et al. Handbook of Constraint Programming, 2006, Handbook of Constraint Programming.

[9] Chedy Raïssi, et al. Mining Dominant Patterns in the Sky, 2011, 2011 IEEE 11th International Conference on Data Mining.

[10] Patrice Boizumault, et al. PREFIX-PROJECTION Global Constraint for Sequential Pattern Mining, 2015, CP.

[11] Patrice Boizumault, et al. Mining (Soft-) Skypatterns Using Dynamic CSP, 2014, CPAIOR.

[12] Nicolas Beldiceanu, et al. Introducing global constraints in CHIP, 1994.

[13] Ramakrishnan Srikant, et al. Mining sequential patterns, 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[14] Jiawei Han, et al. TSP: Mining top-k closed sequential patterns, 2004, Knowledge and Information Systems.

[15] Qiming Chen, et al. PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, 2001, Proceedings 17th International Conference on Data Engineering.

[16] Jian Pei, et al. Mining sequential patterns with constraints in large databases, 2002, CIKM '02.

[17] Xifeng Yan, et al. CloSpan: Mining Closed Sequential Patterns in Large Datasets, 2003, SDM.

[18] Marie-Christine Jaulent, et al. Sequential pattern mining to discover relations between genes and rare diseases, 2012, 2012 25th IEEE International Symposium on Computer-Based Medical Systems (CBMS).

[19] Ramakrishnan Srikant, et al. Mining Sequential Patterns: Generalizations and Performance Improvements, 1996, EDBT.

[20] Antonio Gomariz, et al. TKS: Efficient Mining of Top-K Sequential Patterns, 2013, ADMA.

[21] Unil Yun, et al. Mining top-k frequent patterns with combination reducing techniques, 2013, Applied Intelligence.

[22] Antonio Gomariz, et al. SPMF: a Java open-source pattern mining library, 2014, J. Mach. Learn. Res.

[23] Luc De Raedt, et al. Constraint-Based Pattern Set Mining, 2007, SDM.

[24] Mohammed J. Zaki, et al. SPADE: An Efficient Algorithm for Mining Frequent Sequences, 2004, Machine Learning.

[25] Bart Goethals, et al. Sequence Mining Automata: A New Technique for Mining Frequent Sequences under Regular Expressions, 2008, 2008 Eighth IEEE International Conference on Data Mining.

[26] Jiawei Han, et al. TFP: an efficient algorithm for mining top-k frequent closed itemsets, 2005, IEEE Transactions on Knowledge and Data Engineering.

[27] Gilles Pesant, et al. A Regular Language Membership Constraint for Finite Sequences of Variables, 2004, CP.

[28] Patrice Boizumault, et al. Mining Relevant Sequence Patterns with CP-Based Framework, 2014, 2014 IEEE 26th International Conference on Tools with Artificial Intelligence.

[29] Emmanuel Coquery, et al. A SAT-Based Approach for Discovering Frequent, Closed and Maximal Patterns in a Sequence, 2012, ECAI.

[30] Jiawei Han, et al. TSP: mining top-K closed sequential patterns, 2003, Third IEEE International Conference on Data Mining.

[31] Kyuseok Shim, et al. Mining Sequential Patterns with Regular Expression Constraints, 2002, IEEE Trans. Knowl. Data Eng.

[32] Jiawei Han, et al. Mining top-k frequent closed patterns without minimum support, 2002, 2002 IEEE International Conference on Data Mining, 2002. Proceedings.

[33] Jiawei Han, et al. BIDE: efficient mining of frequent closed sequences, 2004, Proceedings. 20th International Conference on Data Engineering.

[34] Ada Wai-Chee Fu, et al. Mining frequent itemsets without support threshold: with and without item constraints, 2004, IEEE Transactions on Knowledge and Data Engineering.

[35] Jean-Philippe Métivier, et al. A Constraint Programming Approach for Mining Sequential Patterns in a Sequence Database, 2013, ArXiv.

[36] Umeshwar Dayal, et al. PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth, 2001, ICDE 2001.

[37] Ming Li, et al. Efficient Mining of Gap-Constrained Subsequences and Its Various Applications, 2012, TKDD.

[38] Luc De Raedt, et al. Itemset mining: A constraint programming perspective, 2011, Artif. Intell.

[39] Geoffrey I. Webb, et al. Supervised Descriptive Rule Discovery: A Unifying Survey of Contrast Set, Emerging Pattern and Subgroup Mining, 2009, J. Mach. Learn. Res.