Frequent Closed Sequence Mining without Candidate Maintenance

Previous studies have presented convincing arguments that a frequent pattern mining algorithm should mine only the closed frequent patterns rather than all frequent patterns, because the closed set is more compact yet still complete and can be mined more efficiently. However, most previously developed closed pattern mining algorithms work under the candidate maintenance-and-test paradigm, which is inherently costly in both runtime and space usage when the support threshold is low or the patterns become long. In this paper, we present BIDE, an efficient algorithm for mining frequent closed sequences without candidate maintenance. It adopts a novel sequence closure checking scheme called BI-Directional Extension and prunes the search space more deeply than previous algorithms by using the BackScan pruning method. A thorough performance study with both sparse and dense, real and synthetic data sets has demonstrated that BIDE significantly outperforms the previous algorithm: it consumes one or more orders of magnitude less memory and can be more than an order of magnitude faster. It is also linearly scalable in terms of database size.
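To make the notion of a closed sequence concrete, the following is a minimal Python sketch, not the BIDE procedure itself. It assumes single-item events, a toy four-sequence database, and hypothetical helper names (`is_subsequence`, `support`, `frequent_sequences`, `is_closed`) chosen for illustration. A frequent sequence is treated as closed when no one-item extension at any position (front, middle, or end) has the same support. Because support can only stay the same or drop as a pattern grows, this is equivalent to saying no supersequence has the same support. The check consults only the database, so no set of previously mined candidates has to be kept around. BIDE's BI-Directional Extension scheme and BackScan pruning reach the same decision far more efficiently than this brute-force loop.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, gaps allowed."""
    it = iter(sequence)
    # `item in it` advances the iterator, so order is respected.
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain `pattern`."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def frequent_sequences(database, min_sup):
    """Enumerate all frequent sequences by naive depth-first growth.

    Assumes min_sup >= 1, which guarantees termination.
    """
    items = sorted({item for seq in database for item in seq})
    found = {}

    def grow(prefix):
        for item in items:
            candidate = prefix + (item,)
            sup = support(candidate, database)
            if sup >= min_sup:
                found[candidate] = sup
                grow(candidate)

    grow(())
    return found


def is_closed(pattern, sup, database, items):
    """Closure check against the database only (no stored candidates).

    The pattern is not closed if inserting one item at any position
    yields a sequence with the same support.
    """
    for pos in range(len(pattern) + 1):
        for item in items:
            extended = pattern[:pos] + (item,) + pattern[pos:]
            if support(extended, database) == sup:
                return False
    return True


# Toy database (an assumption for illustration); each character is one event.
database = ["CAABC", "ABCB", "CABC", "ABBCA"]
min_sup = 2
items = sorted(set("".join(database)))

frequent = frequent_sequences(database, min_sup)
closed = {
    "".join(p): s
    for p, s in frequent.items()
    if is_closed(p, s, database, items)
}
print(len(frequent), "frequent sequences,", len(closed), "closed:", closed)
```

On this toy input the sketch finds 17 frequent sequences, of which only 6 are closed, which illustrates why the closed set is a more compact yet complete summary.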
