A binary decision diagram based approach for mining frequent subsequences

Sequential pattern mining is an important problem in data mining. State-of-the-art techniques for mining sequential patterns, such as frequent subsequences, are often based on the pattern-growth approach, which recursively projects conditional databases. Explicitly creating database projections is thought to be a major computational bottleneck, but we show in this paper that it can be beneficial when an appropriate data structure is used. Our technique represents the sequence database as a canonical directed acyclic graph, which can in turn be represented as a binary decision diagram (BDD). We introduce a new type of BDD, the sequence BDD (SeqBDD), and show how it can be used to mine frequent subsequences efficiently. A novel feature of the SeqBDD is its ability to share results between similar intermediate computations and thereby avoid redundant computation. We perform an experimental study comparing the SeqBDD technique with existing pattern-growth techniques that are based on other data structures, such as prefix trees. Our results show that a SeqBDD can be half as large as a prefix tree, especially when many similar sequences exist. In terms of mining time, the SeqBDD technique can be substantially more efficient when the support threshold is low, the number of patterns is large, or the input sequences are long and highly similar.
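
For readers unfamiliar with the pattern-growth approach mentioned above, the following is a minimal sketch of frequent subsequence mining by recursive database projection. It is not the SeqBDD technique: it stores each projected database as a list of (sequence index, offset) pairs rather than as a shared graph, and the function name `mine_frequent_subsequences` and its parameters are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def mine_frequent_subsequences(sequences, min_support):
    """Return {pattern: support} for every non-empty subsequence that
    occurs in at least min_support of the input sequences.

    Illustrative pattern-growth sketch only; not the SeqBDD algorithm.
    """
    results = {}

    def grow(prefix, projected):
        # projected holds (sequence index, start offset) pairs: the suffix of
        # each supporting sequence after the leftmost match of the prefix.
        occurrences = defaultdict(list)
        for seq_id, start in projected:
            seq = sequences[seq_id]
            seen = set()
            for pos in range(start, len(seq)):
                item = seq[pos]
                if item not in seen:
                    # Only the first occurrence matters: it leaves the
                    # longest suffix and counts the sequence once.
                    seen.add(item)
                    occurrences[item].append((seq_id, pos + 1))
        for item, new_projected in occurrences.items():
            if len(new_projected) >= min_support:
                pattern = prefix + (item,)
                results[pattern] = len(new_projected)
                # Recurse on the database projected by the longer pattern.
                grow(pattern, new_projected)

    grow((), [(i, 0) for i in range(len(sequences))])
    return results


patterns = mine_frequent_subsequences(["abc", "acb", "ab"], min_support=2)
```

In this sketch every recursive call recomputes its projection from scratch, even when two different prefixes lead to identical suffix sets. That kind of repeated work is what the abstract says a SeqBDD can avoid by sharing results between similar intermediate computations.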
