Discovering Frequent Closed Partial Orders from Strings

Mining knowledge about ordering from sequence data is an important problem with many applications, such as bioinformatics, Web mining, network management, and intrusion detection. For example, if many customers follow a partial order in their purchases of a series of products, the partial order can be used to predict other related customers' future purchases and develop marketing campaigns. Moreover, some biological sequences (e.g., microarray data) can be clustered based on the partial orders shared by the sequences. Given a set of items, a total order of a subset of items can be represented as a string. A string database is a multiset of strings. In this paper, we identify a novel problem of mining frequent closed partial orders from strings. Frequent closed partial orders capture the nonredundant and interesting ordering information from string databases. Importantly, mining frequent closed partial orders can discover meaningful knowledge that cannot be disclosed by previous data mining techniques. However, the problem of mining frequent closed partial orders is challenging. To tackle the problem, we develop Frecpo (for frequent closed partial order), a practically efficient algorithm for mining the complete set of frequent closed partial orders from large string databases. Several interesting pruning techniques are devised to speed up the search. We report an extensive performance study on both real data sets and synthetic data sets to illustrate the effectiveness and the efficiency of our approach

[1]  Jian Pei,et al.  Mining sequential patterns with constraints in large databases , 2002, CIKM '02.

[2]  H. Mannila,et al.  Discovering all most specific sentences , 2003, TODS.

[3]  Ramakrishnan Srikant,et al.  Mining Sequential Patterns: Generalizations and Performance Improvements , 1996, EDBT.

[4]  Jiawei Han,et al.  BIDE: efficient mining of frequent closed sequences , 2004, Proceedings. 20th International Conference on Data Engineering.

[5]  Gemma C. Garriga,et al.  Summarizing Sequential Data with Closed Partial Orders , 2005, SDM.

[6]  Qiming Chen,et al.  PrefixSpan,: mining sequential patterns efficiently by prefix-projected pattern growth , 2001, Proceedings 17th International Conference on Data Engineering.

[7]  Mohammed J. Zaki,et al.  Efficient algorithms for mining closed itemsets and their lattice structure , 2005, IEEE Transactions on Knowledge and Data Engineering.

[8]  Aristides Gionis,et al.  Fragments of order , 2003, KDD '03.

[9]  Guizhen Yang,et al.  The complexity of mining maximal frequent itemsets and maximal frequent patterns , 2004, KDD.

[10]  Dimitrios Gunopulos,et al.  Mining Process Models from Workflow Logs , 1998, EDBT.

[11]  Arbee L. P. Chen,et al.  An efficient algorithm for mining frequent sequences by a new strategy without support counting , 2004, Proceedings. 20th International Conference on Data Engineering.

[12]  Wei Wang,et al.  OP-cluster: clustering by tendency in high dimensional space , 2003, Third IEEE International Conference on Data Mining.

[13]  Richard M. Karp,et al.  Discovering local structure in gene expression data: the order-preserving submatrix problem , 2002, RECOMB '02.

[14]  Mohammed J. Zaki,et al.  CHARM: An Efficient Algorithm for Closed Itemset Mining , 2002, SDM.

[15]  Eugene L. Lawler,et al.  The recognition of Series Parallel digraphs , 1979, SIAM J. Comput..

[16]  Mohammed J. Zaki,et al.  SPADE: An Efficient Algorithm for Mining Frequent Sequences , 2004, Machine Learning.

[17]  Jian Pei,et al.  CLOSET+: searching for the best strategies for mining frequent closed itemsets , 2003, KDD '03.

[18]  Heikki Mannila,et al.  Levelwise Search and Borders of Theories in Knowledge Discovery , 1997, Data Mining and Knowledge Discovery.

[19]  Heikki Mannila,et al.  Discovery of Frequent Episodes in Event Sequences , 1997, Data Mining and Knowledge Discovery.

[20]  Umeshwar Dayal,et al.  PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth , 2001, ICDE 2001.

[21]  Jinyan Li,et al.  Efficient mining of emerging patterns: discovering trends and differences , 1999, KDD '99.

[22]  Jiawei Han,et al.  TSP: mining top-K closed sequential patterns , 2003, Third IEEE International Conference on Data Mining.

[23]  Heikki Mannila,et al.  Global partial orders from sequential data , 2000, KDD '00.

[24]  Philip S. Yu,et al.  Efficiently mining frequent closed partial orders , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[25]  Xifeng Yan,et al.  CloSpan: Mining Closed Sequential Patterns in Large Datasets , 2003, SDM.

[26]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[27]  Ramakrishnan Srikant,et al.  Mining sequential patterns , 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[28]  Rynson W. H. Lau,et al.  Knowledge and Data Engineering for e-Learning Special Issue of IEEE Transactions on Knowledge and Data Engineering , 2008 .

[29]  Wil M. P. van der Aalst,et al.  Workflow mining: discovering process models from event logs , 2004, IEEE Transactions on Knowledge and Data Engineering.

[30]  Kyuseok Shim,et al.  SPIRIT: Sequential Pattern Mining with Regular Expression Constraints , 1999, VLDB.

[31]  Takashi Washio,et al.  An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data , 2000, PKDD.

[32]  Philip S. Yu,et al.  Mining long sequential patterns in a noisy environment , 2002, SIGMOD '02.

[33]  Nicolas Pasquier,et al.  Discovering Frequent Closed Itemsets for Association Rules , 1999, ICDT.

[34]  Jian Pei,et al.  ApproxMAP: Approximate Mining of Consensus Sequential Patterns , 2003, SDM.

[35]  Johannes Gehrke,et al.  Sequential PAttern mining using a bitmap representation , 2002, KDD.

[36]  Vladimir Gurvich,et al.  On the Complexity of Generating Maximal Frequent and Minimal Infrequent Sets , 2002, STACS.

[37]  Jiawei Han,et al.  CloseGraph: mining closed frequent graph patterns , 2003, KDD '03.