Computational aspects of mining maximal frequent patterns

In this paper we study the complexity-theoretic aspects of mining maximal frequent patterns from the perspective of counting the distinct solutions. We present the first formal proof that counting the maximal frequent itemsets in a database of transactions, given an arbitrary support threshold, is #P-complete, thereby providing theoretical evidence that the problem of mining maximal frequent itemsets is NP-hard. We also extend our complexity analysis to similar data mining problems that deal with complex data structures, such as sequences, trees, and graphs. We investigate several variants of these mining problems in which the patterns of interest are subsequences, subtrees, or subgraphs, and show that the associated problems of counting the maximal frequent patterns are all either #P-complete or #P-hard.
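To make the counting problem concrete, the following is a minimal brute-force sketch in Python. It is not the paper's construction or any published mining algorithm. It assumes the usual definitions: an itemset is frequent if at least `min_support` transactions contain it, and maximal if no proper superset is also frequent. The function names and the toy database are hypothetical, chosen only for illustration. The enumeration visits every non-empty subset of the items, so its running time is exponential in the number of items.

```python
from itertools import combinations


def maximal_frequent_itemsets(transactions, min_support):
    """Return all maximal frequent itemsets by exhaustive enumeration.

    Illustrative only: every non-empty subset of the item universe is
    tested, so the cost grows exponentially with the number of items.
    The empty itemset is ignored in this sketch.
    """
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))

    def support(itemset):
        # Number of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Collect every frequent itemset of size 1 .. len(items).
    frequent = [
        frozenset(candidate)
        for size in range(1, len(items) + 1)
        for candidate in combinations(items, size)
        if support(frozenset(candidate)) >= min_support
    ]

    # Keep only itemsets with no frequent proper superset.
    return [s for s in frequent if not any(s < other for other in frequent)]


def count_maximal_frequent_itemsets(transactions, min_support):
    """Count the maximal frequent itemsets (the quantity studied above)."""
    return len(maximal_frequent_itemsets(transactions, min_support))


if __name__ == "__main__":
    # Hypothetical toy database of five transactions.
    db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
    print(count_maximal_frequent_itemsets(db, min_support=2))
```

On this toy database with a support threshold of 2, the frequent itemsets are the singletons {a}, {b}, {c} and the pairs {a, b}, {a, c}, {b, c}; only the three pairs are maximal, so the count is 3.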
