Ambiguous Frequent Itemset Mining and Polynomial Delay Enumeration

Mining patterns that appear frequently in a database is a fundamental problem in informatics, especially in data mining. When the input database is a collection of subsets of a set of items, called transactions, the problem is called the frequent itemset mining problem, and it has been studied extensively. The items of a frequent itemset appear together in many records, so they can be regarded as a cluster with respect to those records. In this sense, however, the condition that every item appears in every such record is quite strong; we should allow several items to be missing from these records. In this paper, we approach the problem from the viewpoint of algorithm theory and consider a model that can be solved efficiently and may be valuable in practice. We introduce ambiguous frequent itemsets, which allow missing items in their occurrence records. More precisely, for given thresholds θ and σ, an ambiguous frequent itemset P has a transaction set τ with |τ| ≥ σ such that, on average, the transactions in τ include a ratio θ of the items of P. We formulate the problem of enumerating ambiguous frequent itemsets and propose an efficient polynomial delay, polynomial space algorithm. Its practical performance is evaluated by computational experiments. Our algorithm extends naturally to the weighted version of the problem, which generalizes ordinary frequent itemset mining to weighted transaction databases and is equivalent to finding submatrices whose cells have a large average weight. An implementation is available at the author's homepage.
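
The definition above can be illustrated with a small membership test. The sketch below is not the paper's enumeration algorithm; it only checks whether a single itemset P satisfies the condition, reading "include a ratio θ" as "an average inclusion ratio of at least θ". The function name `is_ambiguous_frequent` and the toy database are hypothetical. It relies on one observation: among all transaction sets of size k, the k transactions containing the most items of P maximize the average, and that maximum does not increase as k grows, so it suffices to test the best σ transactions.

```python
from typing import AbstractSet, Hashable, Sequence


def is_ambiguous_frequent(
    itemset: AbstractSet[Hashable],
    transactions: Sequence[AbstractSet[Hashable]],
    theta: float,
    sigma: int,
) -> bool:
    """Check the (assumed) definition: some transaction set tau with
    |tau| >= sigma has average inclusion ratio |t & P| / |P| >= theta."""
    if not itemset or sigma <= 0:
        raise ValueError("itemset must be non-empty and sigma positive")
    if len(transactions) < sigma:
        return False  # no transaction set of size sigma exists
    # Number of items of P contained in each transaction, best first.
    counts = sorted((len(itemset & t) for t in transactions), reverse=True)
    # The top-sigma transactions give the largest possible average,
    # so only that set needs to be tested.
    return sum(counts[:sigma]) >= theta * sigma * len(itemset)


# Hypothetical toy database of four transactions.
db = [{"a", "b", "c"}, {"a", "b"}, {"a", "c", "d"}, {"d"}]
pattern = {"a", "b", "c"}

print(is_ambiguous_frequent(pattern, db, theta=0.75, sigma=3))
print(is_ambiguous_frequent(pattern, db, theta=1.0, sigma=2))
```

With θ = 1 the test reduces to ordinary frequent itemset support, since every one of the σ chosen transactions must then contain all of P.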
