A new classification of datasets for frequent itemsets

The discovery of frequent patterns is a well-known problem in data mining. While many algorithms have been proposed during the last decade, only a few contributions have tried to understand the influence of datasets on the algorithms' behavior. Explaining why certain algorithms are likely to perform very well or very poorly on some datasets is still an open question. In this setting, we describe a thorough experimental study of datasets with respect to frequent itemsets. We study the distribution of frequent itemsets with respect to itemset size, together with the distribution of three concise representations: frequent closed, frequent free and frequent essential itemsets. For each of them, we also study the distribution of their positive and negative borders whenever possible. The main outcome of these experiments is a new classification of datasets that is invariant w.r.t. minsup variations and robust enough to explain the efficiency of several implementations.
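
As a rough illustration of the quantities named above, the following sketch computes, by brute force on a toy dataset, the size distribution of frequent, frequent closed and frequent free itemsets, together with the positive border (maximal frequent itemsets) and negative border (minimal infrequent itemsets) of the frequent itemsets. It is not the experimental setup of the study: the transactions, the `minsup` value and all function names are assumptions made for the example, the standard textbook definitions are used, and essential itemsets are left out.

```python
from collections import Counter
from itertools import combinations

# Toy transaction database and absolute support threshold (both hypothetical).
TRANSACTIONS = [
    frozenset("abc"), frozenset("abc"), frozenset("ab"),
    frozenset("c"), frozenset("abd"),
]
MINSUP = 2
ITEMS = sorted(set().union(*TRANSACTIONS))


def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def all_itemsets():
    """Every non-empty itemset over ITEMS (exponential: toy data only)."""
    for k in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, k):
            yield frozenset(combo)


# Frequent itemsets with their supports.
frequent = {s: support(s) for s in all_itemsets() if support(s) >= MINSUP}


def is_closed(s):
    """Closed: no one-item extension has the same support."""
    return all(support(s | {e}) != frequent[s] for e in ITEMS if e not in s)


def is_free(s):
    """Free: no one-item reduction has the same support.
    The empty set has support len(TRANSACTIONS)."""
    return all(support(s - {e}) != frequent[s] for e in s)


closed = {s for s in frequent if is_closed(s)}
free = {s for s in frequent if is_free(s)}

# Positive border: frequent itemsets with no frequent one-item extension.
positive_border = {
    s for s in frequent
    if all((s | {e}) not in frequent for e in ITEMS if e not in s)
}

# Negative border: infrequent itemsets whose one-item reductions are all
# frequent (the empty set is treated as frequent).
negative_border = {
    s for s in all_itemsets()
    if s not in frequent
    and all(len(s) == 1 or (s - {e}) in frequent for e in s)
}


def size_distribution(itemsets):
    """Map itemset size -> number of itemsets of that size."""
    return dict(sorted(Counter(len(s) for s in itemsets).items()))


if __name__ == "__main__":
    print("frequent:", size_distribution(frequent))
    print("closed:  ", size_distribution(closed))
    print("free:    ", size_distribution(free))
    print("Bd+:", sorted("".join(sorted(s)) for s in positive_border))
    print("Bd-:", sorted("".join(sorted(s)) for s in negative_border))
```

Comparing the three size distributions and the two borders across several `minsup` values is the kind of per-dataset profile the abstract refers to; the brute-force enumeration here is only suitable for a handful of items.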
