Minimum Description Length Principle: Generators Are Preferable to Closed Patterns

The generators and the unique closed pattern of an equivalence class of itemsets share a common set of transactions. The generators are the minimal ones among the equivalent itemsets, while the closed pattern is the maximum one. As a generator is usually smaller than the closed pattern in cardinality, by the Minimum Description Length Principle, the generator is preferable to the closed pattern in inductive inference and classification. To efficiently discover frequent generators from a large dataset, we develop a depth-first algorithm called Gr-growth. The idea is novel in contrast to traditional breadth-first bottom-up generator-mining algorithms. Our extensive performance study shows that Gr-growth is significantly faster (an order or even two orders of magnitudes when the support thresholds are low) than the existing generator mining algorithms. It can be also faster than the state-of-the-art frequent closed itemset mining algorithms such as FPclose and CLOSET+.

[1]  Ming Li,et al.  An Introduction to Kolmogorov Complexity and Its Applications , 1997, Texts in Computer Science.

[2]  Hiroki Arimura,et al.  LCM ver. 2: Efficient Mining Algorithms for Frequent/Closed/Maximal Itemsets , 2004, FIMI.

[3]  Christian Borgelt Efficient Implementations of Apriori and Eclat Christian Borgelt , 2003 .

[4]  Marzena Kryszkiewicz,et al.  Dataless Transitions Between Concise Representations of Frequent Patterns , 2004, Journal of Intelligent Information Systems.

[5]  Bart Goethals,et al.  Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations (FIMI'03) , 2003 .

[6]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[7]  Ming Li,et al.  Applying MDL to Learning Best Model Granularity , 2000, ArXiv.

[8]  Viet Phan Luong,et al.  The Closed Keys Base of Frequent Itemsets , 2002, DaWaK.

[9]  Jian Pei,et al.  CLOSET+: searching for the best strategies for mining frequent closed itemsets , 2003, KDD '03.

[10]  J. Rissanen,et al.  Modeling By Shortest Data Description* , 1978, Autom..

[11]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD 2000.

[12]  Nicolas Pasquier,et al.  Discovering Frequent Closed Itemsets for Association Rules , 1999, ICDT.

[13]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[14]  Mohammed J. Zaki,et al.  CHARM: An Efficient Algorithm for Closed Itemset Mining , 2002, SDM.

[15]  Anthony K. H. Tung,et al.  Carpenter: finding closed patterns in long biological datasets , 2003, KDD '03.

[16]  Mark A. Pitt,et al.  Advances in Minimum Description Length: Theory and Applications , 2005 .

[17]  Gerd Stumme,et al.  Mining frequent patterns with counting inference , 2000, SKDD.

[18]  Gösta Grahne,et al.  Fast algorithms for frequent itemset mining using FP-trees , 2005, IEEE Transactions on Knowledge and Data Engineering.

[19]  Ming Li,et al.  Applying MDL to learn best model granularity , 2000, Artif. Intell..

[20]  Jean-François Boulicaut,et al.  Free-Sets: A Condensed Representation of Boolean Data for the Approximation of Frequency Queries , 2004, Data Mining and Knowledge Discovery.