Fast vertical mining using diffsets

A number of vertical mining algorithms have been proposed recently for association mining, which have shown to be very effective and usually outperform horizontal approaches. The main advantage of the vertical format is support for fast frequency counting via intersection operations on transaction ids (tids) and automatic pruning of irrelevant data. The main problem with these approaches is when intermediate results of vertical tid lists become too large for memory, thus affecting the algorithm scalability.In this paper we present a novel vertical data representation called Diffset, that only keeps track of differences in the tids of a candidate pattern from its generating frequent patterns. We show that diffsets drastically cut down the size of memory required to store intermediate results. We show how diffsets, when incorporated into previous vertical mining methods, increase the performance significantly.

[1]  Daniel M. Yellin An Algorithm for Dynamic Subset and Intersection Testing , 1994, Theor. Comput. Sci..

[2]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[3]  Shamkant B. Navathe,et al.  An Efficient Algorithm for Mining Association Rules in Large Databases , 1995, VLDB.

[4]  Philip S. Yu,et al.  An effective hash-based algorithm for mining association rules , 1995, SIGMOD '95.

[5]  Rakesh Agrawal,et al.  SPRINT: A Scalable Parallel Classifier for Data Mining , 1996, VLDB.

[6]  Heikki Mannila,et al.  Verkamo: Fast Discovery of Association Rules , 1996, KDD 1996.

[7]  Heikki Mannila,et al.  Fast Discovery of Association Rules , 1996, Advances in Knowledge Discovery and Data Mining.

[8]  Dimitrios Gunopulos,et al.  Discovering All Most Specific Sentences by Randomized Algorithms , 1997, ICDT.

[9]  Rajeev Motwani,et al.  Dynamic itemset counting and implication rules for market basket data , 1997, SIGMOD '97.

[10]  Srinivasan Parthasarathy,et al.  New Algorithms for Fast Discovery of Association Rules , 1997, KDD.

[11]  MotwaniRajeev,et al.  Dynamic itemset counting and implication rules for market basket data , 1997 .

[12]  Zvi M. Kedem,et al.  Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set , 1998, EDBT.

[13]  Jun-Lin Lin,et al.  Mining association rules: anti-skew algorithms , 1998, Proceedings 14th International Conference on Data Engineering.

[14]  Sunita Sarawagi,et al.  Integrating association rule mining with relational database systems: alternatives and implications , 1998, SIGMOD '98.

[15]  Roberto J. Bayardo,et al.  Efficiently mining long patterns from databases , 1998, SIGMOD '98.

[16]  Nicolas Pasquier,et al.  Discovering Frequent Closed Itemsets for Association Rules , 1999, ICDT.

[17]  Nandit Soparkar,et al.  Data organization and access for efficient data mining , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[18]  Mohammed J. Zaki Scalable Algorithms for Association Mining , 2000, IEEE Trans. Knowl. Data Eng..

[19]  Mohammed J. Zaki Generating non-redundant association rules , 2000, KDD '00.

[20]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[21]  Jian Pei,et al.  CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets , 2000, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

[22]  Jiawei Han,et al.  Data Mining: Concepts and Techniques , 2000 .

[23]  Devavrat Shah,et al.  Turbo-charging vertical mining of large databases , 2000, SIGMOD '00.

[24]  Ramesh C Agarwal,et al.  Depth first generation of long patterns , 2000, KDD '00.

[25]  Johannes Gehrke,et al.  MAFIA: a maximal frequent itemset algorithm for transactional databases , 2001, Proceedings 17th International Conference on Data Engineering.

[26]  Mohammed J. Zaki,et al.  Efficiently mining maximal frequent itemsets , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[27]  Mohammed J. Zaki,et al.  CHARM: An Efficient Algorithm for Closed Itemset Mining , 2002, SDM.

[28]  Johannes Gehrke,et al.  Sequential PAttern mining using a bitmap representation , 2002, KDD.