Paper information - Advances in frequent itemset mining implementations: report on FIMI'03

1. WHY ORGANIZE FIMI?

Since the introduction of association rule mining in 1993 by Agrawal, Imielinski, and Swami [3], the frequent itemset mining (FIM) task has received a great deal of attention. Within the last decade, a phenomenal number of algorithms have been developed for mining all [3; 5; 20; 4; 27; 24; 29; 34; 10; 22; 19; 32], closed [25; 6; 12; 26; 30; 23; 31; 28; 33], and maximal frequent itemsets [18; 21; 7; 2; 1; 11; 36; 16; 17]. Every new paper claims to run faster than previously existing algorithms, based on experimental testing that is often quite limited in scope, since many of the original algorithms are not available due to intellectual property and copyright issues. Zheng, Kohavi, and Mason [35] observed that the performance of several of these algorithms is not always as claimed by their authors when tested on different datasets. Also, from personal experience, we noticed that even different implementations of the same algorithm could behave quite differently for various datasets and parameters.

Given this proliferation of FIM algorithms, and the sometimes contradictory claims, there is a pressing need to benchmark, characterize, and understand the algorithmic performance space. We would like to understand why and under what conditions one algorithm outperforms another. This means testing the methods for a wide variety of parameters, and on different datasets spanning dense and sparse, real and synthetic, small and large, and so on.

Given the experimental, algorithmic nature of FIM (and of most of data mining in general), it is crucial that other researchers be able to independently verify the claims made in a new paper. Unfortunately, the FIM community (with few exceptions) has a very poor track record in this regard. Many new algorithms are not available even as an executable, let alone as source code. How many times have we heard “this is proprietary software, and not available”? This is not the way other sciences work.
Independent verifiability is the hallmark of sciences like physics, chemistry, biology, and so on. One may argue that the nature of the research is different: they have detailed experimental procedures that can be replicated, while we have algorithms, and there is more than one way to code an algorithm. However, a good example to emulate is the bioinformatics community, which has espoused the open-source paradigm with more alacrity than we have. It is quite common for journals and conferences in bioinformatics to require that software be available. For example, here is a direct quote from the journal Bioinformatics (http://bioinformatics.oupjournals.org/):

Bart Goethals | Mohammed J. Zaki

[1] Zvi M. Kedem, et al. Pincer-Search: A New Algorithm for Discovering the Maximum Frequent Set, 1998, EDBT.

[2] Mohammed J. Zaki. Scalable Algorithms for Association Mining, 2000, IEEE Trans. Knowl. Data Eng.

[3] Anthony K. H. Tung, et al. Carpenter: finding closed patterns in long biological datasets, 2003, KDD '03.

[4] Jian Pei, et al. Mining frequent patterns without candidate generation, 2000, SIGMOD '00.

[5] Mohammed J. Zaki, et al. Fast vertical mining using diffsets, 2003, KDD '03.

[6] Gerd Stumme, et al. Mining frequent patterns with counting inference, 2000, SKDD.

[7] Tom Brijs, et al. Profiling high frequency accident locations using association rules, 2002.

[8] Rajeev Motwani, et al. Dynamic itemset counting and implication rules for market basket data, 1997, SIGMOD '97.

[9] Ferenc Bodon, et al. A fast APRIORI implementation, 2003, FIMI.

[10] Ramesh C. Agarwal, et al. Depth first generation of long patterns, 2000, KDD '00.

[11] Ramakrishnan Srikant, et al. Fast Algorithms for Mining Association Rules in Large Databases, 1994, VLDB.

[12] Mohammed J. Zaki, et al. Efficiently mining maximal frequent itemsets, 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[13] Philip S. Yu, et al. An effective hash-based algorithm for mining association rules, 1995, SIGMOD '95.

[14] Dan A. Simovici, et al. Galois Connections and Data Mining, 2000, J. Univers. Comput. Sci.

[15] Johannes Gehrke, et al. MAFIA: a maximal frequent itemset algorithm for transactional databases, 2001, Proceedings 17th International Conference on Data Engineering.

[16] Geert Wets, et al. Using association rules for product assortment decisions: a case study, 1999, KDD '99.

[17] Jian Pei, et al. CLOSET+: searching for the best strategies for mining frequent closed itemsets, 2003, KDD '03.

[18] Tomasz Imielinski, et al. Mining association rules between sets of items in large databases, 1993, SIGMOD Conference.

[19] Hannu Toivonen, et al. Sampling Large Databases for Association Rules, 1996, VLDB.

[20] Dimitrios Gunopulos, et al. Discovering All Most Specific Sentences by Randomized Algorithms, 1997, ICDT.

[21] Nicolas Pasquier, et al. Discovering Frequent Closed Itemsets for Association Rules, 1999, ICDT.

[22] Mohammed J. Zaki, et al. CHARM: An Efficient Algorithm for Closed Itemset Mining, 2002, SDM.

[23] Heikki Mannila, et al. Fast Discovery of Association Rules, 1996, Advances in Knowledge Discovery and Data Mining.

[24] Bart Goethals, et al. Efficient frequent pattern mining, 2002.

[25] Shamkant B. Navathe, et al. An Efficient Algorithm for Mining Association Rules in Large Databases, 1995, VLDB.

[26] Roberto J. Bayardo, et al. Efficiently mining long patterns from databases, 1998, SIGMOD '98.

[27] Rakesh Agrawal, et al. Fast Algorithms for Mining Association Rules, 1994, VLDB.

[28] Devavrat Shah, et al. Turbo-charging vertical mining of large databases, 2000, SIGMOD '00.

[29] G. Grahne, et al. High Performance Mining of Maximal Frequent Itemsets, 2003.

[30] Arun N. Swami, et al. Set-oriented mining for association rules in relational databases, 1995, Proceedings of the Eleventh International Conference on Data Engineering.

[31] Ron Kohavi, et al. Real world performance of association rule algorithms, 2001, KDD '01.

[32] Srinivasan Parthasarathy, et al. New Algorithms for Fast Discovery of Association Rules, 1997, KDD.

[33] Charu C. Aggarwal, et al. Towards long pattern generation in dense databases, 2001, SKDD.

[34] Wesley W. Chu, et al. SmartMiner: a depth first algorithm guided by tail information for mining maximal frequent itemsets, 2002, Proceedings 2002 IEEE International Conference on Data Mining.

[35] Jian Pei, et al. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets, 2000, ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery.

[36] Jun-Lin Lin, et al. Mining association rules: anti-skew algorithms, 1998, Proceedings 14th International Conference on Data Engineering.