Paper Information - Parallel Mining of Top-K Frequent Itemsets in Very Large Text Database


Frequent itemset mining is a common and useful task in data mining, but most current mining algorithms cannot be applied to very large text databases. In this paper, we propose parTFI, a novel and efficient parallel algorithm that finds the top-k frequent itemsets of a specified minimum length in very large text databases. Based on a simple data structure, H-struct, parTFI uses a novel logical vertical data partition technique to mine top-k frequent itemsets on each mining server in parallel. Our performance study shows that, when processing very large sparse text databases, parTFI outperforms Apriori and FP-growth, two efficient frequent itemset mining algorithms, even when both run with a better-tuned min_support. Furthermore, by creating the H-struct dynamically, parTFI can handle huge datasets that most other algorithms cannot process.
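
To make the task concrete, the following is a minimal single-process sketch of top-k frequent itemset mining with a minimum itemset length. It is not parTFI: it uses neither H-struct nor any parallel partitioning, and the function name `top_k_frequent_itemsets` and its parameters are hypothetical. "Vertical" here only means the common item-to-transaction-ID layout, which may differ from the paper's logical vertical data partition. The sketch shows how the k-th best support found so far can serve as a rising minimum support, so that no min_support has to be tuned in advance.

```python
import heapq
from typing import Dict, List, Set, Tuple


def top_k_frequent_itemsets(
    transactions: List[Set[str]], k: int, min_len: int
) -> List[Tuple[int, Tuple[str, ...]]]:
    """Return up to k (support, itemset) pairs with at least min_len items.

    Illustrative only; ties at the k-th support are broken by search order.
    """
    if k <= 0:
        return []

    # Vertical layout: each item maps to the ids of the transactions containing it.
    tidlists: Dict[str, Set[int]] = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidlists.setdefault(item, set()).add(tid)

    items = sorted(tidlists)
    heap: List[Tuple[int, Tuple[str, ...]]] = []  # min-heap keyed on support

    def threshold() -> int:
        # Once k itemsets are held, the smallest kept support is the bar to beat.
        return heap[0][0] if len(heap) == k else 1

    def extend(prefix: Tuple[str, ...], tids: Set[int], start: int) -> None:
        for i in range(start, len(items)):
            item_tids = tidlists[items[i]]
            new_tids = tids & item_tids if prefix else item_tids
            support = len(new_tids)
            if support < threshold():
                continue  # supersets cannot have higher support, so prune here
            itemset = prefix + (items[i],)
            if len(itemset) >= min_len:
                if len(heap) < k:
                    heapq.heappush(heap, (support, itemset))
                elif support > heap[0][0]:
                    heapq.heapreplace(heap, (support, itemset))
            extend(itemset, new_tids, i + 1)

    extend((), set(), 0)
    # Highest support first; equal supports are ordered by itemset.
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))
```

For example, with the five transactions `{a, b, c}`, `{a, b}`, `{a, b, d}`, `{a, c}`, `{b, d}`, calling the function with `k=1` and `min_len=2` returns `[(3, ('a', 'b'))]`, since `{a, b}` is the only pair that occurs in three transactions.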

[1] Mohammed J. Zaki, et al. CHARM: An Efficient Algorithm for Closed Itemset Mining, 2002, SDM.

[2] Jian Pei, et al. Mining frequent patterns without candidate generation, 2000, SIGMOD 2000.

[3] Osmar R. Zaïane, et al. Text document categorization by term association, 2002, IEEE International Conference on Data Mining.

[4] Martin Ester, et al. Frequent term-based text clustering, 2002, KDD.

[5] Ramakrishnan Srikant, et al. Fast Algorithms for Mining Association Rules in Large Databases, 1994, VLDB.

[6] Mohammed J. Zaki. Parallel and distributed association mining: a survey, 1999, IEEE Concurr.

[7] Ramakrishnan Srikant, et al. Fast algorithms for mining association rules, 1998, VLDB 1998.

[8] Rakesh Agrawal, et al. Parallel Mining of Association Rules, 1996, IEEE Trans. Knowl. Data Eng.

[9] Hongjun Lu, et al. H-mine: hyper-structure mining of frequent patterns in large databases, 2001, IEEE International Conference on Data Mining.

[10] Ron Kohavi, et al. Real world performance of association rule algorithms, 2001, KDD '01.