Parallel Mining of Top-K Frequent Itemsets in Very Large Text Database

Frequent itemset mining is a common and useful task in data mining, but most current mining algorithms cannot be applied to very large text databases. In this paper, we propose parTFI, a novel and efficient parallel algorithm that finds the top-k frequent itemsets of a specified minimum length in very large text databases. Based on a simple data structure, H-struct, parTFI uses a novel logical vertical data partitioning technique to mine top-k frequent itemsets on each mining server in parallel. Our performance study shows that, when processing very large sparse text databases, parTFI outperforms Apriori and FP-growth, two efficient frequent itemset mining algorithms, even when both run with a better-tuned min_support. Furthermore, by creating the H-struct dynamically, parTFI can handle huge datasets that most other algorithms cannot process.
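To make the mining task concrete, the following is a minimal brute-force sketch of what "top-k frequent itemsets with a specified minimum length" means when each document is treated as a set of terms. It is not parTFI: it uses no H-struct, no vertical partitioning, and no parallelism, and it only suits toy inputs. The function name `top_k_frequent_itemsets`, its parameters, and the sample documents are hypothetical and chosen for illustration only.

```python
from collections import Counter
from itertools import combinations


def top_k_frequent_itemsets(transactions, k, min_len):
    """Return the k itemsets of length >= min_len with the highest support.

    Illustrative brute force only (exponential in transaction length);
    this is NOT the parTFI algorithm described in the abstract.
    """
    counts = Counter()
    for transaction in transactions:
        # Treat each document as a set of distinct terms, in a fixed order.
        items = sorted(set(transaction))
        for size in range(min_len, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    # Rank by support (descending); break ties lexicographically so the
    # result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Hypothetical toy "text database": each document is a list of terms.
docs = [
    ["data", "mining", "text"],
    ["data", "mining"],
    ["data", "text"],
    ["mining", "text", "parallel"],
]

result = top_k_frequent_itemsets(docs, k=3, min_len=2)
for itemset, support in result:
    print(itemset, support)
```

Note that no min_support threshold is supplied: the caller specifies k and the minimum itemset length, and the support cutoff is implied by the k-th ranked itemset. How ties at the k-th position are handled is a design choice; this sketch simply truncates after a deterministic sort.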
