Tidset-Based Parallel FP-tree Algorithm for the Frequent Pattern Mining Problem on PC Clusters

Mining association rules from a transaction-oriented database is a well-studied problem in data mining. Frequent patterns are essential for generating association rules, time series analysis, classification, and related tasks. Algorithms for frequent pattern mining fall into two categories: the generate-and-test approach (Apriori-like) and the pattern growth approach (FP-tree). Many recent methods are based on an FP-tree rather than on Apriori-like algorithms, because Apriori-like algorithms must scan the database many times. However, even pattern growth methods take a long time to execute when the database is large or the given minimum support is low. Parallel-distributed computing is a good strategy for solving this problem. Some parallel algorithms have been proposed, but their execution time increases rapidly as the database grows or when the given minimum support threshold is small. In this study, an efficient parallel-distributed mining algorithm based on an FP-tree structure, the Tidset-based Parallel FP-tree (TPFP-tree), is proposed. To exchange transactions efficiently between nodes, a transaction identification set (Tidset) is used to select transactions directly, without scanning the database. The algorithm was verified on a Linux cluster with 16 computing nodes and compared with a PFP-tree algorithm. Datasets generated by IBM's Quest Synthetic Data Generator were used to evaluate the performance of the algorithms. The experimental results showed that the TPFP-tree algorithm can reduce the execution time as the database grows. Moreover, it showed better scalability than the PFP-tree.
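
The Tidset idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `build_tidsets` and `select_transactions` and the toy data are hypothetical, and the sketch covers only the lookup step on a single node, not the FP-tree construction or the exchange between cluster nodes. It assumes a Tidset maps each item to the set of IDs of the transactions that contain it, so the transactions for a given item can be picked by ID instead of by rescanning the database.

```python
from collections import defaultdict


def build_tidsets(transactions):
    """Map each item to the set of transaction IDs (tids) that contain it.

    Built in one pass over the local database; `transactions` is a list of
    item lists, and the list index serves as the tid (an assumption here).
    """
    tidsets = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets[item].add(tid)
    return dict(tidsets)


def select_transactions(transactions, tidsets, item):
    """Fetch the transactions containing `item` by tid lookup, without a scan."""
    return [transactions[tid] for tid in sorted(tidsets.get(item, ()))]


# Hypothetical toy database held by one node.
transactions = [
    ["a", "b", "c"],
    ["b", "d"],
    ["a", "c", "d"],
    ["c"],
]
tidsets = build_tidsets(transactions)

# Transactions that would be picked for the node responsible for item "a".
to_send = select_transactions(transactions, tidsets, "a")
```

In this sketch the cost of selecting transactions for an item depends on the size of that item's Tidset rather than on the size of the whole database, which is the property the abstract relies on.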
