Toward boosting distributed association rule mining by data de-clustering

Existing parallel algorithms for association rule mining either incur a large inter-site communication cost or require a large amount of space to maintain the local support counts of many candidate sets. This study proposes a de-clustering approach for distributed architectures that eliminates the inter-site communication cost for most of the influential association rule mining algorithms. To de-cluster the database into similar partitions, an efficient algorithm is developed that approximates the shortest spanning path (SSP) linking the transaction data together. The resulting SSP is then used to de-cluster the transaction data evenly into subgroups. The proposed approach guarantees that all subgroups are similar to each other and to the original group. Experimental results show that data size and the number of items are the only two factors that determine the performance of de-clustering. Additionally, with this approach, most of the influential association rule mining algorithms can be implemented in a distributed architecture to obtain a drastic increase in speed without losing any frequent itemsets. Furthermore, the data distribution in each de-clustered partition is almost the same as that of a single site, which implies that the proposed approach can also be regarded as a sampling method for distributed association rule mining. Finally, the experimental results show that mining results that were originally inadequate can be improved to an almost perfect level.
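The two steps described above, linking transactions along an approximate shortest spanning path and then dealing them evenly into subgroups, can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not specify the distance measure or the SSP approximation, so the Jaccard distance, the nearest-neighbour heuristic, the round-robin assignment, and all function names below are assumptions made for illustration only.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def jaccard_distance(a: Transaction, b: Transaction) -> float:
    """Assumed dissimilarity between two transactions (1 - Jaccard similarity)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def greedy_spanning_path(transactions: List[Transaction]) -> List[int]:
    """Approximate a short spanning path with a nearest-neighbour heuristic.

    Starts at index 0 and repeatedly moves to the closest unvisited
    transaction; ties are broken by the lower index. This is a stand-in
    for the SSP approximation, which the abstract does not detail.
    """
    if not transactions:
        return []
    path = [0]
    unvisited = set(range(1, len(transactions)))
    while unvisited:
        last = transactions[path[-1]]
        nxt = min(
            unvisited,
            key=lambda i: (jaccard_distance(last, transactions[i]), i),
        )
        path.append(nxt)
        unvisited.remove(nxt)
    return path


def decluster(transactions: List[Transaction], k: int) -> List[List[Transaction]]:
    """Deal transactions round-robin along the path into k subgroups.

    Neighbours on the path are similar, so sending consecutive
    transactions to different subgroups spreads each kind of
    transaction across all subgroups.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    path = greedy_spanning_path(transactions)
    parts: List[List[Transaction]] = [[] for _ in range(k)]
    for position, index in enumerate(path):
        parts[position % k].append(transactions[index])
    return parts
```

For instance, given the four transactions {a, b}, {x, y}, {a, b, c}, and {x, y, z} with k = 2, the heuristic visits the two a/b-style transactions consecutively and then the two x/y-style ones, so each subgroup receives one transaction of each kind. The nearest-neighbour step is quadratic in the number of transactions; it is chosen here for brevity rather than efficiency.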
