ML-DS: A Novel Deterministic Sampling Algorithm for Association Rules Mining

Due to the explosive growth of data in every aspect of our lives, data mining algorithms often suffer from scalability issues. One effective way to tackle this problem is to employ sampling techniques. This paper introduces ML-DS, a novel deterministic sampling algorithm for mining association rules in large datasets. Unlike most algorithms in the literature, which rely on randomness in sampling, our algorithm is fully deterministic. Sampling proceeds in stages, and the sample in each stage is half the size of the sample in the previous stage. In any given stage, the data is partitioned into disjoint groups of equal size. A distance measure is used to determine the importance of each group in identifying accurate association rules, and the groups are sorted by this measure. Only the best 50% of the groups move on to the next stage. We perform as many stages of sampling as needed to produce a sample of the desired target size. The resulting sample is then used to identify association rules. Empirical results show that our approach outperforms simple randomized sampling in accuracy and is competitive with state-of-the-art sampling algorithms in terms of both time and accuracy.
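The staged halving described above can be illustrated with a minimal Python sketch. The abstract does not specify the distance measure, so this sketch assumes one for illustration only: the Euclidean distance between a group's single-item frequencies and those of the whole current-stage data, with smaller distances ranked as better. The names `item_frequencies`, `group_distance`, `ml_ds_sketch`, and the parameters `target_size` and `group_size` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import sqrt


def item_frequencies(transactions):
    """Relative frequency of each single item across the given transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}


def group_distance(group, reference):
    """ASSUMED measure (not from the paper): Euclidean distance between the
    group's item frequencies and the reference item frequencies."""
    freq = item_frequencies(group)
    # Sort the items so the floating-point sum is accumulated in a fixed order.
    items = sorted(set(freq) | set(reference))
    return sqrt(sum((freq.get(i, 0.0) - reference.get(i, 0.0)) ** 2 for i in items))


def ml_ds_sketch(transactions, target_size, group_size):
    """Halve the data stage by stage until it is no larger than target_size."""
    data = list(transactions)
    while len(data) > target_size and len(data) >= 2 * group_size:
        reference = item_frequencies(data)
        # Partition into disjoint groups of equal size; an incomplete tail is dropped.
        groups = [
            data[i:i + group_size]
            for i in range(0, len(data) - group_size + 1, group_size)
        ]
        # sorted() is stable, so ties keep their original order and no randomness is used.
        ranked = sorted(groups, key=lambda g: group_distance(g, reference))
        # Only the best half of the groups moves on to the next stage.
        kept = ranked[:max(1, len(groups) // 2)]
        data = [t for g in kept for t in g]
    return data


if __name__ == "__main__":
    toy = [["a", "b"], ["a"], ["b", "c"], ["a", "b", "c"]] * 4
    sample = ml_ds_sketch(toy, target_size=4, group_size=2)
    print(len(sample), "transactions kept out of", len(toy))
```

Because every step is a fixed partition followed by a stable sort, repeated runs on the same input return the same sample, which is the property that separates this style of sampling from randomized approaches. The sample would then be handed to any association rule miner; that step is omitted here.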
