Efficient data reduction with EASE

A variety of mining and analysis problems --- ranging from association-rule discovery to contingency table analysis to materialization of certain approximate datacubes --- involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is especially designed for count data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" --- appropriately defined --- from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments both in the context of association rule mining and classical χ2 contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.

[1]  Leslie G. Valiant,et al.  A theory of the learnable , 1984, CACM.

[2]  Jirí Matousek Derandomization in Computational Geometry , 2000, Handbook of Computational Geometry.

[3]  Hannu Toivonen,et al.  Sampling Large Databases for Association Rules , 1996, VLDB.

[4]  Srinivasan Parthasarathy,et al.  Evaluation of sampling for data mining of association rules , 1997, Proceedings Seventh International Workshop on Research Issues in Data Engineering. High Performance Database Management for Large-Scale Applications.

[5]  Bernard Chazelle,et al.  The discrepancy method - randomness and complexity , 2000 .

[6]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[7]  Bernard Chazelle,et al.  The Discrepancy Method , 1998, ISAAC.

[8]  Hamid Pirahesh,et al.  Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals , 1996, Data Mining and Knowledge Discovery.

[9]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[10]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[11]  Noga Alon,et al.  The Probabilistic Method , 2015, Fundamentals of Ramsey Theory.

[12]  H. Ni Product Range Spaces, Sensitive Sampling, and Derandomization , 1993 .

[13]  Bin Chen,et al.  Efficient Data-Reduction Methods for On-line Association Rule Discovery , 2004 .

[14]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[15]  Bin Chen,et al.  A new two-phase sampling based algorithm for discovering association rules , 2002, KDD.