Paper information - Efficient data reduction with EASE

Efficient data reduction with EASE

A variety of mining and analysis problems, ranging from association-rule discovery to contingency-table analysis to materialization of certain approximate datacubes, involve the extraction of knowledge from a set of categorical count data. Such data can be viewed as a collection of "transactions," where a transaction is a fixed-length vector of counts. Classical algorithms for solving count-data problems require one or more computationally intensive passes over the entire database and can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present a new data-reduction algorithm, called EASE, for producing such a sample. Like the FAST algorithm introduced by Chen et al., EASE is designed specifically for count-data applications. Both EASE and FAST take a relatively large initial random sample and then deterministically produce a subsample whose "distance" (appropriately defined) from the complete database is minimal. Unlike FAST, which obtains the final subsample by quasi-greedy descent, EASE uses epsilon-approximation methods to obtain the final subsample by a process of repeated halving. Experiments in the context of both association-rule mining and classical χ² contingency-table analysis show that EASE outperforms both FAST and simple random sampling, sometimes dramatically.
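The "repeated halving" idea in the abstract can be illustrated with a small sketch. This is not the EASE algorithm itself: the abstract does not spell out the epsilon-approximation procedure, so the sketch substitutes a simple greedy rule that keeps one transaction from each consecutive pair while balancing per-item counts. The function names (`item_frequencies`, `distance`, `halve`, `reduce_sample`), the representation of transactions as sets of items, and the L2 distance on item frequencies are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def item_frequencies(transactions):
    """Relative frequency of each item over a list of transactions (sets)."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {item: c / n for item, c in counts.items()}


def distance(sample, database):
    """Assumed distance: L2 gap between per-item frequencies."""
    fs, fd = item_frequencies(sample), item_frequencies(database)
    items = set(fs) | set(fd)
    return sqrt(sum((fs.get(i, 0.0) - fd.get(i, 0.0)) ** 2 for i in items))


def halve(sample):
    """Keep one transaction from each consecutive pair (greedy stand-in,
    not the paper's epsilon-approximation step). An unpaired last
    transaction is dropped."""
    disc = Counter()  # running (kept - discarded) count per item
    kept = []
    for a, b in zip(sample[0::2], sample[1::2]):
        only_a, only_b = a - b, b - a  # items shared by both cancel out
        # Keeping the transaction whose distinctive items currently have
        # the lower surplus minimizes the sum of squared discrepancies.
        if sum(disc[i] for i in only_a) <= sum(disc[i] for i in only_b):
            keep, gain, loss = a, only_a, only_b
        else:
            keep, gain, loss = b, only_b, only_a
        for i in gain:
            disc[i] += 1
        for i in loss:
            disc[i] -= 1
        kept.append(keep)
    return kept


def reduce_sample(sample, target_size):
    """Halve repeatedly while the result would still hold target_size items."""
    while len(sample) // 2 >= target_size:
        sample = halve(sample)
    return sample


initial_sample = [{"a"}, {"b"}] * 4
subsample = reduce_sample(initial_sample, 2)
print(len(subsample), distance(subsample, initial_sample))
```

In this toy input the item frequencies of the subsample match those of the initial sample exactly, whereas naively keeping the first transaction of every pair would retain only `{"a"}`. Real count data would not balance this cleanly, and the paper's method should be consulted for the actual halving criterion.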

[1] Leslie G. Valiant, et al. A theory of the learnable, 1984, CACM.

[2] Jiří Matoušek. Derandomization in Computational Geometry, 2000, Handbook of Computational Geometry.

[3] Hannu Toivonen, et al. Sampling Large Databases for Association Rules, 1996, VLDB.

[4] Srinivasan Parthasarathy, et al. Evaluation of sampling for data mining of association rules, 1997, Proceedings Seventh International Workshop on Research Issues in Data Engineering. High Performance Database Management for Large-Scale Applications.

[5] Bernard Chazelle, et al. The discrepancy method - randomness and complexity, 2000.

[6] Tomasz Imielinski, et al. Mining association rules between sets of items in large databases, 1993, SIGMOD Conference.

[7] Bernard Chazelle, et al. The Discrepancy Method, 1998, ISAAC.

[8] Hamid Pirahesh, et al. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals, 1996, Data Mining and Knowledge Discovery.

[9] Ramakrishnan Srikant, et al. Fast Algorithms for Mining Association Rules in Large Databases, 1994, VLDB.

[10] Ramakrishnan Srikant, et al. Fast algorithms for mining association rules, 1998, VLDB 1998.

[11] Noga Alon, et al. The Probabilistic Method, 2015, Fundamentals of Ramsey Theory.

[12] H. Ni. Product Range Spaces, Sensitive Sampling, and Derandomization, 1993.

[13] Bin Chen, et al. Efficient Data-Reduction Methods for On-line Association Rule Discovery, 2004.

[14] Jian Pei, et al. Mining frequent patterns without candidate generation, 2000, SIGMOD '00.

[15] Bin Chen, et al. A new two-phase sampling based algorithm for discovering association rules, 2002, KDD.