Efficient Data-Reduction Methods for On-line Association Rule Discovery

Classical data mining algorithms that require one or more computationally intensive passes over the entire database can be prohibitively slow. One effective method for dealing with this ever-worsening scalability problem is to run the algorithms on a small sample of the data. We present and empirically compare two data-reduction algorithms for producing such a sample; these algorithms, called FAST and EA, are tailored to “count” data applications such as association-rule mining. The algorithms are similar in that both attempt to produce a sample whose “distance” — appropriately defined — from the complete database is minimal. They differ greatly, however, in the way that they greedily search through the exponentially large space of possible samples. FAST, originally presented in [8], uses random sampling together with trimming of “outlier” transactions. On the other hand, the EA algorithm, introduced in the current paper, repeatedly and deterministically halves the data to obtain the final sample. Unlike FAST, the EA algorithm provides a guaranteed level of accuracy. Our experiments show that EA is more expensive to run than FAST, but yields more accurate results for a given sample size. Thus, depending on the specific problem under consideration, the user can trade off speed and accuracy by selecting the appropriate method. We conclude by showing how the EA data-reduction approach can potentially be adapted to provide data-reduction schemes for streaming data systems. The proposed schemes favor recent data while still retaining partial information about all of the data seen so far.
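To make the shared idea concrete, the Python sketch below shows how a deterministic halving scheme in the spirit of EA might be driven by a sample-to-database distance. Everything in it is an illustrative assumption rather than the paper's construction: the distance is taken to be the L1 difference between singleton-item frequencies in the sample and in the full database, and each round simply keeps whichever half of the current transaction set is closer to the database under that distance. The function names (`item_freqs`, `dist`, `halving_sample`) and the even/odd split rule are hypothetical; the actual EA halving and its accuracy guarantee are developed in the body of the paper.

```python
from collections import Counter

def item_freqs(transactions):
    """Relative frequency of each item across a list of transactions."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))
    n = max(len(transactions), 1)
    return {item: c / n for item, c in counts.items()}

def dist(sample, db_freqs):
    """Illustrative L1 distance between singleton-item frequencies in the
    sample and in the full database (a simplification, not the paper's
    distance)."""
    s_freqs = item_freqs(sample)
    items = set(db_freqs) | set(s_freqs)
    return sum(abs(s_freqs.get(i, 0.0) - db_freqs.get(i, 0.0)) for i in items)

def halving_sample(db, target_size):
    """Toy deterministic halving: repeatedly split the current set in two
    and keep the half closer to the full database, stopping once a further
    halving would undershoot the target sample size."""
    db_freqs = item_freqs(db)
    sample = list(db)
    while len(sample) // 2 >= target_size:
        half_a, half_b = sample[0::2], sample[1::2]  # deterministic split
        sample = min(half_a, half_b, key=lambda h: dist(h, db_freqs))
    return sample

# Example: reduce eight market-basket transactions to a sample of two.
db = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"a"},
      {"b"}, {"a", "b", "c"}, {"c"}, {"a", "b"}]
print(halving_sample(db, target_size=2))
```

Under the same distance, one could imagine a FAST-style counterpart that instead draws a random sample and trims “outlier” transactions whose removal decreases the distance, consistent with the contrast drawn in the abstract.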