Mining Optimized Association Rules for Numeric Attributes

Given a huge database, we address the problem of finding association rules for numeric attributes, such as(Balance?I)?(CardLoan=yes),which implies that bank customers whose balances fall in a rangeIare likely to use card loan with a probability greater thanp. The above rule is interesting only if the rangeIhas some special feature with respect to the interrelation betweenBalanceandCardLoan. It is required that the number of customers whose balances are contained inI(called thesupportofI) is sufficient and also that the probabilitypof the conditionCardLoan=yesbeing met (called theconfidence ratio) be much higher than the average probability of the condition over all the data. Our goal is to realize a system that finds such appropriate ranges automatically. We mainly focus on computing twooptimized ranges: one that maximizes the support on the condition that the confidence ratio is at least a given threshold value, and another that maximizes the confidence ratio on the condition that the support is at least a given threshold number. Using techniques from computational geometry, we present novel algorithms that compute the optimized ranges in linear time if the data are sorted. Since sorting data with respect to each numeric attribute is expensive in the case of huge databases that occupy much more space than the main memory, we instead apply randomized bucketing as the preprocessing method and thus obtain an efficient rule-finding system. Tests show that our implementation is fast not only in theory but also in practice. The efficiency of our algorithm enables us to compute optimized rules for all combinations of hundreds of numeric and Boolean attributes in a reasonable time.

[1]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[2]  Jorma Rissanen,et al.  SLIQ: A Fast Scalable Classifier for Data Mining , 1996, EDBT.

[3]  Jon Bentley,et al.  Programming pearls: algorithm design techniques , 1984, CACM.

[4]  Michael Stonebraker,et al.  DBMS Research at a Crossroads: The Vienna Update , 1993, VLDB.

[5]  Yasuhiko Morimoto,et al.  Computing Optimized Rectilinear Regions for Association Rules , 1997, KDD.

[6]  Rajeev Motwani,et al.  Randomized Algorithms , 1995, SIGA.

[7]  Jiawei Han,et al.  Knowledge Discovery in Databases: An Attribute-Oriented Approach , 1992, VLDB.

[8]  Jon Louis Bentley,et al.  Programming pearls , 1987, CACM.

[9]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[10]  Philip S. Yu,et al.  An effective hash-based algorithm for mining association rules , 1995, SIGMOD '95.

[11]  William Frawley,et al.  Knowledge Discovery in Databases , 1991 .

[12]  Ramakrishnan Srikant,et al.  Mining quantitative association rules in large relational tables , 1996, SIGMOD '96.

[13]  Tomasz Imielinski,et al.  Mining association rules between sets of items in large databases , 1993, SIGMOD Conference.

[14]  Gregory Piatetsky-Shapiro,et al.  Discovery, Analysis, and Presentation of Strong Rules , 1991, Knowledge Discovery in Databases.

[15]  Ramakrishnan Srikant,et al.  Fast algorithms for mining association rules , 1998, VLDB 1998.

[16]  Yasuhiko Morimoto,et al.  Data mining using two-dimensional optimized association rules: scheme, algorithms, and visualization , 1996, SIGMOD '96.

[17]  Michael Ian Shamos,et al.  Computational geometry: an introduction , 1985 .

[18]  Tomasz Imielinski,et al.  Database Mining: A Performance Perspective , 1993, IEEE Trans. Knowl. Data Eng..

[19]  Tomasz Imielinski,et al.  An Interval Classifier for Database Mining Applications , 1992, VLDB.