Maximum Entropy Based Significance of Itemsets

We consider the problem of defining the significance of an itemset. We say that an itemset is significant if its frequency is surprising given the frequencies of its sub-itemsets. In other words, we estimate the frequency of the itemset from the frequencies of its sub-itemsets and compute the deviation between the observed value and the estimate. For the estimation we use Maximum Entropy, and for measuring the deviation we use Kullback-Leibler divergence. A major advantage over previous methods is that we can use richer models, whereas previous approaches measure only the deviation from the independence model. We show that our measure of significance goes to zero for derivable itemsets and that we can use the rank as a statistical test. Our empirical results demonstrate that, for our real datasets, the independence assumption is too strong, but applying more flexible models leads to good results.
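The estimate-then-compare idea can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes that the data are rows of 0/1 values, that the model constrains the frequencies of all proper non-empty sub-itemsets, and that the deviation is the Kullback-Leibler divergence between the empirical distribution over the itemset's 0/1 configurations and the fitted Maximum Entropy distribution. The function names (`empirical_distribution`, `maxent_estimate`, `significance`) and the fixed number of sweeps are hypothetical choices made for illustration, and the fit uses plain iterative proportional fitting.

```python
from itertools import combinations, product
from math import log


def empirical_distribution(rows, items):
    """Empirical distribution over the 0/1 configurations of `items`."""
    counts = {cfg: 0 for cfg in product((0, 1), repeat=len(items))}
    for row in rows:
        counts[tuple(row[i] for i in items)] += 1
    return {cfg: c / len(rows) for cfg, c in counts.items()}


def frequency(dist, subset):
    """Probability that every position listed in `subset` equals 1."""
    return sum(p for cfg, p in dist.items() if all(cfg[i] for i in subset))


def maxent_estimate(emp, k, sweeps=200):
    """Iterative proportional fitting, started from the uniform distribution.

    Assumption: the constraints are the frequencies of all proper,
    non-empty sub-itemsets, taken from the empirical distribution `emp`.
    """
    q = {cfg: 1.0 / 2 ** k for cfg in emp}
    subsets = [s for r in range(1, k) for s in combinations(range(k), r)]
    for _ in range(sweeps):
        for s in subsets:
            target, current = frequency(emp, s), frequency(q, s)
            # Rescale the two cells of the partition induced by s so that
            # the frequency of s under q matches its target.
            up = target / current if current > 0 else 0.0
            down = (1 - target) / (1 - current) if current < 1 else 0.0
            for cfg in q:
                q[cfg] *= up if all(cfg[i] for i in s) else down
    return q


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats."""
    return sum(p[c] * log(p[c] / q[c]) for c in p if p[c] > 0)


def significance(rows, items):
    """Deviation of the observed distribution from the MaxEnt estimate."""
    emp = empirical_distribution(rows, items)
    return kl_divergence(emp, maxent_estimate(emp, len(items)))


# Two independent items: the estimate matches the data, so the score is ~0.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# Parity data: every single item and every pair looks uniform, yet the
# three items together are fully dependent, so the score is large.
parity = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

print(significance(independent, (0, 1)))
print(significance(parity, (0, 1, 2)))
```

In this sketch, restricting the constraint set `subsets` to singletons would reduce the estimate to the independence model, while including larger sub-itemsets gives the richer models mentioned above. The itemset passed to `significance` should contain at least two items.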
