Maximum entropy models and subjective interestingness: an application to tiles in binary databases

Recent research has highlighted the practical benefits of subjective interestingness measures, which quantify the novelty or unexpectedness of a pattern when contrasted with the prior information held by the data miner (Silberschatz and Tuzhilin, Proceedings of the 1st ACM SIGKDD international conference on Knowledge discovery and data mining (KDD95), 1995; Geng and Hamilton, ACM Comput Surv 38(3):9, 2006). A key challenge is to formalize this prior information in a way that lends itself to a subjective interestingness measure that is both meaningful and practical. In this paper, we outline a general strategy for achieving this, before working out the details for a use case that is important in its own right. Our general strategy is to treat prior information as constraints on a probabilistic model representing the uncertainty about the data. More specifically, we represent the prior information by the maximum entropy (MaxEnt) distribution subject to these constraints. We briefly outline various measures that can subsequently be used to contrast patterns with this MaxEnt model, thus quantifying their subjective interestingness. We demonstrate this strategy for rectangular databases with knowledge of the row and column sums. This situation has been considered before using computationally intensive approaches based on swap randomizations, which allow empirical p-values to be computed as interestingness measures (Gionis et al., ACM Trans Knowl Discov Data 1(3):14, 2007). We show how the MaxEnt model can be computed remarkably efficiently in this situation, and how it can serve the same purpose as swap randomizations at lower computational cost. More importantly, being an explicitly represented distribution, the MaxEnt model can additionally be used to define analytically computable interestingness measures, as we demonstrate for tiles (Geerts et al., Proceedings of the 7th international conference on Discovery science (DS04), 2004) in binary databases.
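
To make the row and column sum case concrete, the following is a minimal sketch rather than the paper's own algorithm. It assumes the constraints are placed on the expected row and column sums; under that assumption the MaxEnt distribution factorizes into independent Bernoulli entries with P(D_ij = 1) = 1 / (1 + exp(-(lam_i + mu_j))). The function names, the plain gradient descent on the convex dual, and the step size are illustrative choices, not taken from the source.

```python
import numpy as np

def fit_maxent_margins(D, max_iter=20000, tol=1e-10):
    """Fit P[i, j] = sigmoid(lam[i] + mu[j]) so that the expected row and
    column sums of P match the observed sums of the binary matrix D.

    Assumption: no row or column of D is all zeros or all ones; otherwise
    the optimal parameters are infinite and this loop will not converge.
    """
    D = np.asarray(D, dtype=float)
    m, n = D.shape
    r, c = D.sum(axis=1), D.sum(axis=0)
    lam, mu = np.zeros(m), np.zeros(n)
    step = 2.0 / max(m, n)  # conservative fixed step for the convex dual
    for _ in range(max_iter):
        P = 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))
        g_row = P.sum(axis=1) - r  # dual gradient w.r.t. lam
        g_col = P.sum(axis=0) - c  # dual gradient w.r.t. mu
        if max(np.abs(g_row).max(), np.abs(g_col).max()) < tol:
            break
        lam -= step * g_row
        mu -= step * g_col
    # Recompute P from the final parameters before returning.
    return 1.0 / (1.0 + np.exp(-(lam[:, None] + mu[None, :])))

def expected_ones_in_tile(P, rows, cols):
    """Expected number of ones in the tile rows x cols under the fitted model."""
    return float(P[np.ix_(rows, cols)].sum())

if __name__ == "__main__":
    D = np.array([[1, 1, 0, 1, 0],
                  [1, 0, 0, 1, 0],
                  [0, 1, 1, 1, 0],
                  [0, 1, 0, 0, 1]])
    P = fit_maxent_margins(D)
    print(np.round(P, 3))
    print(expected_ones_in_tile(P, [0, 1], [0, 3]))
```

Because the fitted model is an explicit matrix of probabilities, quantities such as the expected number of ones in a tile can be read off directly, without sampling randomized databases. A tile whose observed count of ones greatly exceeds this expectation is unexpected relative to the row and column sums alone.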

[1]  Heikki Mannila.  Randomization Techniques for Data Mining Methods , 2008, ADBIS.

[2]  Alexandr A. Savinov.  Mining dependence rules by finding largest itemset support quota , 2004, SAC '04.

[3]  John Skilling,et al.  Maximum entropy method in image processing , 1984 .

[4]  Nello Cristianini,et al.  MINI: Mining Informative Non-redundant Itemsets , 2007, PKDD.

[5]  Geert Wets,et al.  Using association rules for product assortment decisions: a case study , 1999, KDD '99.

[6]  Garry Robins,et al.  An introduction to exponential random graph (p*) models for social networks , 2007, Soc. Networks.

[7]  Albert-László Barabási,et al.  Emergence of scaling in random networks , 1999, Science.

[8]  Heikki Mannila,et al.  Randomization of real-valued matrices for assessing the significance of data mining results , 2008, SDM.

[9]  Bart Goethals,et al.  Tiling Databases , 2004, Discovery Science.

[10]  Szymon Jaroszewicz,et al.  Interestingness of frequent itemsets using Bayesian networks as background knowledge , 2004, KDD.

[11]  Abraham Silberschatz,et al.  On Subjective Measures of Interestingness in Knowledge Discovery , 1995, KDD.

[12]  Tijl De Bie,et al.  An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases , 2010, SDM.

[13]  Balaji Padmanabhan,et al.  Small is beautiful: discovering the minimal set of unexpected patterns , 2000, KDD '00.

[14]  Mohammed J. Zaki,et al.  CHARM: An Efficient Algorithm for Closed Itemset Mining , 2002, SDM.

[15]  Howard J. Hamilton,et al.  Interestingness measures for data mining: A survey , 2006, CSUR.

[16]  Myron Tribus,et al.  Thermostatics and thermodynamics : an introduction to energy, information and states of matter, with engineering applications , 1961 .

[17]  James E. Gentle,et al.  Elements of computational statistics , 2002 .

[18]  E. Jaynes.  Information Theory and Statistical Mechanics , 1957 .

[19]  Tijl De Bie,et al.  Explicit probabilistic models for databases and networks , 2009, ArXiv.

[20]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[21]  Heikki Mannila,et al.  Tell me something I don't know: randomization strategies for iterative data mining , 2009, KDD.

[22]  Nikolaj Tatti,et al.  Maximum entropy based significance of itemsets , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[23]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[24]  Aristides Gionis,et al.  Geometric and Combinatorial Tiles in 0-1 Data , 2004, PKDD.

[25]  Mark E. J. Newman,et al.  The Structure and Function of Complex Networks , 2003, SIAM Rev..

[26]  E. Jaynes.  On the rationale of maximum-entropy methods , 1982, Proceedings of the IEEE.

[27]  Samir Khuller,et al.  The Budgeted Maximum Coverage Problem , 1999, Inf. Process. Lett..

[28]  S. Shen-Orr,et al.  Network Motifs: Simple Building Blocks of Complex Networks , 2002 .

[29]  E. Lehmann.  Testing Statistical Hypotheses , 1997 .

[30]  Flemming Topsøe,et al.  Information-theoretical optimization techniques , 1979, Kybernetika.

[31]  G. Rasch.  On General Laws and the Meaning of Measurement in Psychology , 1961 .

[32]  S. Ravi.  Testing Statistical Hypotheses, 3rd edn by E. L. Lehmann and J. P. Romano , 2007 .

[33]  Balaji Padmanabhan,et al.  A Belief-Driven Method for Discovering Unexpected Patterns , 1998, KDD.

[34]  Heikki Mannila,et al.  Beyond Independence: Probabilistic Models for Query Approximation on Binary Transaction Data , 2003, IEEE Trans. Knowl. Data Eng..

[35]  Pauli Miettinen,et al.  The Discrete Basis Problem , 2006, IEEE Transactions on Knowledge and Data Engineering.

[36]  Toon Calders.  Itemset frequency satisfiability: Complexity and axiomatization , 2008, Theor. Comput. Sci..

[37]  Michel Minoux,et al.  Accelerated greedy algorithms for maximizing submodular set functions , 1978 .

[38]  Tijl De Bie,et al.  Finding interesting itemsets using a probabilistic model for binary databases , 2009 .

[39]  S. Shen-Orr,et al.  Network motifs: simple building blocks of complex networks. , 2002, Science.

[40]  J. Shewchuk.  An Introduction to the Conjugate Gradient Method Without the Agonizing Pain , 1994 .

[41]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008, Found. Trends Mach. Learn..

[42]  Fan Chung Graham,et al.  The Average Distance in a Random Graph with Given Expected Degrees , 2004, Internet Math..

[43]  Nello Cristianini,et al.  From frequent itemsets to informative patterns , 2009 .

[44]  Aristides Gionis,et al.  Assessing data mining results via swap randomization , 2007, TKDD.

[45]  Luc De Raedt,et al.  Constraint-Based Pattern Set Mining , 2007, SDM.

[46]  Christopher Findeisen.  Tell Me Something I Don’t Know (If You Can): The Pragmatic Challenge to Subjectivity in Frost and Stevens , 2008 .

[47]  Pauli Miettinen,et al.  The Discrete Basis Problem , 2008, IEEE Trans. Knowl. Data Eng..

[48]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[49]  Jilles Vreeken,et al.  Item Sets that Compress , 2006, SDM.