A framework for mining interesting pattern sets

This paper proposes a framework for mining subjectively interesting pattern sets that is based on two components: (1) the encoding of prior information in a model for the data miner's state of mind; (2) the search for a pattern set that is maximally informative while remaining efficient to convey to the data miner. We illustrate the framework with an instantiation for tile patterns in binary databases where prior information on the row and column marginals is available. This instantiation implements component (1) by constructing the MaxEnt model with respect to the prior information [2, 3], and component (2) by relying on concepts from information theory and coding theory. We provide a brief overview of a number of possible extensions and future research challenges, including a key challenge related to the design of empirical evaluations for subjective interestingness measures.
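
To make the two components concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the prior information fixes the expected row and column sums of a binary database, in which case the MaxEnt model is commonly written as a product of independent Bernoulli variables with success probability p_ij = 1 / (1 + exp(-(lam_i + mu_j))). The fitting routine (plain gradient ascent on the dual), the description-length formula (naming each row and column index of a tile), and all function names are illustrative assumptions.

```python
import math


def _probabilities(lam, mu):
    # p[i][j] = sigmoid(lam[i] + mu[j]): assumed MaxEnt form for marginal constraints.
    return [[1.0 / (1.0 + math.exp(-(a + b))) for b in mu] for a in lam]


def fit_maxent_marginals(row_sums, col_sums, iters=2000, step=1.0):
    """Fit row parameters lam and column parameters mu so that the expected
    row and column sums of the model match the given marginals."""
    n, m = len(row_sums), len(col_sums)
    lam = [0.0] * n
    mu = [0.0] * m
    for _ in range(iters):
        p = _probabilities(lam, mu)
        # Dual gradient: observed marginal minus expected marginal.
        for i in range(n):
            lam[i] += step * (row_sums[i] - sum(p[i])) / m
        for j in range(m):
            mu[j] += step * (col_sums[j] - sum(p[i][j] for i in range(n))) / n
    return _probabilities(lam, mu)


def tile_information_bits(p, rows, cols):
    """Self-information (in bits) of learning that every cell of the tile is 1."""
    return -sum(math.log2(p[i][j]) for i in rows for j in cols)


def tile_description_bits(rows, cols, n_rows, n_cols):
    """Hypothetical description cost: name each row and column index of the tile."""
    return len(rows) * math.log2(n_rows) + len(cols) * math.log2(n_cols)


def tile_interestingness(p, rows, cols):
    """Information conveyed per bit of description (an assumed trade-off)."""
    n_rows, n_cols = len(p), len(p[0])
    return tile_information_bits(p, rows, cols) / tile_description_bits(
        rows, cols, n_rows, n_cols
    )


if __name__ == "__main__":
    # Marginals of a small 4 x 4 binary database (illustrative values).
    p = fit_maxent_marginals([3, 2, 2, 2], [3, 3, 2, 1])
    for rows, cols in [([0], [0]), ([3], [3]), ([0, 1], [0, 1])]:
        print(rows, cols, round(tile_interestingness(p, rows, cols), 3))
```

Under this sketch, a tile covering cells that the prior considers unlikely to contain ones carries more information than a tile in a dense region, while tiles that span more rows and columns cost more to describe. A search procedure would then look for tiles, or sets of tiles, that score well on this trade-off.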

[1] Leon Gordon Kraft, et al. A device for quantizing, grouping, and coding amplitude-modulated pulses, 1949.

[2] Abraham Silberschatz, et al. On Subjective Measures of Interestingness in Knowledge Discovery, 1995, KDD.

[3] Balaji Padmanabhan, et al. A Belief-Driven Method for Discovering Unexpected Patterns, 1998, KDD.

[4] Balaji Padmanabhan, et al. Small is beautiful: discovering the minimal set of unexpected patterns, 2000, KDD '00.

[5] Mohammed J. Zaki, et al. CHARM: An Efficient Algorithm for Closed Itemset Mining, 2002, SDM.

[6] Szymon Jaroszewicz, et al. Interestingness of frequent itemsets using Bayesian networks as background knowledge, 2004, KDD.

[7] Bart Goethals, et al. Tiling Databases, 2004, Discovery Science.

[8] J. Winderickx, et al. Inferring transcriptional modules from ChIP-chip, motif and microarray data, 2006, Genome Biology.

[9] Jilles Vreeken, et al. Item Sets that Compress, 2006, SDM.

[10] Howard J. Hamilton, et al. Interestingness measures for data mining: A survey, 2006, CSUR.

[11] Luc De Raedt, et al. Constraint-Based Pattern Set Mining, 2007, SDM.

[12] Albrecht Zimmermann, et al. The Chosen Few: On Identifying Valuable Patterns, 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[13] Nello Cristianini, et al. MINI: Mining Informative Non-redundant Itemsets, 2007, PKDD.

[14] Aristides Gionis, et al. Assessing data mining results via swap randomization, 2007, TKDD.

[15] Nikolaj Tatti. Maximum Entropy Based Significance of Itemsets, 2007, ICDM.

[16] Heikki Mannila, et al. Randomization of real-valued matrices for assessing the significance of data mining results, 2008, SDM.

[17] Michael I. Jordan, et al. Graphical Models, Exponential Families, and Variational Inference, 2008, Found. Trends Mach. Learn.

[18] Tijl De Bie, et al. Explicit probabilistic models for databases and networks, 2009, ArXiv.

[19] Nello Cristianini, et al. From frequent itemsets to informative patterns, 2009.

[20] Heikki Mannila, et al. Tell me something I don't know: randomization strategies for iterative data mining, 2009, KDD.

[21] Tijl De Bie, et al. An Information-Theoretic Approach to Finding Informative Noisy Tiles in Binary Databases, 2010, SDM.

[22] Gemma C. Garriga, et al. Evaluating Query Result Significance in Databases via Randomizations, 2010, SDM.

[23] Tijl De Bie, et al. Maximum entropy models and subjective interestingness: an application to tiles in binary databases, 2010, Data Mining and Knowledge Discovery.

[24] Bart Goethals, et al. Mining interesting sets and rules in relational databases, 2010, SAC '10.