Diverse subgroup set discovery

Large data is challenging for most existing discovery algorithms, for several reasons. First of all, such data leads to enormous hypothesis spaces, making exhaustive search infeasible. Second, many variants of essentially the same pattern exist, due to (numeric) attributes of high cardinality, correlated attributes, and so on. This causes top-k mining algorithms to return highly redundant result sets, while ignoring many potentially interesting results. These problems are particularly apparent with subgroup discovery (SD) and its generalisation, exceptional model mining. To address this, we introduce subgroup set discovery: one should not consider individual subgroups, but sets of subgroups. We consider three degrees of redundancy, and propose corresponding heuristic selection strategies in order to eliminate redundancy. By incorporating these (generic) subgroup selection methods in a beam search, the aim is to improve the balance between exploration and exploitation. The proposed algorithm, dubbed DSSD for diverse subgroup set discovery, is experimentally evaluated and compared to existing approaches. For this, a variety of target types with corresponding datasets and quality measures is used. The subgroup sets that are discovered by the competing methods are evaluated primarily on the following three criteria: (1) diversity in the subgroup covers (exploration), (2) the maximum quality found (exploitation), and (3) runtime. The results show that DSSD outperforms each traditional SD method on all or a (non-empty) subset of these criteria, depending on the specific setting. The more complex the task, the larger the benefit of using our diverse heuristic search turns out to be.
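The idea of selecting a diverse set of subgroups for the beam, instead of simply the top-k by quality, can be illustrated with a minimal sketch. This is not the DSSD implementation: the function name `select_diverse`, the `(description, quality, cover)` tuple layout, and the multiplicative penalty `alpha` are all assumptions made for illustration. It shows only one plausible cover-based heuristic, in which a candidate's quality is discounted when the records it covers are already covered by previously selected subgroups.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical candidate layout: (description, quality, set of covered record ids).
Candidate = Tuple[str, float, FrozenSet[int]]


def select_diverse(candidates: List[Candidate], k: int,
                   alpha: float = 0.5) -> List[Candidate]:
    """Greedily pick up to k candidates, discounting overlapping covers.

    Each record's weight is alpha ** (times it is already covered), so a
    candidate whose cover repeats earlier selections scores lower.
    alpha is an assumed penalty parameter in (0, 1].
    """
    counts: Dict[int, int] = {}      # how often each record is already covered
    remaining = list(candidates)
    selected: List[Candidate] = []

    def score(c: Candidate) -> float:
        _, quality, cover = c
        if not cover:
            return 0.0
        # Average record weight over the candidate's cover.
        weight = sum(alpha ** counts.get(r, 0) for r in cover) / len(cover)
        return quality * weight

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        remaining.remove(best)
        selected.append(best)
        for r in best[2]:
            counts[r] = counts.get(r, 0) + 1
    return selected


if __name__ == "__main__":
    cands = [
        ("A", 0.90, frozenset({1, 2, 3})),
        ("B", 0.85, frozenset({1, 2, 3})),   # same cover as A: redundant
        ("C", 0.60, frozenset({7, 8, 9})),   # lower quality, different cover
    ]
    print([c[0] for c in select_diverse(cands, k=2)])
```

In a beam search, a selection step of this kind would be applied at each level to choose which candidates are kept for further refinement. A plain top-2 selection on the example data would keep the two variants with identical covers; the discounted selection keeps one of them plus the subgroup that covers different records.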
