Robust subgroup discovery

Hugo Manuel Proença, LIACS, Niels Bohrweg 1, 2333 CA Leiden, Netherlands. E-mail: h.manuel.proenca@liacs.leidenuniv.nl
Peter Grünwald, CWI, Science Park 123, 1098 XG Amsterdam. E-mail: peter.grunwald@cwi.nl
Thomas Bäck, LIACS, Niels Bohrweg 1, 2333 CA Leiden, Netherlands. E-mail: t.h.w.baeck@liacs.leidenuniv.nl
Matthijs van Leeuwen, LIACS, Niels Bohrweg 1, 2333 CA Leiden, Netherlands. E-mail: m.van.leeuwen@liacs.leidenuniv.nl
arXiv:2103.13686v2 [cs.LG] 28 Nov 2021

We introduce the problem of robust subgroup discovery, i.e., finding a set of interpretable descriptions of subsets that 1) stand out with respect to one or more target attributes, 2) are statistically robust, and 3) are non-redundant. Many attempts have been made either to mine locally robust subgroups or to tackle the pattern explosion, but we are the first to address both challenges at the same time from a global modelling perspective. First, we formulate the broad model class of subgroup lists, i.e., ordered sets of subgroups, for univariate and multivariate targets that can consist of nominal or numeric variables; this class includes traditional top-1 subgroup discovery in its definition. This novel model class allows us to formalise the problem of optimal robust subgroup discovery using the Minimum Description Length (MDL) principle, where we resort to optimal Normalised Maximum Likelihood and Bayesian encodings for nominal and numeric targets, respectively. Second, as finding optimal subgroup lists is NP-hard, we propose SSD++, a greedy heuristic that finds good subgroup lists and guarantees that the most significant subgroup found according to the MDL criterion is added in each iteration. This criterion is shown to be equivalent to a Bayesian one-sample proportions, multinomial, or t-test between the subgroup and dataset marginal target distributions, plus a multiple hypothesis testing penalty. We empirically show on 54 datasets that SSD++ outperforms previous subgroup set discovery methods in terms of quality and subgroup list size.
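
To make the notion of a subgroup list more concrete, the following is a minimal sketch of how an ordered set of subgroups could partition a dataset for a single numeric target. It is not the authors' SSD++ implementation: the names `Subgroup`, `first_match_partition`, and `subgroup_means` are hypothetical, and the sketch assumes first-match semantics (each row belongs to the first subgroup whose description covers it) with a default rule that reports the dataset marginal mean. It contains no MDL scoring and no search.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional, Sequence

Row = Dict[str, float]


@dataclass
class Subgroup:
    """A description (hypothetical representation) and the predicate it stands for."""
    description: str
    covers: Callable[[Row], bool]


def first_match_partition(rows: Sequence[Row], subgroups: Sequence[Subgroup]) -> List[int]:
    """Map each row to the index of the first subgroup covering it, or -1 for the default rule."""
    assignment = []
    for row in rows:
        idx = next((i for i, sg in enumerate(subgroups) if sg.covers(row)), -1)
        assignment.append(idx)
    return assignment


def subgroup_means(
    rows: Sequence[Row], targets: Sequence[float], subgroups: Sequence[Subgroup]
) -> Dict[str, Optional[float]]:
    """Mean target per subgroup under first-match semantics.

    Assumption of this sketch: the default rule reports the mean over the
    whole dataset (the dataset marginal), not only over uncovered rows.
    """
    assignment = first_match_partition(rows, subgroups)
    result: Dict[str, Optional[float]] = {}
    for i, sg in enumerate(subgroups):
        values = [t for t, a in zip(targets, assignment) if a == i]
        result[sg.description] = mean(values) if values else None
    result["default (dataset marginal)"] = mean(targets)
    return result


# Toy data, invented purely for illustration.
rows = [
    {"age": 25, "hours": 50},
    {"age": 40, "hours": 45},
    {"age": 30, "hours": 20},
    {"age": 60, "hours": 10},
]
targets = [10.0, 50.0, 30.0, 70.0]
subgroup_list = [
    Subgroup("age < 35", lambda r: r["age"] < 35),
    Subgroup("hours > 40", lambda r: r["hours"] > 40),
]
summary = subgroup_means(rows, targets, subgroup_list)
```

Because the list is ordered, the first row (age 25, hours 50) is assigned to "age < 35" even though "hours > 40" also covers it. In this sketch, that ordering is what keeps later subgroups from re-describing rows already explained by earlier ones.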
