Optimal Subgroup Discovery in Purely Numerical Data

Subgroup discovery in labeled data is the task of finding patterns in the description space of objects that identify subsets of objects whose labels show an interesting distribution, for example a disproportionate representation of one label value. Discovering interesting subgroups in purely numerical data, where both the attributes and the target label are numerical, has received little attention so far. Existing methods rely on discretization, which leads to a loss of information and suboptimal results; this is the case for the reference algorithm SD-Map*. Here we consider the discovery of optimal subgroups, according to an interestingness measure, in purely numerical data. We leverage the concept of closed interval patterns together with advanced enumeration and pruning techniques. The performance of our algorithm is studied empirically, and its added value w.r.t. SD-Map* is illustrated.
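To make the notions of interval pattern, closure, and interestingness measure concrete, the following is a minimal brute-force sketch, not the algorithm of the paper. It assumes a toy dataset (`DATA`), a mean-shift quality measure of the form |S|^a * (mean(S) - mean(D)), and hypothetical helper names (`cover`, `closure`, `quality`, `best_closed_subgroup`); none of these come from the abstract, and the enumeration and pruning techniques it mentions are deliberately left out.

```python
from itertools import combinations_with_replacement, product
from statistics import mean

# Toy data (hypothetical): each row is (numerical attribute values, numerical target).
DATA = [
    ((1.0, 5.0), 2.0),
    ((2.0, 6.0), 9.0),
    ((3.0, 6.0), 8.0),
    ((4.0, 1.0), 1.0),
]


def cover(pattern, data):
    """Rows whose every attribute value lies in the matching closed interval."""
    return [
        (x, y)
        for x, y in data
        if all(lo <= v <= hi for v, (lo, hi) in zip(x, pattern))
    ]


def closure(rows):
    """Tightest interval pattern (bounding box) that still covers the given rows."""
    columns = list(zip(*(x for x, _ in rows)))
    return tuple((min(c), max(c)) for c in columns)


def quality(rows, data, a=0.5):
    """Assumed measure: |S|^a * (mean target in S - mean target in the data)."""
    overall = mean(y for _, y in data)
    return len(rows) ** a * (mean(y for _, y in rows) - overall)


def best_closed_subgroup(data):
    """Enumerate every interval pattern with observed bounds; keep the best closed one."""
    n_attrs = len(data[0][0])
    per_attr = []
    for j in range(n_attrs):
        values = sorted({x[j] for x, _ in data})
        # All (lo, hi) pairs with lo <= hi taken from the observed values.
        per_attr.append(list(combinations_with_replacement(values, 2)))

    best = None
    seen = set()  # closed patterns already scored
    for pattern in product(*per_attr):
        rows = cover(pattern, data)
        if not rows:
            continue
        closed = closure(rows)
        if closed in seen:
            continue  # same subgroup as an earlier pattern
        seen.add(closed)
        q = quality(rows, data)
        if best is None or q > best[0]:
            best = (q, closed, len(rows))
    return best


if __name__ == "__main__":
    print(best_closed_subgroup(DATA))
```

Because no attribute is discretized beforehand, every interval with observed bounds is a candidate, and many candidates cover exactly the same rows. Scoring only the closed pattern of each cover avoids that redundancy; a practical method would additionally need pruning, since the number of candidate intervals grows quickly with the number of distinct values.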
