Fast and Memory-Efficient Discovery of the Top-k Relevant Subgroups in a Reduced Candidate Space

We consider a modified version of the top-k subgroup discovery task, where subgroups dominated by other subgroups are discarded. The advantage of this modified task, known as relevant subgroup discovery, is that it avoids redundancy in the outcome. Although it has been applied in many applications, so far no efficient exact algorithm for this task has been proposed. Most existing solutions do not guarantee the exact solution (as a result of the use of non-admissible heuristics), while the only exact solution relies on the explicit storage of the whole search space, which results in prohibitively large memory requirements. In this paper, we present a new top-k relevant subgroup discovery algorithm which overcomes these shortcomings. Our solution is based on the fact that if an iterative deepening approach is applied, the relevance check - which is the root of the problems of all other approaches - can be realized based solely on the best k subgroups visited so far. The approach also allows for the integration of admissible pruning techniques like optimistic estimate pruning. The result is a fast, memory-efficient algorithm which clearly outperforms existing top-k relevant subgroup discovery approaches. Moreover, we analytically and empirically show that it is competitive with simpler approaches which do not consider the relevance criterion.

[1]  Jan Komorowski,et al.  Principles of Data Mining and Knowledge Discovery , 2001, Lecture Notes in Computer Science.

[2]  Florian Lemmerich,et al.  Fast Subgroup Discovery for Continuous Target Concepts , 2009, ISMIS.

[3]  Luc De Raedt,et al.  Correlated itemset mining in ROC space: a constraint programming approach , 2009, KDD.

[4]  Nada Lavrac,et al.  A Study of Relevance for Learning in Deductive Databases , 1999, J. Log. Program..

[5]  Shusaku Tsumoto,et al.  Foundations of Intelligent Systems, 15th International Symposium, ISMIS 2005, Saratoga Springs, NY, USA, May 25-28, 2005, Proceedings , 2005, ISMIS.

[6]  Nicolas Pasquier,et al.  Efficient Mining of Association Rules Using Closed Itemset Lattices , 1999, Inf. Syst..

[7]  Henrik Grosskreutz,et al.  Non-redundant Subgroup Discovery Using a Closure System , 2009, ECML/PKDD.

[8]  Stefan Wrobel,et al.  An Algorithm for Multi-relational Discovery of Subgroups , 1997, PKDD.

[9]  N. Lavra,et al.  EXPERIMENTAL COMPARISON OF THREE SUBGROUP DISCOVERY ALGORITHMS: ANALYSING BRAIN ISCHAEMIA DATA , 2005 .

[10]  Willi Klösgen,et al.  Explora: A Multipattern and Multistrategy Discovery Assistant , 1996, Advances in Knowledge Discovery and Data Mining.

[11]  Nada Lavrac,et al.  Closed Sets for Labeled Data , 2006, PKDD.

[12]  Florian Lemmerich,et al.  Fast Discovery of Relevant Subgroup Patterns , 2010, FLAIRS Conference.

[13]  Peter A. Flach,et al.  Subgroup Discovery with CN2-SD , 2004, J. Mach. Learn. Res..

[14]  Peter A. Flach,et al.  Evaluation Measures for Multi-class Subgroup Discovery , 2009, ECML/PKDD.

[15]  Nada Lavrac,et al.  Relevancy in Constraint-Based Subgroup Discovery , 2004, Constraint-Based Mining and Inductive Databases.

[16]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD '00.

[17]  Hiroki Arimura,et al.  An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases , 2004, Discovery Science.

[18]  Willi Klösgen,et al.  Knowledge Discovery in Databases and Data Mining , 1996, ISMIS.

[19]  Johannes Fürnkranz,et al.  From Local Patterns to Global Models: The LeGo Approach to Data Mining , 2008 .

[20]  Stefan Wrobel,et al.  Tight Optimistic Estimates for Fast Subgroup Discovery , 2008, ECML/PKDD.

[21]  Siegfried Nijssen,et al.  Pattern-Based Classification: A Unifying Perspective , 2011, ArXiv.

[22]  Peter Norvig,et al.  Artificial Intelligence: A Modern Approach , 1995 .

[23]  Luc De Raedt,et al.  Constraint-Based Mining and Inductive Databases, European Workshop on Inductive Databases and Constraint Based Mining, Hinterzarten, Germany, March 11-13, 2004, Revised Selected Papers , 2005, Constraint-Based Mining and Inductive Databases.

[24]  瀬々 潤,et al.  Traversing Itemset Lattices with Statistical Metric Pruning (小特集 「発見科学」及び一般演題) , 2000 .

[25]  Albrecht Zimmermann,et al.  The Chosen Few: On Identifying Valuable Patterns , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[26]  Shinichi Morishita,et al.  Transversing itemset lattices with statistical metric pruning , 2000, PODS '00.