Discovering statistically non-redundant subgroups

Subgroup discovery aims to find groups of individuals in a large data set that differ statistically from the rest. Most existing subgroup quality measures are intuitive rather than statistical: they do not precisely capture the statistical difference between a group and the others, and the results they produce contain many redundant subgroups. The odds ratio is a statistically sound measure of the difference between two groups with respect to a given outcome, which makes it well suited to quantifying subgroup quality. In this paper, we propose a statistically sound framework for statistically non-redundant subgroup discovery: subgroup quality is measured by the odds ratio, and statistically non-redundant subgroups are defined by the error bounds of odds ratios. We show that the proposed method is faster than most existing methods and discovers the complete set of statistically non-redundant subgroups.
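
As a rough illustration of the two ingredients named above, the following Python sketch computes an odds ratio from a 2x2 contingency table and its error bounds using the standard log-based (Woolf) confidence interval. The function names, the example counts, and the overlap rule used to flag redundancy are assumptions made for illustration; the abstract does not state the exact definition of statistical non-redundancy, so this should not be read as the paper's algorithm.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio of a 2x2 table (all cells assumed non-zero).

    a: in subgroup, outcome present     b: in subgroup, outcome absent
    c: outside subgroup, outcome present d: outside subgroup, outcome absent
    """
    return (a * d) / (b * c)


def odds_ratio_interval(a, b, c, d, z=1.96):
    """Approximate confidence interval for the odds ratio (Woolf method).

    The standard error is computed on the log scale; z=1.96 gives
    roughly a 95% interval.
    """
    log_or = math.log(odds_ratio(a, b, c, d))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)


def intervals_overlap(first, second):
    """True if two (low, high) intervals share at least one point."""
    return first[0] <= second[1] and second[0] <= first[1]


# Hypothetical counts: a general subgroup and a more specific one.
general = (40, 60, 100, 800)
specific = (20, 25, 120, 835)

general_ci = odds_ratio_interval(*general)
specific_ci = odds_ratio_interval(*specific)

# ASSUMPTION (not from the paper): treat the more specific subgroup as
# statistically redundant when its interval overlaps the general one,
# i.e. its odds ratio is not distinguishable from the general subgroup's.
is_redundant = intervals_overlap(general_ci, specific_ci)
print(odds_ratio(*general), general_ci, specific_ci, is_redundant)
```

Under this assumed rule, a more specific subgroup would be reported only when its odds ratio interval is clearly separated from that of the subgroup it refines. Tables with a zero cell would need a continuity correction, which the sketch omits.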
