Cost-based quality measures in subgroup discovery

We consider data in which examples are not only labeled in the classical sense (positive or negative) but also have costs associated with them. Each example thus has two target attributes, and we aim to find clearly defined subsets of the data in which the values of these two targets have an unusual distribution. In other words, we focus on a Subgroup Discovery task with a somewhat unusual target concept and investigate quality measures that take into account both the binary target and the cost target. In defining such quality measures, we aim to produce an interpretable valuation of a subgroup, so that data analysts can directly appreciate the findings and relate them to monetary gains or losses. Our work is particularly relevant in the domain of health care fraud detection. In this domain, the binary target identifies the patients of a specific medical practitioner under investigation, and the cost target specifies the money spent on each patient. When looking for differences in claim behavior, we need to take into account both the ‘positive’ examples (patients of the practitioner) and the ‘negative’ examples (other patients), as well as the costs of all patients. A typical subgroup will list a number of treatments, together with how the target practitioner’s patients differ from other patients in both treatment prevalence and associated costs. An additional angle is the Local Subgroup Discovery task, where subgroups are judged by their difference from a local reference group instead of the entire dataset. We show how the cost-based analysis of data specifically fits this local focus.
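
To make the idea of a quality measure over two targets concrete, the following is a minimal sketch of one possible cost-based measure. It is an illustration only and is not necessarily one of the measures defined in the paper. The `Example` record, the `cost_difference_quality` function, and the toy data are all hypothetical. The measure multiplies the number of positive examples covered by a subgroup by the difference in mean cost between the covered positives and the covered negatives, so the result can be read as a monetary amount.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, FrozenSet, Sequence


@dataclass(frozen=True)
class Example:
    """Hypothetical record: one patient with two target attributes."""
    treatments: FrozenSet[str]  # descriptive attributes used to define subgroups
    is_target: bool             # binary target: patient of the practitioner under investigation
    cost: float                 # cost target: money spent on this patient


def cost_difference_quality(
    examples: Sequence[Example],
    in_subgroup: Callable[[Example], bool],
) -> float:
    """Illustrative measure (an assumption, not taken from the paper).

    Returns (#covered positives) * (mean cost of covered positives
    - mean cost of covered negatives), or 0.0 if either group is empty.
    """
    covered = [e for e in examples if in_subgroup(e)]
    pos_costs = [e.cost for e in covered if e.is_target]
    neg_costs = [e.cost for e in covered if not e.is_target]
    if not pos_costs or not neg_costs:
        return 0.0
    return len(pos_costs) * (mean(pos_costs) - mean(neg_costs))


# Toy data, invented purely for illustration.
data = [
    Example(frozenset({"x"}), True, 100.0),
    Example(frozenset({"x"}), True, 140.0),
    Example(frozenset({"x"}), False, 80.0),
    Example(frozenset({"x"}), False, 100.0),
    Example(frozenset({"y"}), True, 50.0),
    Example(frozenset({"y"}), False, 60.0),
]

# Subgroup description: "patient received treatment x".
quality_x = cost_difference_quality(data, lambda e: "x" in e.treatments)
quality_y = cost_difference_quality(data, lambda e: "y" in e.treatments)
```

In this toy setting, a positive value suggests that the practitioner's patients in the subgroup cost more than comparable other patients, and the magnitude estimates the total excess. Restricting `examples` to a local reference group before calling the function would give a rough analogue of the local variant of the task.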
