Combining Subgroup Discovery and Clustering to Identify Diverse Subpopulations in Cohort Study Data

Subgroup discovery (SD) is most valuable in applications where the goal is to generate understandable models. Epidemiologists search for statistically significant relationships between risk factors and outcomes in large, heterogeneous datasets that describe the participants' health status with data gathered from questionnaires, medical examinations and image acquisition. SD algorithms can help epidemiologists by automatically detecting such relationships and presenting them as comprehensible rules, with the ultimate aim of improving the prevention, diagnosis and treatment of diseases. However, SD algorithms often produce large, overlapping rule sets, which forces the expert into a manual post-filtering step that is time-consuming and tedious. In this work, we propose a clustering-based algorithm that hierarchically reorganizes rule sets and summarizes all important concepts while maintaining diversity between the rule clusters. For each cluster, a representative rule is selected and displayed to the expert, who can in turn drill down to the other cluster members. We evaluate our algorithm on two cohort study datasets, in which the diseases hepatic steatosis and goiter serve as the respective target variables. We report our findings on the effectiveness of the algorithm and present selected subpopulations.
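The general idea of grouping overlapping rules and showing one representative per group can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the algorithm evaluated in this work: rules are compared by the Dice overlap of the participants they cover, merged with average-linkage agglomerative clustering up to a hypothetical `max_distance` threshold, and the rule with the highest quality score is taken as the cluster representative. The rule names, covers and quality values are invented toy data.

```python
from itertools import combinations


def dice_distance(a, b):
    """Return 1 minus the Dice coefficient of two sets of covered instance ids."""
    if not a and not b:
        return 0.0
    return 1.0 - 2.0 * len(a & b) / (len(a) + len(b))


def cluster_rules(covers, max_distance=0.5):
    """Average-linkage agglomerative clustering of rules by cover overlap.

    covers maps a rule name to the set of instance ids the rule covers.
    Merging stops once the closest pair of clusters is farther apart
    than max_distance (an assumed, illustrative threshold).
    """
    clusters = [[name] for name in covers]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        pairs = [(r1, r2) for r1 in c1 for r2 in c2]
        return sum(dice_distance(covers[r1], covers[r2]) for r1, r2 in pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        # i < j always holds, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def representatives(clusters, quality):
    """Pick the highest-quality rule of each cluster as its representative."""
    return [max(cluster, key=lambda rule: quality[rule]) for cluster in clusters]


# Toy data: four hypothetical rules and the participants each one covers.
covers = {
    "r1": {1, 2, 3, 4},
    "r2": {1, 2, 3, 5},
    "r3": {10, 11, 12},
    "r4": {10, 11, 13},
}
quality = {"r1": 0.8, "r2": 0.6, "r3": 0.5, "r4": 0.7}

clusters = cluster_rules(covers)
reps = representatives(clusters, quality)
```

On this toy input, the two heavily overlapping pairs end up in separate clusters, and only `r1` and `r4` would be shown to the expert first; the remaining members stay available for drill-down.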
