Identifying Relevant Features for a Multi-factorial Disorder with Constraint-Based Subspace Clustering

The identification of predictive features associated with distinct medical outcomes is a key requirement for meaningful clinical decision support. Such features are usually discovered from sets of labeled examples by analyzing how much information each feature carries with respect to the target variable. However, obtaining large sets of labeled examples may not be feasible, and considering the label alone could even dilute characteristics unique to distinct subgroups. In such cases, expert knowledge on the similarity between examples could be used instead of the value of the target variable. In this work we propose a new algorithm for the "Discovery of Relevant Example-constrained Subspaces" (DRESS), which uses constraints on the similarity between examples to discover feature sets that describe a target concept. DRESS exploits the density of clusters and the distance behavior between constrained examples to evaluate the quality of a feature set without requiring explicit information about the target variable. We evaluate DRESS against classical feature selection methods on cohort participants for the disorder "hepatic steatosis" and report our findings on classifier performance and the important features identified.
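
To make the idea of scoring a feature set from similarity constraints more concrete, the following is a minimal sketch, not the published DRESS procedure. It assumes that constraints are given as "must-link" pairs (examples an expert considers similar) and "cannot-link" pairs (examples considered dissimilar), and it reports two illustrative quantities for a candidate subspace: how much farther apart cannot-link pairs are than must-link pairs, and the fraction of examples that lie in dense regions. The function name `subspace_quality`, the parameters `eps` and `min_pts`, and the two measures themselves are assumptions made for illustration.

```python
import numpy as np


def pairwise_distances(Z):
    """Euclidean distance matrix for the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def subspace_quality(X, features, must_link, cannot_link, eps=0.5, min_pts=3):
    """Score a candidate feature subset without using class labels.

    Hypothetical illustration only. Returns a tuple:
      separation    : (mean cannot-link distance - mean must-link distance),
                      divided by the mean pairwise distance in the subspace
      core_fraction : share of examples with at least `min_pts` examples
                      (including themselves) within radius `eps`
    Both constraint lists are assumed to be non-empty.
    """
    Z = X[:, list(features)].astype(float)

    # Standardize each feature so that no single scale dominates the distance.
    std = Z.std(axis=0)
    std[std == 0] = 1.0
    Z = (Z - Z.mean(axis=0)) / std

    D = pairwise_distances(Z)
    n = Z.shape[0]
    mean_dist = D[np.triu_indices(n, k=1)].mean()

    ml = np.mean([D[i, j] for i, j in must_link])
    cl = np.mean([D[i, j] for i, j in cannot_link])
    separation = (cl - ml) / mean_dist

    # Density: an example counts as "dense" if enough neighbors are within eps.
    core_fraction = np.mean((D <= eps).sum(axis=1) >= min_pts)
    return separation, core_fraction


# Toy data: feature 0 separates two groups, feature 1 does not.
X = np.array([
    [0.0, 0.0], [0.1, 1.0], [0.2, 2.0], [0.1, 3.0],
    [5.0, 0.0], [5.1, 1.0], [5.2, 2.0], [5.1, 3.0],
])
must_link = [(0, 1), (4, 5)]      # pairs judged similar
cannot_link = [(0, 4), (1, 5)]    # pairs judged dissimilar

for subset in ([0], [1]):
    print(subset, subspace_quality(X, subset, must_link, cannot_link))
```

On this toy data the separation value is positive for feature 0 and negative for feature 1, so a search over feature subsets guided by such a score would prefer the first feature. How the two quantities should be combined, and how candidate subspaces are enumerated, is left open here.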
