Discovering, selecting and exploiting feature sequence records of study participants for the classification of epidemiological data on hepatic steatosis

In longitudinal epidemiological studies, participants undergo repeated medical examinations and are thus represented by a potentially large number of short sequences of examination outcomes. Some of these sequences may carry important information about the disease under study, for example in the form of patterns, while others may concern features of little relevance to the outcome. In this work, we propose a framework for Discovery, Selection and Exploitation (DiSelEx) of longitudinal epidemiological data, which aims to identify informative patterns among these sequences. DiSelEx combines sequence clustering with supervised learning to identify sequence groups that contribute to class separation. Newly derived and original features are evaluated and selected according to their redundancy and their informativeness with respect to the target variable. The selected feature set is then used to learn a classification model on the study data. We evaluate DiSelEx on cohort participants for the disorder "hepatic steatosis" and report how predictive performance changes when sequential data are used, compared with using the basic classifier alone.
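The discovery and selection steps described above can be illustrated with a minimal sketch. It is not the DiSelEx implementation: a simple trend-pattern grouping stands in for sequence clustering, mutual information with the target stands in for the informativeness measure, and symmetric uncertainty between features stands in for the redundancy measure. All function names, thresholds, feature names and the toy data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2

def trend_pattern(seq, tol=0.5):
    """Assumed stand-in for sequence clustering: one symbol per step
    (U = up, D = down, S = stable) forms the group label."""
    symbols = []
    for prev, curr in zip(seq, seq[1:]):
        delta = curr - prev
        symbols.append("U" if delta > tol else "D" if delta < -tol else "S")
    return "".join(symbols)

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two discrete label lists."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def symmetric_uncertainty(xs, ys):
    """Normalised mutual information in [0, 1], used here as redundancy."""
    denom = entropy(xs) + entropy(ys)
    return 2 * mutual_information(xs, ys) / denom if denom > 0 else 0.0

def select_features(derived, target, min_gain=0.1, max_redundancy=0.9):
    """Greedy selection: rank by informativeness, then skip candidates
    that are uninformative or redundant with an already selected feature."""
    ranked = sorted(derived,
                    key=lambda name: mutual_information(derived[name], target),
                    reverse=True)
    selected = []
    for name in ranked:
        if mutual_information(derived[name], target) < min_gain:
            continue  # too little information about the target
        if any(symmetric_uncertainty(derived[name], derived[s]) > max_redundancy
               for s in selected):
            continue  # nearly duplicates a feature that is already kept
        selected.append(name)
    return selected

# Hypothetical toy cohort: three examination waves per feature, six participants.
sequences = {
    "bmi":    [[24, 26, 29], [25, 27, 30], [23, 23, 23],
               [22, 22, 22], [26, 28, 31], [24, 24, 24]],
    "waist":  [[80, 85, 90], [82, 86, 91], [78, 78, 78],
               [76, 76, 76], [84, 88, 93], [79, 79, 79]],
    "height": [[170, 170, 170], [165, 165, 165], [180, 180, 180],
               [172, 172, 172], [168, 168, 168], [175, 175, 175]],
}
target = [1, 1, 0, 0, 1, 0]  # hypothetical class labels

# Discovery: derive one group label per participant and feature.
derived = {name: [trend_pattern(s) for s in seqs]
           for name, seqs in sequences.items()}
# Selection: keep informative, non-redundant derived features.
selected = select_features(derived, target)
print(selected)
```

In this toy data, the "waist" groups coincide with the "bmi" groups and are therefore treated as redundant, while "height" never changes and carries no information about the target. The selected group labels could then be passed, together with the original features, to any classifier for the exploitation step.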
