Impact of Density of Lab Data in EHR for Prediction of Potentially Preventable Events

This paper analyzes sparse and incomplete Electronic Health Record (EHR) data for predicting which patients are at risk of Potentially Preventable Events (PPEs). PPEs are admissions, readmissions, complications, and emergency department visits that could have been avoided had the patient received appropriate interventions. Machine learning techniques have made PPE detection less difficult, but the task remains challenging because EHR data are sparse and incomplete. It is therefore important to investigate the factors that affect PPE prediction in EHR data. In this paper we define the term density to quantify the sparse and incomplete nature of an EHR data set. We analyze three factors that affect PPE prediction in sparse and incomplete EHR data: 1) the effect of varying the amount of domain information in patient records, 2) the impact of rank aggregation based feature selection, a popular data mining pre-processing technique, and 3) the effect of ensemble classification. The results indicate that rank aggregation based feature selection and ensemble classification improve classification accuracy by approximately 3-4% despite the sparse and incomplete nature of the data. However, eliminating patient records with less domain information in order to reduce incompleteness does not improve classification accuracy. We conclude that feature selection and ensemble classification are important factors affecting classification accuracy even in sparse and incomplete data sets, and that randomly decreasing domain information by varying lab values does not help increase the accuracy of PPE prediction.
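To make the idea of rank aggregation based feature selection concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which aggregation rule is used, so a Borda count is assumed here; the function names `borda_aggregate` and `select_top_k`, the lab feature names, and the three input rankings are all hypothetical.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Merge several feature rankings (best feature first) with a Borda count.

    Assumption: every ranking orders the same set of features. A feature at
    position p in a ranking of n features receives n - p points.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, feature in enumerate(ranking):
            scores[feature] += n - position
    # Highest total score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


def select_top_k(rankings, k):
    """Keep the k features with the best aggregated rank."""
    return borda_aggregate(rankings)[:k]


# Hypothetical rankings of lab features, e.g. produced by three different
# feature-scoring methods run on the same EHR data set.
rankings = [
    ["glucose", "creatinine", "hemoglobin", "sodium"],
    ["creatinine", "glucose", "sodium", "hemoglobin"],
    ["glucose", "hemoglobin", "creatinine", "sodium"],
]

selected = select_top_k(rankings, 2)
print(selected)
```

Aggregating several rankings in this way is meant to reduce dependence on any single scoring method, which may matter when many lab values are missing; the selected features would then be passed to the classifiers.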
