Identifying high-cost patients using data mining techniques and a small set of non-trivial attributes

In this paper, we use data mining techniques, namely neural networks and decision trees, to build predictive models that identify very high-cost patients, those in the top 5 percent of the general population. Our study used a large empirical dataset of 98,175 records from the Medical Expenditure Panel Survey. After pre-processing, partitioning, and balancing the data, we modeled the refined dataset of 31,704 records with decision trees (C5.0 and CHAID) and neural networks. We analyze model performance using several measures, including accuracy, G-mean, and area under the ROC curve (AUC). We conclude that the CHAID classifier returns the best G-mean and AUC measures, with the top-performing predictive models ranging from 76% to 85% for G-mean and from 0.812 to 0.942 for AUC. We also identify a small set of 5 non-trivial attributes, drawn from a primary set of 66 attributes, that identify the top 5% of the high-cost population. These attributes are the individual's overall health perception, age, history of blood cholesterol check, history of physical/sensory/mental limitations, and history of colonic prevention measures. We call this small set non-trivial because it does not include visits to care providers, doctors, or hospitals, which are highly correlated with expenditures and do not offer new insight into the data. The results of this study can be used by healthcare data analysts, policy makers, insurers, and healthcare planners to improve the delivery of health services.
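To illustrate why G-mean is reported alongside accuracy when only about 5% of patients are high-cost, the following is a minimal sketch rather than the study's implementation. It assumes the standard definition of G-mean as the square root of sensitivity times specificity; the function names and the toy labels are hypothetical and are not taken from the MEPS data.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels where 1 = high-cost."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp

def accuracy(y_true, y_pred):
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity (assumed definition)."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy population: 5 high-cost patients out of 100.
y_true = [1] * 5 + [0] * 95

# A classifier that labels everyone low-cost: high accuracy, zero G-mean.
always_low = [0] * 100

# A classifier that finds 4 of the 5 high-cost patients
# but also flags 19 of the 95 low-cost patients.
balanced = [1, 1, 1, 1, 0] + [1] * 19 + [0] * 76

for name, y_pred in [("always_low", always_low), ("balanced", balanced)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(g_mean(y_true, y_pred), 3))
```

In this toy case the trivial classifier scores higher on accuracy but has a G-mean of zero because it never detects a high-cost patient, which is the kind of behavior a class-sensitive measure is meant to expose.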