A comparison of models to predict medical procedure costs from open public healthcare data

In our earlier work, we presented BOAT, a big-data open source analytics toolkit and framework, and applied it to analyze trends and outliers in public healthcare data. In this paper, we extend this framework to predict the costs that patients could incur for different medical procedures, based on open healthcare data. Specifically, we analyze de-identified patient data from the New York State SPARCS (Statewide Planning and Research Cooperative System), consisting of more than 2 million records. We investigated three model classes: multiple linear regression, regression trees, and deep neural networks (DNNs). We conducted a grid search to identify the best parameter choices. Based on grid-search cross-validation with the widely used R² metric, the best performance was achieved by an 8-layer DNN with layer sizes 5x5x10x25x25x10x5x5, trained using an Adam optimizer with a learning rate of 0.01. We obtained an R² value of 0.71, which is better than the values reported in the literature for similar problems.
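The model selection procedure described above can be illustrated with a minimal sketch. It assumes scikit-learn's `MLPRegressor` and `GridSearchCV` as stand-ins for the DNN and the grid search; the abstract does not state which library calls were used, so this is not the authors' implementation. The synthetic data, the helper names `make_synthetic_data` and `run_grid_search`, the alternative `(25, 25)` architecture, the second learning rate, and the 3-fold split are all hypothetical choices made for illustration. Only the 5x5x10x25x25x10x5x5 layer sizes, the Adam optimizer, the 0.01 learning rate, and the R² scoring come from the text.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_synthetic_data(n_samples=400, n_features=6, seed=0):
    """Hypothetical stand-in for the SPARCS records: random features
    and a noisy linear 'cost' target. Not real healthcare data."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=n_samples)
    return X, y


def run_grid_search(X, y):
    """Grid search with cross-validation, scored by R^2."""
    # Standardize inputs, then fit a feed-forward network with Adam.
    pipeline = make_pipeline(
        StandardScaler(),
        MLPRegressor(solver="adam", max_iter=300, random_state=0),
    )
    param_grid = {
        # First entry: the 8-layer architecture named in the abstract.
        # Second entry: an assumed smaller alternative for comparison.
        "mlpregressor__hidden_layer_sizes": [
            (5, 5, 10, 25, 25, 10, 5, 5),
            (25, 25),
        ],
        # 0.01 is from the abstract; 0.001 is an assumed alternative.
        "mlpregressor__learning_rate_init": [0.01, 0.001],
    }
    # cv=3 is an assumption; the abstract does not give the fold count.
    search = GridSearchCV(pipeline, param_grid, scoring="r2", cv=3)
    search.fit(X, y)
    return search


if __name__ == "__main__":
    X, y = make_synthetic_data()
    search = run_grid_search(X, y)
    print("Best parameters:", search.best_params_)
    print("Best mean cross-validated R^2:", search.best_score_)
```

On real data, the same pattern would be repeated for the linear regression and regression tree model classes, so that all candidates are compared on the same cross-validated R² score.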
