Dynamic distributed predictive learning models that preserve privacy for hospitals with insufficient labeled data

A prediction model built dynamically from patient data held at multiple hospitals can serve as a source of suggestive knowledge in clinical decision support. A tool that accepts queries based on attributes of interest helps build a targeted model from multiple hospitals when a local clinical data repository does not hold a sufficient number of records to draw conclusions from. However, because of privacy concerns and legal ramifications, hospitals are reluctant to divulge raw medical records. Mechanisms that build distributed prediction models using only statistics of patient data are therefore attractive. The distributed ID3-based decision tree (DIDT) algorithm is one such prediction model builder. In this study, we analyzed 3 years of National Inpatient Sample data and demonstrated that DIDT can help hospitals collaboratively build better predictive models when they individually have too few records for good local models. Using 261 attributes for model building, we showed that collaborating hospitals with fewer than 100 hospitalizations for a targeted disease achieved a good improvement in accuracy when predicting hospitalization collectively with a distributed model, compared with their local models. When hospitals relied on local models to predict risks for sample diseases, more patients were misclassified and some local patients could not be classified at all. Our collaborative model effectively reduced misclassification, providing accurate early diagnostics to additional patients. We also explored the profiles of hospitals with a sufficiently large number of patient records to identify local models with specific characteristics that can serve the needs of hospitals with insufficient data.
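The idea of choosing a decision tree split from shared statistics rather than raw records can be illustrated with a minimal Python sketch. It is not the DIDT implementation from the study: the function names, the toy records, and the attribute names (`age_group`, `diabetic`, `hospitalized`) are all hypothetical, and the sketch covers only a single ID3 split. Each hospital reports counts of (attribute, value, class) combinations, and a coordinator pools those counts to compute information gain.

```python
from collections import Counter
from math import log2

def local_counts(records, attributes, target):
    """Run at each hospital: return only (attribute, value, class) counts."""
    counts = Counter()
    for row in records:
        for attr in attributes:
            counts[(attr, row[attr], row[target])] += 1
    return counts

def entropy(class_counts):
    """Shannon entropy (in bits) of a collection of class counts."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)

def best_split(pooled, attributes):
    """Run at the coordinator: pick the attribute with the highest information gain."""
    gains = {}
    for attr in attributes:
        by_value = {}          # attribute value -> Counter of class labels
        by_class = Counter()   # overall class distribution
        for (a, value, label), n in pooled.items():
            if a != attr:
                continue
            by_value.setdefault(value, Counter())[label] += n
            by_class[label] += n
        total = sum(by_class.values())
        remainder = sum(
            sum(c.values()) / total * entropy(c.values())
            for c in by_value.values()
        )
        gains[attr] = entropy(by_class.values()) - remainder
    return max(gains, key=gains.get), gains

# Hypothetical toy records; real inputs would be each hospital's local data.
hospital_a = [
    {"age_group": "old", "diabetic": "yes", "hospitalized": "yes"},
    {"age_group": "young", "diabetic": "yes", "hospitalized": "no"},
]
hospital_b = [
    {"age_group": "old", "diabetic": "no", "hospitalized": "yes"},
    {"age_group": "young", "diabetic": "no", "hospitalized": "no"},
]
attributes = ["age_group", "diabetic"]

# Only the count tables leave each hospital; the coordinator sums them.
pooled = (local_counts(hospital_a, attributes, "hospitalized")
          + local_counts(hospital_b, attributes, "hospitalized"))
best, gains = best_split(pooled, attributes)
print(best, gains)
```

In this toy case neither hospital alone has enough rows to separate the two attributes, while the pooled counts single out `age_group`. Note that sharing counts is not by itself a privacy guarantee; small cell counts can still be revealing, which this sketch does not address.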
