A Data and Knowledge Driven Randomization Technique for Privacy-Preserving Data Enrichment in Hospital Readmission Prediction

In health care predictive analytics, limited data is often a major obstacle to developing highly accurate predictive models. The shortage stems from several factors: the small number of cases available (as with rare diseases), the cost of data collection, and privacy regulations governing patient data. To enable data enrichment within and between hospitals while preserving privacy, we propose a system that adds a randomization component on top of existing anonymization techniques. To prevent the information loss that randomization can cause (including loss of predictive accuracy), we propose a data generation technique that fuses domain knowledge with data-driven techniques as complementary information sources. This fusion allows additional examples to be generated by controlled randomization and strengthens the protection of personally sensitive information when data is shared between sites. An initial evaluation was conducted on Electronic Health Records (EHRs) for 30-day hospital readmission prediction, using pediatric hospital discharge data from 5 hospitals in California. The results show that, besides ensuring privacy, this approach preserves (and in some cases even improves) predictive accuracy.
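The abstract does not spell out the randomization mechanism, so the following is only a minimal sketch of what knowledge-guided controlled randomization might look like, not the authors' method. It assumes that the domain knowledge is a hierarchy of diagnosis codes and that a virtual example is produced by swapping a code for a sibling under the same parent category. The toy hierarchy, the `swap_prob` parameter, and the function names are all hypothetical.

```python
import random

# Hypothetical toy hierarchy (an assumption, not from the paper):
# parent category -> sibling diagnosis codes.
HIERARCHY = {
    "respiratory": ["asthma", "pneumonia", "bronchiolitis"],
    "digestive": ["appendicitis", "gastroenteritis"],
}
# Reverse lookup: diagnosis code -> parent category.
PARENT = {code: parent for parent, codes in HIERARCHY.items() for code in codes}


def randomize_record(record, swap_prob, rng):
    """Return a virtual copy of `record` (a list of diagnosis codes).

    Each code is replaced, with probability `swap_prob`, by a randomly
    chosen sibling from the same parent category, so the coarse clinical
    meaning is kept while the exact code is perturbed.
    """
    virtual = []
    for code in record:
        siblings = [c for c in HIERARCHY[PARENT[code]] if c != code]
        if siblings and rng.random() < swap_prob:
            virtual.append(rng.choice(siblings))
        else:
            virtual.append(code)
    return virtual


def enrich(records, copies, swap_prob, seed=0):
    """Generate `copies` randomized virtual records per original record."""
    rng = random.Random(seed)
    return [
        randomize_record(record, swap_prob, rng)
        for record in records
        for _ in range(copies)
    ]


records = [["asthma", "gastroenteritis"], ["pneumonia"]]
virtual = enrich(records, copies=3, swap_prob=0.5)
```

In this sketch `swap_prob` is the control knob: at 0 the virtual records are exact copies of the originals, and at 1 every code is replaced, while the parent category of each code is always preserved.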
