SHARE: system design and case studies for statistical health information release

OBJECTIVES We present SHARE, a new system for statistical health information release with differential privacy. We present two case studies that evaluate the software on real medical datasets and demonstrate the feasibility and utility of applying the differential privacy framework on biomedical data. MATERIALS AND METHODS SHARE releases statistical information in electronic health records with differential privacy, a strong privacy framework for statistical data release. It includes a number of state-of-the-art methods for releasing multidimensional histograms and longitudinal patterns. We performed a variety of experiments on two real datasets, the surveillance, epidemiology and end results (SEER) breast cancer dataset and the Emory electronic medical record (EeMR) dataset, to demonstrate the feasibility and utility of SHARE. RESULTS Experimental results indicate that SHARE can deal with heterogeneous data present in medical data, and that the released statistics are useful. The Kullback-Leibler divergence between the released multidimensional histograms and the original data distribution is below 0.5 and 0.01 for seven-dimensional and three-dimensional data cubes generated from the SEER dataset, respectively. The relative error for longitudinal pattern queries on the EeMR dataset varies between 0 and 0.3. While the results are promising, they also suggest that challenges remain in applying statistical data release using the differential privacy framework for higher dimensional data. CONCLUSIONS SHARE is one of the first systems to provide a mechanism for custodians to release differentially private aggregate statistics for a variety of use cases in the medical domain. This proof-of-concept system is intended to be applied to large-scale medical data warehouses.

[1]  Peter J. Haas,et al.  Improved histograms for selectivity estimation of range predicates , 1996, SIGMOD '96.

[2]  L. Sweeney Replacing personally-identifying information in medical records, the Scrub system. , 1996, Proceedings : a conference of the American Medical Informatics Association. AMIA Fall Symposium.

[3]  Lucila Ohno-Machado,et al.  Using Boolean reasoning to anonymize databases , 1999, Artif. Intell. Medicine.

[4]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[5]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[6]  Lucila Ohno-Machado,et al.  Protecting patient privacy by quantifiable control of disclosures in disseminated databases , 2004, Int. J. Medical Informatics.

[7]  Bradley Malin,et al.  Technical Evaluation: An Evaluation of the Current State of Genomic Data Privacy Protection Technology and a Roadmap for the Future , 2004, J. Am. Medical Informatics Assoc..

[8]  Cynthia Dwork,et al.  Differential Privacy , 2006, ICALP.

[9]  K. El Emam,et al.  Evaluating Common De-Identification Heuristics for Personal Health Information , 2006, Journal of medical Internet research.

[10]  Róbert Busa-Fekete,et al.  State-of-the-art anonymization of medical records using an iterative machine learning framework. , 2007 .

[11]  Peter Szolovits,et al.  Evaluating the state-of-the-art in automatic de-identification. , 2007, Journal of the American Medical Informatics Association : JAMIA.

[12]  Cynthia Dwork,et al.  Differential Privacy: A Survey of Results , 2008, TAMC.

[13]  Li Xiong,et al.  HIDE: An Integrated System for Health Information DE-identification , 2008, 2008 21st IEEE International Symposium on Computer-Based Medical Systems.

[14]  James J. Lu,et al.  FRIL: A Tool for Comparative Record Linkage , 2008, AMIA.

[15]  Herbert S. Lin,et al.  Computational Technology for Effective Health Care: Immediate Steps and Strategic Directions , 2009 .

[16]  Frank McSherry,et al.  Privacy integrated queries: an extensible platform for privacy-preserving data analysis , 2009, SIGMOD Conference.

[17]  Moni Naor,et al.  On the complexity of differentially private data release: efficient algorithms and hardness results , 2009, STOC '09.

[18]  Benjamin C. M. Fung,et al.  Anonymizing healthcare data: a case study on the blood transfusion service , 2009, KDD.

[19]  Li Xiong,et al.  An integrated framework for de-identifying unstructured medical data , 2009, Data Knowl. Eng..

[20]  James J. Lu,et al.  HIDE: heterogeneous information DE-identification , 2009, EDBT '09.

[21]  Cristina Nita-Rotaru,et al.  A survey of attack and defense techniques for reputation systems , 2009, CSUR.

[22]  Laura A. Levit,et al.  Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. Washington, DC: National Academies Press , 2009 .

[23]  Khaled El Emam,et al.  Model Formulation: Evaluating Predictors of Geographic Area Population Size Cut-offs to Manage Re-identification Risk , 2009, J. Am. Medical Informatics Assoc..

[24]  Philip S. Yu,et al.  Privacy-preserving data publishing: A survey of recent developments , 2010, CSUR.

[25]  Dan Suciu,et al.  Boosting the accuracy of differentially private histograms through consistency , 2009, Proc. VLDB Endow..

[26]  S. Meystre,et al.  Automatic de-identification of textual documents in the electronic health record: a review of recent research , 2010, BMC medical research methodology.

[27]  Chun Yuan,et al.  Differentially Private Data Release through Multidimensional Partitioning , 2010, Secure Data Management.

[28]  Bradley Malin,et al.  Evaluating re-identification risks with respect to the HIPAA privacy rule , 2010, J. Am. Medical Informatics Assoc..

[29]  Joel H. Saltz,et al.  An evaluation of feature sets and sampling techniques for de-identification of medical records , 2010, IHI.

[30]  C. Dwork A firm foundation for private data analysis , 2011, Commun. ACM.

[31]  Bradley Malin,et al.  Never too old for anonymity: a statistical standard for demographic data sharing via the HIPAA Privacy Rule , 2011, J. Am. Medical Informatics Assoc..

[32]  Khaled El Emam,et al.  De-identifying a public use microdata file from the Canadian national discharge abstract database , 2011, BMC Medical Informatics Decis. Mak..

[33]  Benjamin C. M. Fung,et al.  Publishing set-valued data via differential privacy , 2011, Proc. VLDB Endow..

[34]  B. Malin,et al.  Correction: A Systematic Review of Re-Identification Attacks on Health Data , 2015, PloS one.

[35]  Anand D. Sarwate,et al.  Protecting count queries in study design , 2012, J. Am. Medical Informatics Assoc..

[36]  Jihoon Kim,et al.  Grid Binary LOgistic REgression (GLORE): building shared models without sharing data , 2012, J. Am. Medical Informatics Assoc..

[37]  Jihoon Kim,et al.  iDASH: integrating data for analysis, anonymization, and sharing , 2012, J. Am. Medical Informatics Assoc..

[38]  Divesh Srivastava,et al.  Differentially Private Spatial Decompositions , 2011, 2012 IEEE 28th International Conference on Data Engineering.

[39]  Benjamin C. M. Fung,et al.  Differentially private transit data publication: a case study on the montreal transportation system , 2012, KDD.

[40]  Luk Arbuckle,et al.  El Emam Et Al.: the De‐identification of the Heritage Health Prize Claims Data Set Multimedia Appendix Multimedia Appendix 1 Truncation of Claims 2 Removal of High Risk Patients , 2022 .

[41]  Li Xiong,et al.  DPCube: Releasing Differentially Private Data Cubes for Health Information , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[42]  Khaled El Emam,et al.  The application of differential privacy to health data , 2012, EDBT-ICDT '12.

[43]  Haoran Li,et al.  DPCube: Differentially Private Histogram Release through Multidimensional Partitioning , 2012, Trans. Data Priv..