Disassociation for electronic health record privacy

Disseminating Electronic Health Record (EHR) data beyond the originating healthcare institutions can enable large-scale, low-cost medical studies that have the potential to improve public health. For this reason, funding bodies such as the National Institutes of Health (NIH) in the U.S. encourage or require the dissemination of EHR data, and a growing number of innovative medical investigations are being performed using such data. However, simply disseminating EHR data after removing identifying information may put privacy at risk, because patients can still be linked to their records based on diagnosis codes. This paper proposes the first approach that prevents this type of data linkage using disassociation, an operation that transforms records by splitting them into carefully selected subsets. Our approach preserves privacy with significantly lower data utility loss than existing methods and, unlike those methods, does not require data owners to specify the diagnosis codes that may lead to identity disclosure. Consequently, it can be employed when data need to be shared broadly and used in studies beyond those originally intended. Through extensive experiments using EHR data, we demonstrate that our method can construct data that are highly useful for supporting various types of clinical case count studies and general medical analysis tasks.
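
To make the idea of splitting records into subsets concrete, the following is a minimal toy sketch in Python, not the algorithm proposed in the paper. It assumes one already-formed group of patient records, each a set of diagnosis codes, and greedily places codes into chunks so that, inside each chunk, every combination of up to `m` codes that appears at all appears in at least `k` records; codes that are too rare go into a separate leftover set. The function names, the parameters `k` and `m`, the greedy ordering, and the sample codes are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def is_km_anonymous(projections, k, m):
    """True if every combination of up to m codes that occurs in some
    projection occurs in at least k projections."""
    counts = Counter()
    for proj in projections:
        for size in range(1, m + 1):
            for combo in combinations(sorted(proj), size):
                counts[combo] += 1
    return all(count >= k for count in counts.values())


def disassociate_cluster(records, k, m):
    """Toy vertical split of one group of records (hypothetical helper).

    Returns (record_chunks, leftover_codes). Each record chunk lists the
    sub-records restricted to one set of codes.
    """
    support = Counter(code for rec in records for code in set(rec))
    # Most frequent codes first; ties broken by code for determinism.
    ordered = sorted(support, key=lambda c: (-support[c], c))

    chunks = []       # each chunk is a set of codes published together
    leftover = set()  # codes too rare to appear in any chunk
    for code in ordered:
        if support[code] < k:
            leftover.add(code)
            continue
        for chunk in chunks:
            candidate = chunk | {code}
            projections = [set(rec) & candidate for rec in records]
            if is_km_anonymous(projections, k, m):
                chunk.add(code)
                break
        else:
            # No existing chunk can take the code safely: start a new one.
            chunks.append({code})

    # Sub-records are sorted so that row order does not reveal which
    # sub-records in different chunks came from the same patient.
    record_chunks = [
        sorted(sorted(set(rec) & chunk) for rec in records if set(rec) & chunk)
        for chunk in chunks
    ]
    return record_chunks, sorted(leftover)


# Made-up records with ICD-9-style diagnosis codes.
records = [
    {"250.00", "401.9"},
    {"250.00", "401.9"},
    {"250.00", "272.4"},
    {"401.9", "272.4"},
    {"250.00", "401.9", "296.80"},
]
record_chunks, leftover = disassociate_cluster(records, k=2, m=2)
for chunk in record_chunks:
    print(chunk)
print("leftover:", leftover)
```

In this toy input, the pair of codes 250.00 and 272.4 occurs in only one record, so the sketch places 272.4 in a different chunk from 250.00. The published chunks then no longer show which patient had both codes, while every individual code is kept without generalization or suppression.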
