Anonymization of Longitudinal Electronic Medical Records

Electronic medical record (EMR) systems have enabled healthcare providers to collect detailed patient information from the primary care domain. At the same time, longitudinal data from EMRs are increasingly combined with biorepositories to generate personalized clinical decision support protocols. Emerging policies encourage investigators to disseminate such data in a deidentified form for reuse and collaboration, but organizations are hesitant to do so because they fear that sharing will jeopardize patient privacy. In particular, there are concerns that residual demographic and clinical features could be exploited to reidentify patients. Various approaches have been developed to anonymize clinical data, but they neglect temporal information and are thus insufficient for emerging biomedical research paradigms. This paper proposes a novel approach to sharing patient-specific longitudinal data that offers robust privacy guarantees while preserving data utility for many biomedical investigations. Our approach aggregates temporal and diagnostic information using heuristics inspired by sequence alignment and clustering methods. Using several patient cohorts derived from the EMR system of the Vanderbilt University Medical Center, we demonstrate that the proposed approach can generate anonymized data that permit effective biomedical analysis.
