Toward smarter healthcare: Anonymizing medical data to support research studies

Healthcare is a major industry in IBM's Smarter Planet initiative and a key area where analytics can have a substantial impact by improving disease prediction and treatment. To facilitate healthcare analytics, patient data usually need to be widely disseminated. Wide dissemination, however, risks disclosing private and sensitive patient information. In this paper, we illustrate the importance of preserving medical data privacy and show why several popular privacy-preserving techniques are ill-suited to structured medical data. Subsequently, we review a privacy-preserving approach for the dissemination of patient records. This approach involves patient record de-identification, anonymization of the diagnosis codes contained in the records, and a method for balancing data utility with privacy. The approach is practical in that it allows healthcare data providers to specify fine-grained privacy and utility requirements, and it can construct anonymized data with a desired balance between utility and privacy. The effectiveness of the approach is demonstrated through a case study using electronic medical records. We conclude this paper with a roadmap for future trends in medical data privacy.
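
As a rough illustration of what a fine-grained privacy requirement on diagnosis codes might look like, the sketch below checks a set of de-identified records against a list of code combinations that a data provider wants protected. It is not the method reviewed in this paper; it only assumes a k-anonymity-style rule in which every protected combination that appears in the data must appear in at least k records. The function name `violated_constraints`, the sample records, and the choice of codes are all hypothetical.

```python
from typing import FrozenSet, List, Set


def violated_constraints(
    records: List[Set[str]],
    constraints: List[FrozenSet[str]],
    k: int,
) -> List[FrozenSet[str]]:
    """Return the protected code combinations supported by fewer than k records.

    Assumption (not taken from the paper): a combination is considered safe
    if it appears in no record at all or in at least k records.
    """
    violations = []
    for constraint in constraints:
        # Support = number of records containing every code in the combination.
        support = sum(1 for codes in records if constraint <= codes)
        if 0 < support < k:
            violations.append(constraint)
    return violations


# Hypothetical de-identified records: one set of diagnosis codes per patient.
records = [
    {"250.00", "401.9"},
    {"250.00", "401.9", "272.4"},
    {"250.00", "272.4"},
    {"493.90"},
]

# Hypothetical privacy requirements specified by the data provider.
constraints = [
    frozenset({"250.00", "401.9"}),
    frozenset({"493.90"}),
    frozenset({"250.00"}),
]

risky = violated_constraints(records, constraints, k=2)
```

With k = 2, only the combination containing the single code "493.90" is flagged, because exactly one record supports it. An anonymization step would then have to generalize or suppress codes until no flagged combinations remain, and the amount of change it introduces is where the trade-off with data utility arises.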
