A Survey of Anonymization Algorithms for Electronic Health Records

Electronic Health Records (EHRs) contain various types of structured patient data, such as diagnoses, laboratory results, active medications, and allergies, which are increasingly shared to support a wide spectrum of medical analyses. To protect patient privacy, EHR data must be anonymized before being shared. Anonymization prevents the re-identification of patients and/or the inference of their sensitive information, and it can be performed using several recently proposed algorithms. In this chapter, we survey popular anonymization algorithms for EHR data and explain their objectives, as well as the main aspects of their operation. We then present several promising directions for future research in this area.
