Overview of Patient Data Anonymization

This chapter presents an overview of anonymization techniques that can be used to protect different types of patient data. We first discuss anonymization principles that can prevent identity and sensitive information disclosure in demographic data publishing, as well as algorithms for enforcing these principles, in Section.1. Subsequently, in Section 2.2, we turn our attention to diagnosis code anonymization, which, somewhat surprisingly, has not been considered by the medical informatics community. After motivating the need for anonymizing diagnosis codes, we present several related principles and transformation models that have been proposed by the data management community. We then review anonymization algorithms, detailing how these principles and algorithms are utilized by them. Last, in Section 2.3, we examine genomic data, which are also susceptible to privacy attacks.

[1]  Nabil R. Adam,et al.  Security-control methods for statistical databases: a comparative study , 1989, ACM Comput. Surv..

[2]  Kyuseok Shim,et al.  Approximate algorithms for K-anonymity , 2007, SIGMOD '07.

[3]  Chris Clifton,et al.  Thoughts on k-Anonymization , 2006, 22nd International Conference on Data Engineering Workshops (ICDEW'06).

[4]  E. Zerhouni,et al.  Protecting Aggregate Genomic Data , 2008, Science.

[5]  Stephen E. Fienberg,et al.  Privacy Preserving GWAS Data Sharing , 2011, 2011 IEEE 11th International Conference on Data Mining Workshops.

[6]  Charu C. Aggarwal,et al.  On k-Anonymity and the Curse of Dimensionality , 2005, VLDB.

[7]  Philip S. Yu,et al.  Privacy-preserving data publishing: A survey of recent developments , 2010, CSUR.

[8]  D. DeWitt,et al.  K-Anonymization as Spatial Indexing: Toward Scalable and Incremental Anonymization , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[9]  Haixu Tang,et al.  Learning your identity and disease from research papers: information leaks in genome wide association study , 2009, CCS.

[10]  Panos Kalnis,et al.  Privacy-preserving anonymization of set-valued data , 2008, Proc. VLDB Endow..

[11]  David J. DeWitt,et al.  Workload-aware anonymization , 2006, KDD '06.

[12]  David J. DeWitt,et al.  Incognito: efficient full-domain K-anonymity , 2005, SIGMOD '05.

[13]  Ramakrishnan Srikant,et al.  Fast Algorithms for Mining Association Rules in Large Databases , 1994, VLDB.

[14]  Richard W. Hamming,et al.  Coding and Information Theory , 2018, Feynman Lectures on Computation.

[15]  Adam Meyerson,et al.  On the complexity of optimal K-anonymity , 2004, PODS.

[16]  Rajeev Motwani,et al.  Approximation Algorithms for k-Anonymity , 2005 .

[17]  Grigorios Loukides,et al.  Preventing range disclosure in k-anonymised data , 2011, Expert Syst. Appl..

[18]  Yufei Tao,et al.  Personalized privacy preservation , 2006, Privacy-Preserving Data Mining.

[19]  Roberto J. Bayardo,et al.  Data privacy through optimal k-anonymization , 2005, 21st International Conference on Data Engineering (ICDE'05).

[20]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1977, TOMS.

[21]  David J. DeWitt,et al.  Mondrian Multidimensional K-Anonymity , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[22]  Khaled El Emam,et al.  Protecting privacy using k-anonymity. , 2008, Journal of the American Medical Informatics Association : JAMIA.

[23]  Josep Domingo-Ferrer,et al.  Ordinal, Continuous and Heterogeneous k-Anonymity Through Microaggregation , 2005, Data Mining and Knowledge Discovery.

[24]  Mark A. Rothstein,et al.  Ethical and legal implications of pharmacogenomics , 2001, Nature Reviews Genetics.

[25]  Sushil Jajodia,et al.  The inference problem: a survey , 2002, SKDD.

[26]  Elisa Bertino,et al.  Efficient k -Anonymization Using Clustering Techniques , 2007, DASFAA.

[27]  Lucila Ohno-Machado,et al.  Effects of Data Anonymization by Cell Suppression on Descriptive Statistics and Predictive Modeling Performance , 2002, J. Am. Medical Informatics Assoc..

[28]  E. Clayton,et al.  Identifiability in biobanks: models, measures, and mitigation strategies , 2011, Human Genetics.

[29]  Philip S. Yu,et al.  Anonymizing transaction databases for publication , 2008, KDD.

[30]  K. Emam Methods for the de-identification of electronic health records for genomic research , 2011, Genome Medicine.

[31]  Raymond Chi-Wing Wong,et al.  Achieving k-Anonymity by Clustering in Attribute Hierarchical Structures , 2006, DaWaK.

[32]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[33]  Grigorios Loukides,et al.  Capturing data usefulness and privacy protection in K-anonymisation , 2007, SAC '07.

[34]  Ninghui Li,et al.  Towards optimal k-anonymization , 2008, Data Knowl. Eng..

[35]  Bradley Malin,et al.  COAT: COnstraint-based anonymization of transactions , 2010, Knowledge and Information Systems.

[36]  Qing Zhang,et al.  Aggregate Query Answering on Anonymized Tables , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[37]  Vijay S. Iyengar,et al.  Transforming data to satisfy privacy constraints , 2002, KDD.

[38]  Raymond Chi-Wing Wong,et al.  (α, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing , 2006, KDD '06.

[39]  Jian Pei,et al.  Utility-based anonymization using local recoding , 2006, KDD '06.

[40]  B. Malin,et al.  Anonymization of electronic medical records for validating genome-wide association studies , 2010, Proceedings of the National Academy of Sciences.

[41]  Russ B Altman,et al.  Confidentiality in Genome Research , 2006, Science.

[42]  Alina Campan,et al.  Generating Microdata with P -Sensitive K -Anonymity Property , 2007, Secure Data Management.

[43]  Jian Pei,et al.  Mining frequent patterns without candidate generation , 2000, SIGMOD 2000.

[44]  Aris Gkoulalas-Divanis,et al.  Anonymizing Transaction Data to Eliminate Sensitive Inferences , 2010, DEXA.

[45]  Á. Carracedo,et al.  Inferring ancestral origin using a single multiplex assay of ancestry-informative marker SNPs. , 2007, Forensic science international. Genetics.

[46]  Jeffrey F. Naughton,et al.  Anonymization of Set-Valued Data via Top-Down, Local Generalization , 2009, Proc. VLDB Endow..

[47]  Christopher A Cassa,et al.  My sister's keeper?: genomic research and the identifiability of siblings , 2008, BMC Medical Genomics.

[48]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[49]  Raghu Ramakrishnan,et al.  Privacy Skyline: Privacy with Multidimensional Adversarial Knowledge , 2007, VLDB.

[50]  Jinghui Zhang,et al.  Needles in the Haystack: Identifying Individuals Present in Pooled Genomic Data , 2009, PLoS genetics.

[51]  S. Nelson,et al.  Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays , 2008, PLoS genetics.

[52]  Bo Peng,et al.  To Release or Not to Release: Evaluating Information Leaks in Aggregate Human-Genome Data , 2011, ESORICS.

[53]  Panos Kalnis,et al.  SABRE: a Sensitive Attribute Bucketization and REdistribution framework for t-closeness , 2011, The VLDB Journal.

[54]  J E Rogers,et al.  Quality Assurance of Medical Ontologies , 2006, Methods of Information in Medicine.

[55]  Jean-Pierre Corriveau,et al.  A globally optimal k-anonymity method for the de-identification of health data. , 2009, Journal of the American Medical Informatics Association : JAMIA.

[56]  Josep Domingo-Ferrer,et al.  Practical Data-Oriented Microaggregation for Statistical Disclosure Control , 2002, IEEE Trans. Knowl. Data Eng..

[57]  Cynthia Dwork,et al.  Differential Privacy , 2006, ICALP.

[58]  Antonin Guttman,et al.  R-trees: a dynamic index structure for spatial searching , 1984, SIGMOD '84.

[59]  Aris Gkoulalas-Divanis,et al.  PCTA: privacy-constrained clustering-based transaction data anonymization , 2011, PAIS '11.

[60]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[61]  Joshua C. Denny,et al.  The disclosure of diagnosis codes can breach research participants' privacy , 2010, J. Am. Medical Informatics Assoc..

[62]  Pierangela Samarati,et al.  Protecting Respondents' Identities in Microdata Release , 2001, IEEE Trans. Knowl. Data Eng..

[63]  Grigorios Loukides,et al.  Towards Preference-Constrained k-Anonymisation , 2009, DASFAA Workshops.

[64]  Philip S. Yu,et al.  A Condensation Approach to Privacy Preserving Data Mining , 2004, EDBT.

[65]  Panos Kalnis,et al.  Local and global recoding methods for anonymizing set-valued data , 2010, The VLDB Journal.