A globally optimal k-anonymity method for the de-identification of health data.

BACKGROUND Explicit patient consent requirements in privacy laws can have a negative impact on health research, leading to selection bias and reduced recruitment. Legislative requirements to obtain consent are often waived if the information collected or disclosed is de-identified.

OBJECTIVE The authors developed and empirically evaluated a new globally optimal de-identification algorithm that satisfies the k-anonymity criterion and is suitable for health datasets.

DESIGN The authors compared OLA (Optimal Lattice Anonymization) empirically to three existing k-anonymity algorithms (Datafly, Samarati, and Incognito) on six public, hospital, and registry datasets for different values of k and different suppression limits.

MEASUREMENT Three information loss metrics were used for the comparison: precision, the discernability metric, and non-uniform entropy. Each algorithm's execution speed was also evaluated.

RESULTS The Datafly and Samarati algorithms had higher information loss than OLA and Incognito. OLA was consistently faster than Incognito in finding the globally optimal de-identification solution.

CONCLUSIONS For the de-identification of health datasets, OLA is an improvement on existing k-anonymity algorithms in terms of information loss and performance.
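To make the k-anonymity criterion and one of the information loss metrics concrete, the following is a minimal Python sketch. It is not OLA and not the authors' code. It assumes the commonly cited form of the discernability metric, in which each record in an equivalence class of size at least k costs the class size and each record in a smaller (suppressed) class costs the size of the whole dataset; the paper may use a variant. The function names, field names, and toy records are hypothetical.

```python
from collections import Counter

def equivalence_classes(records, quasi_identifiers):
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    classes = equivalence_classes(records, quasi_identifiers)
    return all(size >= k for size in classes.values())

def discernability(records, quasi_identifiers, k):
    """Assumed definition: size^2 for classes of size >= k,
    n * size for classes below k (treated as suppressed)."""
    n = len(records)
    classes = equivalence_classes(records, quasi_identifiers)
    return sum(size * size if size >= k else n * size
               for size in classes.values())

# Hypothetical, already-generalized toy data (age in 10-year bands).
records = [
    {"age": "30-39", "sex": "F", "dx": "asthma"},
    {"age": "30-39", "sex": "F", "dx": "flu"},
    {"age": "40-49", "sex": "M", "dx": "diabetes"},
    {"age": "40-49", "sex": "M", "dx": "asthma"},
    {"age": "40-49", "sex": "M", "dx": "flu"},
]
qi = ["age", "sex"]

print(is_k_anonymous(records, qi, 2))   # classes of size 2 and 3
print(is_k_anonymous(records, qi, 3))   # the size-2 class is too small
print(discernability(records, qi, 2))   # 2*2 + 3*3
print(discernability(records, qi, 3))   # 5*2 + 3*3
```

Under this definition a lower value indicates less information loss, so a search over candidate generalizations would prefer the k-anonymous candidate with the smallest score.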
