A unified framework for evaluating the risk of re-identification of text de-identification tools

OBJECTIVES It has become regular practice to use automatic methods to de-identify unstructured medical text for research. The goal is to remove patient-identifying information and thereby minimize re-identification risk. The metrics commonly used to judge whether these systems perform well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases.
METHODS We apply the proposed evaluation framework to a data set from the University of Michigan Medical School. We then compare our risk assessment results with those that would be obtained from a typical contemporary micro-average evaluation of recall, to illustrate the difference between the proposed framework and the current baseline method.
RESULTS We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. Using our evaluation framework, we obtained a mean probability of re-identification of 0.0074 for direct identifiers and 0.0022 for quasi-identifiers. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on approaches previously used in the literature. The threshold for quasi-identifiers was determined from the context of the data release, following commonly used de-identification criteria for structured data.
DISCUSSION Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions made under the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification.
CONCLUSIONS This framework should be used as a basis for computing re-identification risk in order to more realistically evaluate future text de-identification tools.
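
As a rough illustration of why a micro-average evaluation of recall can look optimistic, the sketch below pools counts across identifier categories and compares the result with per-category recall. It also computes an approximate 95% confidence interval for a leak proportion. This is not the framework described above: the function names, the category labels, and the counts are all hypothetical, and the Agresti-Coull interval is only one assumed choice of interval estimator.

```python
import math


def micro_recall(counts):
    """Pooled recall: counts maps category -> (true_positives, false_negatives)."""
    tp = sum(t for t, _ in counts.values())
    fn = sum(f for _, f in counts.values())
    return tp / (tp + fn)


def per_category_recall(counts):
    """Recall computed separately for each identifier category."""
    return {cat: t / (t + f) for cat, (t, f) in counts.items()}


def agresti_coull_interval(successes, trials, z=1.96):
    """Approximate confidence interval for a binomial proportion.

    z=1.96 corresponds to a two-sided 95% interval.
    """
    n_adj = trials + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical counts for illustration only (not from the study):
# a frequent category detected well and a rare category detected less well.
toy_counts = {"DATE": (990, 10), "NAME": (45, 5)}

pooled = micro_recall(toy_counts)
by_category = per_category_recall(toy_counts)

# Interval for the proportion of NAME instances that were missed (leaked).
name_tp, name_fn = toy_counts["NAME"]
leak_low, leak_high = agresti_coull_interval(name_fn, name_tp + name_fn)

print(f"micro-average recall: {pooled:.4f}")
for cat, value in by_category.items():
    print(f"{cat} recall: {value:.4f}")
print(f"NAME leak rate 95% interval: ({leak_low:.4f}, {leak_high:.4f})")
```

In this toy setting the pooled figure is dominated by the frequent category, so the weaker recall on the rare category is largely hidden. Comparing the upper bound of an interval, rather than a point estimate, against a threshold is the more conservative check.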
