Optimizing annotation resources for natural language de-identification via a game theoretic framework

OBJECTIVE Electronic medical records (EMRs) are increasingly repurposed for activities beyond clinical care, such as supporting translational research and public policy analysis. To mitigate privacy risks, healthcare organizations (HCOs) aim to remove potentially identifying patient information. A substantial quantity of EMR data is in natural language form, and there are concerns that automated tools for detecting identifiers are imperfect and leak information that can be exploited by ill-intentioned data recipients. Thus, HCOs have been encouraged to invest as much effort as possible in detecting potential identifiers, but such a strategy assumes the recipients are sufficiently incentivized and capable of exploiting leaked identifiers. In practice, this assumption may not hold, and HCOs may overinvest in de-identification technology. The goal of this study is to design a natural language de-identification framework, rooted in game theory, that enables an HCO to optimize its investments given the expected capabilities of an adversarial recipient.

METHODS We introduce a Stackelberg game to balance risk and utility in natural language de-identification. This game represents a cost-benefit model that enables an HCO with a fixed budget to minimize its investment in the de-identification process. We evaluate this model by assessing the overall payoff to the HCO and the adversary using 2100 clinical notes from Vanderbilt University Medical Center. We simulate several policy alternatives using a range of parameters, including the cost of training a de-identification model and the loss in data utility due to the removal of terms that are not identifiers. In addition, we compare policy options in which the monetary penalty levied on an attacker for misuse is paid to the publishing HCO as opposed to a third party (e.g., a federal regulator).
RESULTS Our results show that when an HCO is forced to exhaust a limited budget (set to $2000 in the study), its de-identification achieves a precision of 0.86 and a recall of 0.8. A game-based approach enables a more refined cost-benefit tradeoff, improving both privacy and utility for the HCO. For example, our investigation shows that it is possible for an HCO to release the data without spending its entire budget on de-identification and still deter the attacker, with a precision of 0.77 and a recall of 0.61 for the de-identification. There also exist scenarios in which the model indicates an HCO should not release any data because the risk is too great. In addition, we find that the practice of paying fines back to an HCO (an artifact of suing for breach of contract), as opposed to a third party such as a federal regulator, can induce an elevated level of data sharing risk, where the HCO is incentivized to bait the attacker to elicit compensation.

CONCLUSIONS A game theoretic framework can lead HCOs to optimized decision making in natural language de-identification investments before sharing EMR data.
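
The leader-follower structure described in METHODS can be illustrated with a minimal Python sketch. It is not the study's model: the payoff formulas, every parameter value, and the cost assigned to each strategy are hypothetical, chosen only to show how a Stackelberg game is solved by letting the HCO (leader) anticipate the attacker's (follower's) best response. Only the two precision/recall pairs (0.86/0.8 and 0.77/0.61) and the $2000 budget are borrowed from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Strategy:
    """A de-identification investment level available to the HCO (leader)."""
    name: str
    cost: float       # dollars spent on annotation and model training (hypothetical)
    recall: float     # fraction of true identifiers removed
    precision: float  # fraction of removed terms that were true identifiers


@dataclass(frozen=True)
class Params:
    """All values are illustrative assumptions, not taken from the study."""
    n_identifiers: int = 1000        # identifiers present in the corpus
    budget: float = 2000.0           # HCO budget (matches the $2000 in the abstract)
    share_benefit: float = 5000.0    # value to the HCO of releasing the data
    utility_loss_per_fp: float = 1.0 # loss per non-identifier removed by mistake
    harm_per_id: float = 10.0        # HCO loss per identifier exploited
    gain_per_id: float = 5.0         # attacker gain per leaked identifier
    attack_cost: float = 1500.0      # attacker's fixed cost of mounting an attack
    detect_prob: float = 0.5         # chance a misuse is detected and fined
    fine: float = 1000.0             # penalty when detected
    fine_to_hco: bool = False        # True: fine is paid to the HCO, not a third party


def attacker_payoff(s: Strategy, p: Params) -> float:
    """Follower's expected payoff from attacking data released under strategy s."""
    leaked = p.n_identifiers * (1.0 - s.recall)
    return leaked * p.gain_per_id - p.attack_cost - p.detect_prob * p.fine


def attacker_attacks(s: Strategy, p: Params) -> bool:
    """The follower attacks only if attacking beats its payoff of 0 for abstaining."""
    return attacker_payoff(s, p) > 0.0


def hco_payoff(s: Strategy, p: Params) -> float:
    """Leader's payoff, given the follower's best response to s."""
    true_pos = p.n_identifiers * s.recall
    # Non-identifiers removed by mistake, derived from precision = TP / (TP + FP).
    false_pos = true_pos * (1.0 - s.precision) / s.precision
    payoff = p.share_benefit - s.cost - false_pos * p.utility_loss_per_fp
    if attacker_attacks(s, p):
        payoff -= p.n_identifiers * (1.0 - s.recall) * p.harm_per_id
        if p.fine_to_hco:
            payoff += p.detect_prob * p.fine
    return payoff


def solve(strategies: List[Strategy], p: Params) -> Tuple[Optional[Strategy], float]:
    """Return the leader's best affordable strategy; None means 'do not share'."""
    best: Optional[Strategy] = None
    best_payoff = 0.0  # payoff of not releasing any data
    for s in strategies:
        if s.cost > p.budget:
            continue
        value = hco_payoff(s, p)
        if value > best_payoff:
            best, best_payoff = s, value
    return best, best_payoff


# Hypothetical strategy menu; costs and the "minimal" row are invented.
STRATEGIES = [
    Strategy("full", cost=2000.0, recall=0.80, precision=0.86),
    Strategy("partial", cost=1200.0, recall=0.61, precision=0.77),
    Strategy("minimal", cost=300.0, recall=0.30, precision=0.60),
]
```

Under these assumed defaults, `solve(STRATEGIES, Params())` selects the "partial" strategy: it costs less than the full budget, yet the attacker's expected payoff is still negative, so no attack occurs. Raising `gain_per_id` and `harm_per_id` far enough makes every release option worse than withholding, and `solve` returns `None`. Setting `fine_to_hco=True` adds the expected fine to the HCO's payoff whenever it is attacked, which is the mechanism through which an incentive to bait the attacker could appear in a model of this kind.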
