Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text

OBJECTIVE Secondary use of clinical text is impeded by a lack of highly effective, low-cost de-identification methods. Both manual and automated methods for removing protected health information are known to leave behind residual identifiers. The authors propose a novel approach to the residual identifier problem based on the theory of Hiding In Plain Sight (HIPS).

MATERIALS AND METHODS HIPS relies on obfuscation to conceal residual identifiers. According to this theory, replacing the detected identifiers with realistic but synthetic surrogates should make the few 'leaked' identifiers difficult to distinguish from the surrogates that surround them. The authors conducted a pilot study to test this theory on clinical narrative de-identified by an automated system. Test corpora included 31 oncology and 50 family practice progress notes, read by two trained chart abstractors and an informaticist.

RESULTS Experimental results suggest that approximately 90% of residual identifiers can be effectively concealed by the HIPS approach in text containing average and high densities of personal identifying information.

DISCUSSION This pilot test suggests that HIPS is feasible but requires further evaluation. The results need to be replicated on larger corpora of diverse origin under a range of detection scenarios. Error analyses also suggest areas where surrogate generation techniques can be refined to improve efficacy.

CONCLUSIONS If these results generalize to existing high-performing de-identification systems with recall rates of 94-98%, HIPS could increase the effective de-identification rates of these systems to levels above 99% without further advancements in system recall. Additional and more rigorous assessment of the HIPS approach is warranted.
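The arithmetic behind the "above 99%" figure can be checked with a minimal sketch. It assumes, for illustration only, that the roughly 90% concealment rate applies uniformly to whatever identifiers the system misses, so that the effective rate is recall plus the concealed share of the residual. The function name `effective_rate` and this simple additive model are assumptions, not a formula given by the authors.

```python
def effective_rate(recall: float, concealment: float) -> float:
    """Fraction of identifiers that are either removed by the system
    or left in the text but hidden among surrogates.

    Assumed model (not from the source): concealment applies uniformly
    to the residual identifiers, independent of the system's recall.
    """
    if not (0.0 <= recall <= 1.0 and 0.0 <= concealment <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    residual = 1.0 - recall  # identifiers the system failed to detect
    return recall + residual * concealment


# Recall values spanning the 94-98% range quoted in the abstract,
# combined with the approximately 90% concealment from the pilot study.
for recall in (0.94, 0.96, 0.98):
    rate = effective_rate(recall, 0.90)
    print(f"recall={recall:.2f} -> effective rate={rate:.3f}")
```

Under this assumed model, even the lower bound of 94% recall yields an effective rate of 0.94 + 0.06 × 0.90 = 0.994, which is consistent with the claim of levels above 99%.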
