De-identification of Unstructured Clinical Data for Patient Privacy Protection

The adoption of Electronic Health Record (EHR) systems is growing at a fast pace in the United States and in Europe, and this growth results in very large quantities of patient clinical information becoming available in electronic format, with tremendous potential, but also equally growing concern for patient confidentiality breaches. Secondary use of clinical information is essential to fulfil the promises for high quality healthcare, improved healthcare management, and effective clinical research. De-identification of patient information has been proposed as a solution to both facilitate secondary use of clinical information, and protect patient information confidentiality. Most clinical information found in the EHR is unstructured and represented as narrative text, and de-identification of clinical text is a tedious and costly manual endeavor. Automated approaches based on Natural Language Processing have been implemented and evaluated, allowing for much faster de-identification than manual approaches. This chapter introduces clinical text-de-identification in general, and then focuses on recent efforts and studies at the U.S. Veterans Health Administration. It includes the origins and definition of text de-identification in the United States and Europe and a discussion about text anonymization. It also presents methods applied for text de-identification, examples of clinical text de-identification applications, and U.S. Veterans Health Administration clinical text de-identification efforts.

[1]  James J. Lu,et al.  HIDE: heterogeneous information DE-identification , 2009, EDBT '09.

[2]  Lynette Hirschman,et al.  Hiding in plain sight: use of realistic surrogates to reduce exposure of protected health information in clinical text , 2013, J. Am. Medical Informatics Assoc..

[3]  Róbert Busa-Fekete,et al.  State-of-the-art anonymization of medical records using an iterative machine learning framework. , 2007 .

[4]  Khaled El Emam,et al.  Estimating the re-identification risk of clinical data sets , 2012, BMC Medical Informatics and Decision Making.

[5]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[6]  Deborah A. Nichols,et al.  Strategies for De-identification and Anonymization of Electronic Health Record Data for Use in Multicenter Research Studies , 2012, Medical care.

[7]  Ulysses J. Balis,et al.  Development and evaluation of an open source software tool for deidentification of pathology reports , 2006, BMC Medical Informatics Decis. Mak..

[8]  Jeffrey E. F. Friedl Mastering Regular Expressions , 1997 .

[9]  L. Sweeney Replacing personally-identifying information in medical records, the Scrub system. , 1996, Proceedings : a conference of the American Medical Informatics Association. AMIA Fall Symposium.

[10]  Stéphane M. Meystre,et al.  Text de-identification for privacy protection: A study of its impact on clinical text information content , 2014, J. Biomed. Informatics.

[11]  Angus Roberts,et al.  Development and evaluation of a de-identification procedure for a case register sourced from mental health electronic records , 2013, BMC Medical Informatics and Decision Making.

[12]  Ronald Rosenfeld,et al.  A maximum entropy approach to adaptive statistical language modelling , 1996, Comput. Speech Lang..

[13]  Alexander A. Morgan,et al.  Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records , 2007, J. Am. Medical Informatics Assoc..

[14]  Z. Galil,et al.  Pattern matching algorithms , 1997 .

[15]  D. Blumenthal,et al.  The "meaningful use" regulation for electronic health records. , 2010, The New England journal of medicine.

[16]  Peter Szolovits,et al.  Automated de-identification of free-text medical records , 2008, BMC Medical Informatics Decis. Mak..

[17]  Shuying Shen,et al.  Generalizability and Comparison of Automatic Clinical Text De-Identification Methods and Resources , 2012, AMIA.

[18]  Shuying Shen,et al.  BoB, a best-of-breed automated text de-identification system for VHA clinical documents , 2013, J. Am. Medical Informatics Assoc..

[19]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[20]  Marti A. Hearst Trends & Controversies: Support Vector Machines , 1998, IEEE Intell. Syst..

[21]  Clement J. McDonald,et al.  Application of Information Technology: A Software Tool for Removing Patient Identifying Information from Clinical Documents , 2008, J. Am. Medical Informatics Assoc..

[22]  Lynette Hirschman,et al.  The MITRE Identification Scrubber Toolkit: Design, training, and assessment , 2010, Int. J. Medical Informatics.

[23]  John F. Hurdle,et al.  Assessing the Difficulty and Time Cost of De-identification in Clinical Narratives , 2006, Methods of Information in Medicine.

[24]  Pierre Zweigenbaum,et al.  Testing Tactics to Localize De-Identification , 2009, MIE.

[25]  S. Meystre,et al.  Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents , 2012, BMC Medical Research Methodology.

[26]  George Hripcsak,et al.  Using a pipeline to improve de-identification performance , 2009, AMIA.

[27]  Shuying Shen,et al.  2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text , 2011, J. Am. Medical Informatics Assoc..

[28]  S. Meystre,et al.  Automatic de-identification of textual documents in the electronic health record: a review of recent research , 2010, BMC medical research methodology.

[29]  Shuying Shen,et al.  Can Physicians Recognize Their Own Patients in De-identified Notes? , 2014, MIE.

[30]  Susan Lucci,et al.  Transcription and EHRs. Benefits of a blended approach. , 2010, Journal of AHIMA.

[31]  Mukesh K. Mohania,et al.  Efficient techniques for document sanitization , 2008, CIKM '08.

[32]  Sumithra Velupillai,et al.  Developing a standard for de-identifying electronic patient records written in Swedish: Precision, recall and F-measure in a manual and computerized annotation trial , 2009, Int. J. Medical Informatics.

[33]  Lynette Hirschman,et al.  Effects of personal identifier resynthesis on clinical text de-identification , 2010, J. Am. Medical Informatics Assoc..

[34]  J. Gilbertson,et al.  Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. , 2004, American journal of clinical pathology.

[35]  Bradley Malin,et al.  Evaluating re-identification risks with respect to the HIPAA privacy rule , 2010, J. Am. Medical Informatics Assoc..

[36]  J. Ross Quinlan,et al.  Induction of Decision Trees , 1986, Machine Learning.

[37]  Peter Szolovits,et al.  Evaluating the state-of-the-art in automatic de-identification. , 2007, Journal of the American Medical Informatics Association : JAMIA.

[38]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.