BoB, a best-of-breed automated text de-identification system for VHA clinical documents

OBJECTIVE De-identification allows faster and more collaborative clinical research while protecting patient confidentiality. Clinical narrative de-identification is a tedious process that can be alleviated by automated natural language processing methods. The goal of this research is the development of an automated text de-identification system for Veterans Health Administration (VHA) clinical documents. MATERIALS AND METHODS We devised a novel stepwise hybrid approach designed to improve the current strategies used for text de-identification. The proposed system is based on a previous study on the best de-identification methods for VHA documents. This best-of-breed automated clinical text de-identification system (aka BoB) tackles the problem as two separate tasks: (1) maximize patient confidentiality by redacting as much protected health information (PHI) as possible; and (2) leave de-identified documents in a usable state preserving as much clinical information as possible. RESULTS We evaluated BoB with a manually annotated corpus of a variety of VHA clinical notes, as well as with the 2006 i2b2 de-identification challenge corpus. We present evaluations at the instance- and token-level, with detailed results for BoB's main components. Moreover, an existing text de-identification system was also included in our evaluation. DISCUSSION BoB's design efficiently takes advantage of the methods implemented in its pipeline, resulting in high sensitivity values (especially for sensitive PHI categories) and a limited number of false positives. CONCLUSIONS Our system successfully addressed VHA clinical document de-identification, and its hybrid stepwise design demonstrates robustness and efficiency, prioritizing patient confidentiality while leaving most clinical information intact.

[1]  Bandman El Protection of human subjects. , 1980, JAMA.

[2]  S. Meystre,et al.  Automatic de-identification of textual documents in the electronic health record: a review of recent research , 2010, BMC medical research methodology.

[3]  Ralph Grishman,et al.  Message Understanding Conference- 6: A Brief History , 1996, COLING.

[4]  Chih-Jen Lin,et al.  LIBLINEAR: A Library for Large Linear Classification , 2008, J. Mach. Learn. Res..

[5]  Peter Szolovits,et al.  Evaluating the state-of-the-art in automatic de-identification. , 2007, Journal of the American Medical Informatics Association : JAMIA.

[6]  Ulysses J. Balis,et al.  Development and evaluation of an open source software tool for deidentification of pathology reports , 2006, BMC Medical Informatics Decis. Mak..

[7]  J. Gilbertson,et al.  Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. , 2004, American journal of clinical pathology.

[8]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.

[9]  Lynette Hirschman,et al.  Measuring Risk and Information Preservation: Toward New Metrics for De-identification of Clinical Texts , 2010, Louhi@NAACL-HLT.

[10]  Peter Szolovits,et al.  Automated de-identification of free-text medical records , 2008, BMC Medical Informatics Decis. Mak..

[11]  S. Meystre,et al.  Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents , 2012, BMC Medical Research Methodology.

[12]  Sunghwan Sohn,et al.  Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications , 2010, J. Am. Medical Informatics Assoc..

[13]  Li Xiong,et al.  An integrated framework for de-identifying unstructured medical data , 2009, Data Knowl. Eng..

[14]  Sumithra Velupillai,et al.  Developing a standard for de-identifying electronic patient records written in Swedish: Precision, recall and F-measure in a manual and computerized annotation trial , 2009, Int. J. Medical Informatics.

[15]  Clement J. McDonald,et al.  Application of Information Technology: A Software Tool for Removing Patient Identifying Information from Clinical Documents , 2008, J. Am. Medical Informatics Assoc..

[16]  Pam Dixon Medical Identity Theft: the Information Crime That Can Kill You , 2006 .

[17]  Lynette Hirschman,et al.  The MITRE Identification Scrubber Toolkit: Design, training, and assessment , 2010, Int. J. Medical Informatics.

[18]  Lyle H. Ungar,et al.  A system for de-identifying medical message board text , 2010, 2010 Ninth International Conference on Machine Learning and Applications.

[19]  Róbert Busa-Fekete,et al.  State-of-the-art anonymization of medical records using an iterative machine learning framework. , 2007 .

[20]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[21]  Mark A. Przybocki,et al.  The Automatic Content Extraction (ACE) Program – Tasks, Data, and Evaluation , 2004, LREC.

[22]  Peter Szolovits,et al.  A de-identifier for medical discharge summaries , 2008, Artif. Intell. Medicine.

[23]  M. Hepple,et al.  Identifying Personal Health Information Using Support Vector Machines , 2006 .

[24]  Alexander Dekhtyar,et al.  Information Retrieval , 2018, Lecture Notes in Computer Science.

[25]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[26]  Lynette Hirschman,et al.  Effects of personal identifier resynthesis on clinical text de-identification , 2010, J. Am. Medical Informatics Assoc..

[27]  Alexander A. Morgan,et al.  Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records , 2007, J. Am. Medical Informatics Assoc..