Automatic de-identification of electronic medical records using token-level and character-level conditional random fields

De-identification, identifying and removing all protected health information (PHI) present in clinical data including electronic medical records (EMRs), is a critical step in making clinical data publicly available. The 2014 i2b2 (Center of Informatics for Integrating Biology and Bedside) clinical natural language processing (NLP) challenge sets up a track for de-identification (track 1). In this study, we propose a hybrid system based on both machine learning and rule approaches for the de-identification track. In our system, PHI instances are first identified by two (token-level and character-level) conditional random fields (CRFs) and a rule-based classifier, and then are merged by some rules. Experiments conducted on the i2b2 corpus show that our system submitted for the challenge achieves the highest micro F-scores of 94.64%, 91.24% and 91.63% under the "token", "strict" and "relaxed" criteria respectively, which is among top-ranked systems of the 2014 i2b2 challenge. After integrating some refined localization dictionaries, our system is further improved with F-scores of 94.83%, 91.57% and 91.95% under the "token", "strict" and "relaxed" criteria respectively.

[1]  Hua Xu,et al.  Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features , 2013, BMC Medical Informatics and Decision Making.

[2]  Ozlem Uzuner,et al.  Second i2b2 workshop on natural language processing challenges for clinical records. , 2008, AMIA ... Annual Symposium proceedings. AMIA Symposium.

[3]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[4]  John R. Gilbertson,et al.  Evaluation of a deidentification (De-Id) software engine to share pathology reports and clinical documents for research. , 2004 .

[5]  S. Meystre,et al.  Automatic de-identification of textual documents in the electronic health record: a review of recent research , 2010, BMC medical research methodology.

[6]  Clement J. McDonald,et al.  Application of Information Technology: A Software Tool for Removing Patient Identifying Information from Clinical Documents , 2008, J. Am. Medical Informatics Assoc..

[7]  Özlem Uzuner,et al.  Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus , 2015, J. Biomed. Informatics.

[8]  S. Meystre,et al.  Evaluating current automatic de-identification methods with Veteran’s health administration clinical documents , 2012, BMC Medical Research Methodology.

[9]  Clement J. McDonald,et al.  A successful technique for removing names in pathology reports using an augmented search and replace method , 2002, AMIA.

[10]  Peter Szolovits,et al.  Automated de-identification of free-text medical records , 2008, BMC Medical Informatics Decis. Mak..

[11]  L. Sweeney Replacing personally-identifying information in medical records, the Scrub system. , 1996, Proceedings : a conference of the American Medical Informatics Association. AMIA Fall Symposium.

[12]  Lynn A. Karoly,et al.  Health Insurance Portability and Accountability Act of 1996 (HIPAA) Administrative Simplification , 2010, Practice Management Consultant.

[13]  Christopher D. Manning,et al.  Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling , 2005, ACL.

[14]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[15]  Yoav Freund,et al.  A decision-theoretic generalization of on-line learning and an application to boosting , 1997, EuroCOLT.

[16]  Róbert Busa-Fekete,et al.  State-of-the-art anonymization of medical records using an iterative machine learning framework. , 2007 .

[17]  Peter Szolovits,et al.  Evaluating the state-of-the-art in automatic de-identification. , 2007, Journal of the American Medical Informatics Association : JAMIA.

[18]  Ulysses J. Balis,et al.  Development and evaluation of an open source software tool for deidentification of pathology reports , 2006, BMC Medical Informatics Decis. Mak..

[19]  Hua Xu,et al.  A hybrid system for temporal information extraction from clinical text , 2013, J. Am. Medical Informatics Assoc..

[20]  Son Doan,et al.  Application of information technology: MedEx: a medication information extraction system for clinical narratives , 2010, J. Am. Medical Informatics Assoc..

[21]  Robert H. Baud,et al.  Medical document anonymization with a semantic lexicon , 2000, AMIA.

[22]  Keith Marsolo,et al.  Large-scale evaluation of automated clinical note de-identification and its impact on information extraction , 2013, J. Am. Medical Informatics Assoc..

[23]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[24]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[25]  Alexander A. Morgan,et al.  Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records , 2007, J. Am. Medical Informatics Assoc..

[26]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[27]  Xiaolong Wang,et al.  Evaluating Word Representation Features in Biomedical Named Entity Recognition Tasks , 2014, BioMed research international.