Automatic de-identification of medical records with a multilevel hybrid semi-supervised learning approach

In recent years, sharing electronic medical records (EMRs) with researchers outside the originating institutions has become increasingly important. To preserve the privacy of the patients and institutions involved, the EMRs must be de-identified before they are shared. Although de-identification has been studied worldwide with positive results, notably in the i2b2 (Informatics for Integrating Biology and the Bedside) shared tasks of 2006 and 2014, it is not yet a solved problem and needs further investigation under realistic conditions. In this paper, we propose an automatic de-identification solution in a multilevel hybrid semi-supervised learning paradigm, with a key focus on correctly identifying protected health information (PHI) in EMRs. Like existing work, our approach is hybrid: it combines a machine learning-based method using a conditional random fields model with a rule-based method, applied in a post-processing phase, to handle ambiguous PHI types. Nevertheless, our work is more general and practical. First, it takes into account the structural complexity of each EMR, so that each section is processed according to whether it is structured, semi-structured, or unstructured, which leads to more accurate PHI identification. Second, each EMR is examined at three levels of granularity: the token level in the supervised learning phase, the entity level in the rule-based post-processing phase, and the section level, together with its structural complexity, in the semi-supervised learning phase. Examining each EMR at these levels gives a more thorough view of it and makes the approach more effective. Third, our solution is carried out in a self-training manner, so that in practice it can start from a small annotated data set and become more effective as new EMRs arrive over time.
Evaluated on the i2b2 data set against related work, our solution achieves higher F-measure values for the AGE, LOCATION, and PHONE PHI types and comparable values for the other PHI types.
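
The self-training idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the conditional random fields model is replaced here by a toy word-shape tagger so that the example runs with the standard library only, and the names `ShapeTagger`, `self_train`, the confidence `threshold`, and the sample tokens are all hypothetical. The loop trains on a small labeled set, labels the unlabeled pool, keeps only confident predictions as new training data, and retrains.

```python
from collections import Counter, defaultdict


def shape(token):
    """Coarse word shape: digits -> '9', letters -> 'a'/'A', other chars kept."""
    out = []
    for ch in token:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)
    return "".join(out)


class ShapeTagger:
    """Toy stand-in for a CRF (assumption): label frequencies per word shape."""

    def fit(self, tokens, labels):
        self.counts = defaultdict(Counter)
        for tok, lab in zip(tokens, labels):
            self.counts[shape(tok)][lab] += 1
        return self

    def predict(self, token):
        """Return (label, confidence); unseen shapes get ('O', 0.0)."""
        c = self.counts.get(shape(token))
        if not c:
            return "O", 0.0
        label, n = c.most_common(1)[0]
        return label, n / sum(c.values())


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Grow the training set with confident predictions, then retrain."""
    tokens = [t for t, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    model = ShapeTagger().fit(tokens, labels)
    for _ in range(max_rounds):
        confident, rest = [], []
        for tok in pool:
            label, conf = model.predict(tok)
            (confident if conf >= threshold else rest).append((tok, label))
        if not confident:
            break  # nothing new to learn from
        for tok, label in confident:
            tokens.append(tok)
            labels.append(label)
        pool = [tok for tok, _ in rest]
        model = ShapeTagger().fit(tokens, labels)
    return model, pool


# Made-up tokens for illustration only (not from the i2b2 data).
seed = [("555-1234", "PHONE"), ("John", "NAME"), ("the", "O")]
model, leftover = self_train(seed, ["867-5309", "Mary", "???"])
```

In this sketch, tokens whose shape was never seen stay in `leftover` instead of being pseudo-labeled, which reflects the usual design choice in self-training of adding only high-confidence predictions to the training set.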
