CRFs based de-identification of medical records

De-identification is a shared task of the 2014 i2b2/UTHealth challenge. The purpose of this task is to remove protected health information (PHI) from medical records. In this paper, we propose a novel de-identifier, WI-deId, based on conditional random fields (CRFs). A preprocessing module, which tokenizes the medical records using regular expressions and an off-the-shelf tokenizer, is introduced, and three groups of features are extracted to train the de-identifier model. The experiment shows that our system is effective in the de-identification of medical records, achieving a micro-F1 of 0.9232 at the i2b2 strict entity evaluation level.

[1]  Alexander A. Morgan,et al.  Research Paper: Rapidly Retargetable Approaches to De-identification in Medical Records , 2007, J. Am. Medical Informatics Assoc..

[2]  S. Meystre,et al.  Automatic de-identification of textual documents in the electronic health record: a review of recent research , 2010, BMC medical research methodology.

[3]  Peter Szolovits,et al.  A de-identifier for medical discharge summaries , 2008, Artif. Intell. Medicine.

[4]  Yi Guan,et al.  Transfer learning based clinical concept extraction on data from multiple sources , 2014, J. Biomed. Informatics.

[5]  Anthony N. Nguyen,et al.  De-identification of health records using Anonym: Effectiveness and robustness across datasets , 2014, Artif. Intell. Medicine.

[6]  Peter Szolovits,et al.  Evaluating the state-of-the-art in automatic de-identification. , 2007, Journal of the American Medical Informatics Association : JAMIA.

[7]  Sophia Ananiadou,et al.  Developing a Robust Part-of-Speech Tagger for Biomedical Text , 2005, Panhellenic Conference on Informatics.

[8]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[9]  Peter Szolovits,et al.  Automated de-identification of free-text medical records , 2008, BMC Medical Informatics Decis. Mak..

[10]  Róbert Busa-Fekete,et al.  State-of-the-art anonymization of medical records using an iterative machine learning framework. , 2007 .

[11]  Henrik Boström,et al.  De-identifying health records by means of active learning , 2012, ICML 2012.

[12]  Özlem Uzuner,et al.  Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus , 2015, J. Biomed. Informatics.