An evaluation of feature sets and sampling techniques for de-identification of medical records

De-identification of textual medical records is critical for any health informatics system that aims to facilitate research and sharing of medical records. While statistical learning-based techniques have shown promising results for de-identification, few such systems are publicly available. Building an accurate and efficient system remains a challenge for practitioners because it involves a significant amount of feature engineering, i.e., the creation and examination of new features used in the system. A comprehensive evaluation is needed to understand the effects of different feature sets, the potential impact of sampling, and the trade-offs among the often conflicting goals of precision (positive predictive value), recall (sensitivity), and efficiency. In this paper, we present the Health Information DE-identification (HIDE) framework and evaluate the open-source software. We evaluate the various types of features used in HIDE, and we introduce a window sampling technique, in which only the terms within a specified distance of personal health information are used to train the classifier, and evaluate its effect on both quality and efficiency. Our results show that context features (previous and next terms) are particularly important and that the sampling technique can be used to increase recall with minimal impact on precision. We obtained token-level label precision of 0.967, recall of 0.986, and F-score of 0.977 when true negatives are not included. The overall HIDE system achieves token-level precision of 0.998, recall of 0.999, and F-score of 0.999 on the previous i2b2 challenge task.
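The window sampling idea described in the abstract can be illustrated with a minimal sketch. It is not the HIDE implementation: the function name `window_sample`, the `"O"` label for non-PHI tokens, the `NAME` and `DATE` labels, and the example sentence are all assumptions made for illustration. The sketch keeps a token for training only if it lies within `window` positions of a token labeled as personal health information.

```python
from typing import List, Tuple

NON_PHI = "O"  # assumed label for tokens that are not personal health information


def window_sample(tokens: List[str], labels: List[str], window: int) -> List[Tuple[str, str]]:
    """Keep only tokens within `window` positions of any PHI-labeled token.

    Hypothetical helper illustrating window sampling; not taken from HIDE.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    keep = [False] * len(tokens)
    for i, label in enumerate(labels):
        if label != NON_PHI:
            # Mark the PHI token and its neighbours on both sides.
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                keep[j] = True

    return [(tok, lab) for tok, lab, k in zip(tokens, labels, keep) if k]


# Illustrative, made-up sentence with assumed PHI labels.
tokens = "Patient John Smith was admitted on 03/12/2005 with mild chest pain".split()
labels = ["O", "NAME", "NAME", "O", "O", "O", "DATE", "O", "O", "O", "O"]

sampled = window_sample(tokens, labels, window=1)
for tok, lab in sampled:
    print(tok, lab)
```

With `window=1`, the distant non-PHI tokens ("admitted", "mild", "chest", "pain") are dropped from the training sample, which shrinks the training set and raises the proportion of PHI tokens in it. How the window size is chosen and how the sampled tokens are fed to the classifier are left open here.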
