Proposal and evaluation of FASDIM, a Fast And Simple De-Identification Method for unstructured free-text clinical records

PURPOSE Medical free-text records provide rich information about patients, but they often need to be de-identified by removing Protected Health Information (PHI) whenever patient identification is not required. Pattern-matching techniques require predefined dictionaries, and machine-learning techniques require an extensive training set. Methods exist for French, but they either yield weak results or are not freely available. The objective is to define and evaluate FASDIM, a Fast And Simple De-Identification Method for French medical free-text records.

METHODS FASDIM removes every word that is not present in a list of authorized words, and every number except those that match a list of protection patterns. Both lists are extended over successive iterations of the method. For the evaluation, the workload is estimated while records are being de-identified. The effectiveness of the de-identification is assessed by independent medical experts on 508 randomly selected discharge letters de-identified by FASDIM. Finally, the letters are encoded before and after de-identification according to 3 terminologies (ATC, ICD10, CCAM), and the resulting codes are compared.

RESULTS The list of authorized words is built progressively: 12 hours for the first 7,000 letters, then 16 additional hours for 20,000 additional letters. Recall (the proportion of PHI that is removed) is 98.1%, precision (the proportion of PHI among the removed tokens) is 79.6%, and the F-measure (their harmonic mean) is 87.9%. On average, 30.6 terminology codes are assigned per letter, and 99.02% of those codes are preserved after de-identification.

CONCLUSION FASDIM achieves good results on French records and is freely available. It is easy to implement and does not require any predefined dictionary.
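The filtering rule described in METHODS can be illustrated with a minimal Python sketch. This is not the authors' implementation: the word list, the two protection patterns, the tokenizer, and the `XXX` placeholder are all hypothetical choices made for illustration (the abstract only says that unauthorized words and unprotected numbers are removed). The small `f_measure` helper simply checks that the reported precision and recall are consistent with the reported F-measure.

```python
import re

# Hypothetical authorized-word list; in FASDIM it is built iteratively by reviewers.
AUTHORIZED_WORDS = {
    "le", "patient", "né", "a", "reçu", "mg", "du",
    "paracétamol", "pendant", "jours",
}

# Hypothetical protection patterns: a number is kept only if it lies
# inside a match of one of these patterns (here: doses and durations).
PROTECTION_PATTERNS = [
    re.compile(r"\d+(?:[.,]\d+)?\s*mg\b", re.IGNORECASE),
    re.compile(r"\d+\s*jours?\b", re.IGNORECASE),
]

# Assumed tokenizer: numbers (possibly with . , / separators) or runs of letters.
# Whitespace and punctuation are left untouched.
TOKEN = re.compile(r"(?P<num>\d+(?:[.,/]\d+)*)|(?P<word>[^\W\d_]+)")


def deidentify(text, placeholder="XXX"):
    """Replace unauthorized words and unprotected numbers with a placeholder."""
    protected = [m.span() for p in PROTECTION_PATTERNS for m in p.finditer(text)]

    def replace(match):
        token = match.group()
        if match.lastgroup == "num":
            keep = any(s <= match.start() and match.end() <= e for s, e in protected)
        else:
            keep = token.lower() in AUTHORIZED_WORDS
        return token if keep else placeholder

    return TOKEN.sub(replace, text)


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


letter = "Le patient Dupont, né le 12/03/1950, a reçu 500 mg du paracétamol pendant 3 jours."
print(deidentify(letter))
# Le patient XXX, né le XXX, a reçu 500 mg du paracétamol pendant 3 jours.
print(round(f_measure(0.796, 0.981), 3))
# 0.879
```

In this sketch, anything absent from the lists is removed by default, which favors recall over precision; that is consistent with the reported figures (98.1% recall, 79.6% precision). Extending the lists with legitimate medical vocabulary is what would reduce unnecessary removals.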
