Developing a standard for de-identifying electronic patient records written in Swedish: Precision, recall and F-measure in a manual and computerized annotation trial

BACKGROUND Electronic patient records (EPRs) contain a large amount of information written in free text. This information is considered very valuable for research but is also very sensitive since the free text parts may contain information that could reveal the identity of a patient. Therefore, methods for de-identifying EPRs are needed. The work presented here aims to perform a manual and automatic Protected Health Information (PHI)-annotation trial for EPRs written in Swedish. METHODS This study consists of two main parts: the initial creation of a manually PHI-annotated gold standard, and the porting and evaluation of an existing de-identification software written for American English to Swedish in a preliminary automatic de-identification trial. Results are measured with precision, recall and F-measure. RESULTS This study reports fairly high Inter-Annotator Agreement (IAA) results on the manually created gold standard, especially for specific tags such as names. The average IAA over all tags was 0.65 F-measure (0.84 F-measure highest pairwise agreement). For name tags the average IAA was 0.80 F-measure (0.91 F-measure highest pairwise agreement). Porting a de-identification software written for American English to Swedish directly was unfortunately non-trivial, yielding poor results. CONCLUSION Developing gold standard sets as well as automatic systems for de-identification tasks in Swedish is feasible. However, discussions and definitions on identifiable information is needed, as well as further developments both on the tag sets and the annotation guidelines, in order to get a reliable gold standard. A completely new de-identification software needs to be developed.

[1]  Ron Artstein,et al.  Survey Article: Inter-Coder Agreement for Computational Linguistics , 2008, CL.

[2]  Hagit Shatkay,et al.  New directions in biomedical text annotation: definitions, guidelines and corpus construction , 2006, BMC Bioinformatics.

[3]  Peter Szolovits,et al.  Evaluating the state-of-the-art in automatic de-identification. , 2007, Journal of the American Medical Informatics Association : JAMIA.

[4]  Özlem Uzuner,et al.  Role of Local Context in Automatic Deidentification of Ungrammatical, Fragmented Text , 2006, NAACL.

[5]  Sumithra Velupillai,et al.  Diagnosing Diagnoses in Swedish Clinical Records , 2008 .

[6]  Peter Szolovits,et al.  A de-identifier for medical discharge summaries , 2008, Artif. Intell. Medicine.

[7]  Fredrik Olsson,et al.  Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora , 2008 .

[8]  Peter Szolovits,et al.  Automated de-identification of free-text medical records , 2008, BMC Medical Informatics Decis. Mak..

[9]  Inderjeet Mani,et al.  Protein Name Tagging Guidelines: Lessons Learned , 2005, Comparative and functional genomics.

[10]  Dimitrios Kokkinakis,et al.  Identification of Entity References in Hospital Discharge Letters , 2007, NODALIDA.

[11]  Hercules Dalianis,et al.  SweNam-A Swedish Named Entity recognizer Its construction, training and evaluation , 2001 .

[12]  Pierre Zweigenbaum,et al.  Testing Tactics to Localize De-Identification , 2009, MIE.

[13]  Róbert Busa-Fekete,et al.  State-of-the-art anonymization of medical records using an iterative machine learning framework. , 2007 .

[14]  Philip V. Ogren,et al.  Knowtator: A Protégé plug-in for annotated corpus construction , 2006, NAACL.

[15]  Beth Sundheim,et al.  MUC-5 Evaluation Metrics , 1993, MUC.

[16]  Alfonso Valencia,et al.  Overview of BioCreAtIvE: critical assessment of information extraction for biology , 2005, BMC Bioinformatics.

[17]  L. Sweeney Replacing personally-identifying information in medical records, the Scrub system. , 1996, Proceedings : a conference of the American Medical Informatics Association. AMIA Fall Symposium.