De-identifying Swedish clinical text - refinement of a gold standard and experiments with Conditional random fields

Background: Research on the information contained in Electronic Patient Records (EPRs) requires access to the data itself. Access is often very difficult to obtain because of confidentiality regulations, so data sets need to be fully de-identified before they can be distributed to researchers. De-identification is a difficult task in which the definitions of the annotation classes are not self-evident.

Results: We present work on the creation of two refined variants of a manually annotated gold standard for de-identification: one created automatically, and one created through discussions among the annotators. The data is a subset of the Stockholm EPR Corpus, a data set available within our research group. The two gold standards are used to train and evaluate an automatic system based on the Conditional Random Fields algorithm. Evaluating with four-fold cross-validation on sets of approximately 4,000 to 6,000 annotation instances, we obtained very promising results for both gold standards: F-scores of around 0.80 in a number of experiments, with higher scores for certain annotation classes. Moreover, 49 instances initially counted as false positives were verified to be true positives: they were found by the system but had been missed by the annotators.

Conclusions: Our intention is to make this gold standard, the Stockholm EPR PHI Corpus, available to other research groups in the future. Although it is slightly more time-consuming to create, we believe the manual consensus gold standard is the more valuable of the two for further research. We also propose a set of annotation classes to be used for similar de-identification tasks.
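To make the evaluation terms concrete, here is a minimal Python sketch of how precision, recall and F-score can be computed per fold in a four-fold cross-validation. It is an illustration only and not the authors' evaluation code: the span representation, the exact-match criterion, the class labels in the usage note, and the function names `precision_recall_f1` and `four_fold_splits` are all assumptions made for this example.

```python
from typing import List, Set, Tuple

# Hypothetical representation of one annotation instance:
# (start token index, end token index, class label).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Score predictions against a gold standard using exact span and class match (assumed criterion)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def four_fold_splits(n_items: int, k: int = 4) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

As a usage example with made-up labels, a gold set `{(0, 1, "Name"), (5, 6, "Location"), (9, 9, "Age")}` scored against predictions `{(0, 1, "Name"), (5, 6, "Location"), (12, 12, "Age"), (20, 21, "Date")}` gives two true positives, hence precision 2/4, recall 2/3 and F-score 4/7. Under this scheme, a predicted span absent from the gold standard counts as a false positive even when it is a genuine identifier that the annotators overlooked, which is the situation the abstract reports for 49 instances.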
