Fraunhofer AICOS at CLEF eHealth 2020 Task 1: Clinical Code Extraction From Textual Data Using Fine-Tuned BERT Models

Nosology is the branch of medical science concerned with the classification and coding of diseases, conditions, procedures, and other medical information. Coding is a vital task for all stakeholders in the health sector, from hospitals and health regulators to insurance companies and governments. The ICD10 system, the current revision of a nosology system managed by the World Health Organization, is widely used internationally. Because medical coding relies on manual analysis of clinical text, it is ripe for automation, and Natural Language Processing (NLP) techniques are used to address this challenge. This paper describes our contribution to the CLEF eHealth 2020 Task 1 challenge on the extraction of ICD10 codes from unstructured Spanish clinical text. We present two approaches to ICD10 code extraction, one based on Conditional Random Fields (CRFs) and one on the BERT deep learning language model. The BERT-based methodology achieved a mean average precision of 0.517 and 0.445 for ICD10-CM and ICD10-PCS codes, respectively, and an F1 score of 0.505 for the Explainable AI subtask. These results show the flexibility and robustness of pre-trained deep learning models for NLP, which only require fine-tuning for a particular task, reducing the requirements for both labelled data and computational effort.
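To make the ranking metric reported above concrete, the following is a minimal sketch of how mean average precision can be computed over ranked code predictions. It is not the official evaluation script of the task: the function names, the normalisation by the number of gold codes per document, and the example codes are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], gold: Set[str]) -> float:
    """Average precision for one document.

    `ranked` is the list of predicted codes, best first (assumed to
    contain no duplicates); `gold` is the set of correct codes.
    Normalising by len(gold) is an assumption of this sketch.
    """
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, code in enumerate(ranked, start=1):
        if code in gold:
            hits += 1
            # Precision at this rank, counted only at correct predictions.
            precision_sum += hits / rank
    return precision_sum / len(gold)


def mean_average_precision(
    predictions: Dict[str, List[str]],
    references: Dict[str, Set[str]],
) -> float:
    """Mean of the per-document average precision over all gold documents."""
    scores = [
        average_precision(predictions.get(doc_id, []), gold)
        for doc_id, gold in references.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: document ids and codes are illustrative only.
predictions = {
    "doc1": ["n20.0", "r10.9", "k35.80"],
    "doc2": ["i10", "e11.9"],
}
references = {
    "doc1": {"n20.0", "k35.80"},
    "doc2": {"e11.9"},
}

map_score = mean_average_precision(predictions, references)
print(f"MAP = {map_score:.3f}")
```

In this toy case the first document scores (1/1 + 2/3) / 2 and the second scores (1/2) / 1, so ranking a correct code lower reduces the score even when every gold code is eventually retrieved.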
