Pre-trained language models to extract information from radiological reports

This paper describes the participation of the SINAI team in the SpRadIE challenge (Information Extraction from Spanish Radiology Reports), which consists of identifying biomedical entities related to the radiological domain. Many shared tasks have focused on extracting relevant information from clinical texts, but no previous task has centered on radiology with Spanish as the main language. Automatically detecting relevant information in biomedical texts is crucial because current health information systems are not prepared to analyze and extract this knowledge, and processing it manually is costly and time-consuming. To accomplish this task, we propose two approaches based on pre-trained models using the BERT architecture. Specifically, we use a multi-class classification model, a binary classification model and a pipeline model for entity identification. The results are encouraging: the binary system obtained an F1-score of 73.7%, which is above the average of the participants.
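As a rough illustration of the difference between the multi-class and binary formulations mentioned above, the following minimal sketch recasts one multi-class token labeling as one binary labeling per entity type and scores a prediction with F1. It is not the SINAI system: the label names, the example sentence, and the helper functions `to_binary_tasks` and `f1_score` are hypothetical, and the scoring here is token-level and need not match the official SpRadIE evaluation.

```python
from typing import Dict, List


def to_binary_tasks(tags: List[str], entity_types: List[str]) -> Dict[str, List[int]]:
    """Turn one multi-class tag sequence into one 0/1 sequence per entity type."""
    return {
        etype: [1 if tag == etype else 0 for tag in tags]
        for etype in entity_types
    }


def f1_score(gold: List[int], pred: List[int]) -> float:
    """Token-level F1 for a single binary labeling (1 = token belongs to the entity)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: label names are illustrative, not the official tag set.
tokens = ["quiste", "en", "riñón", "derecho"]
tags = ["Finding", "O", "Anatomical_Entity", "Anatomical_Entity"]

binary = to_binary_tasks(tags, ["Finding", "Anatomical_Entity"])

# A made-up prediction for the "Anatomical_Entity" binary task.
predicted = [0, 1, 1, 1]
score = f1_score(binary["Anatomical_Entity"], predicted)
print(binary)
print(round(score, 3))
```

In this framing each binary sequence could be learned by a separate classifier, whereas the multi-class formulation assigns a single label per token with one model.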
