Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are mainly in English. Little effort has been reported for other languages including Romanian and, thus, access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed in three medical subdomains: cardiology, diabetes and endocrinology, extracted from books, journals and blogposts. In order to automatically annotate the corpus with POS tags, we used a Romanian tag set which has 715 labels, while diseases, anatomy, procedures and chemicals and drugs labels were manually annotated for bioNER with a Cohen Kappa coefficient of 92.8% and revealed the occurrence of 1877 medical named entities. The automatic annotation of the corpus has been manually checked. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.

[1]  Tomaz Erjavec,et al.  MULTEXT-East: morphosyntactic resources for Central and Eastern European languages , 2011, Language Resources and Evaluation.

[2]  Zhiyong Lu,et al.  The CHEMDNER corpus of chemicals and drugs and its annotation principles , 2015, Journal of Cheminformatics.

[3]  Zhiyong Lu,et al.  NCBI disease corpus: A resource for disease name recognition and concept normalization , 2014, J. Biomed. Informatics.

[4]  Jaewoo Kang,et al.  BRONCO: Biomedical entity Relation ONcology COrpus for extracting gene-variant-disease-drug relations , 2016, Database J. Biol. Databases Curation.

[5]  Pierre Zweigenbaum,et al.  Clinical Natural Language Processing in languages other than English: opportunities and challenges , 2018, Journal of Biomedical Semantics.

[6]  Jean Carletta,et al.  Assessing Agreement on Classification Tasks: The Kappa Statistic , 1996, CL.

[7]  Jari Björne,et al.  BioInfer: a corpus for information extraction in the biomedical domain , 2007, BMC Bioinformatics.

[8]  Karin M. Verspoor,et al.  Annotating the biomedical literature for the human variome , 2013, Database J. Biol. Databases Curation.

[9]  Renata Vieira,et al.  A Corpus-based Investigation of Definite Description Use , 1997, CL.

[10]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[11]  Paloma Moreda,et al.  DrugSemantics: A corpus for Named Entity Recognition in Spanish Summaries of Product Characteristics , 2017, J. Biomed. Informatics.