Intrinsic and Extrinsic Evaluation of the Quality of Biomedical Embeddings in Different Languages

Language models have recently been applied to several tasks in biomedical natural language processing. Several public language models are available online, each built from a different corpus. In this paper, we evaluate public word embedding models trained on general and biomedical corpora for English and Portuguese. We present intrinsic evaluations based on semantic analogies, using word pairs extracted from the MeSH biomedical thesaurus as well as from benchmarks available for general-domain evaluation. For extrinsic evaluation, we rely on a classification task over electronic health records. Our experiments show that biomedical embeddings better capture the semantics of biomedical analogies in both languages. For extrinsic evaluation, on the other hand, embeddings trained on larger general-domain corpora appeared equally or more effective in the classification task.
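As an illustration of what an analogy-based intrinsic evaluation involves, the following is a minimal sketch assuming the common vector-offset method: for a question "a is to b as c is to ?", the predicted answer is the vocabulary word whose vector is closest in cosine similarity to vec(b) - vec(a) + vec(c). The abstract does not state the exact scoring procedure, so this method is an assumption. The function names, the toy three-dimensional vectors, and the word pairs are all hypothetical and are not taken from MeSH or from the evaluated models.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(emb, a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


def analogy_accuracy(emb, questions):
    """Fraction of analogy questions answered correctly.

    Questions containing an out-of-vocabulary word are skipped.
    """
    answerable = [q for q in questions if all(w in emb for w in q)]
    if not answerable:
        return 0.0
    correct = sum(solve_analogy(emb, a, b, c) == d for a, b, c, d in answerable)
    return correct / len(answerable)


# Hypothetical toy embeddings, invented for illustration only.
toy_embeddings = {
    "heart": [1.0, 0.0, 0.0],
    "cardiology": [1.0, 1.0, 0.0],
    "kidney": [0.0, 0.0, 1.0],
    "nephrology": [0.0, 1.0, 1.0],
    "aspirin": [0.2, 0.1, 0.3],
}

# Hypothetical analogy questions in (a, b, c, expected) form.
toy_questions = [
    ("heart", "cardiology", "kidney", "nephrology"),
    ("kidney", "nephrology", "heart", "cardiology"),
]

accuracy = analogy_accuracy(toy_embeddings, toy_questions)
print(f"analogy accuracy: {accuracy:.2f}")
```

In a real evaluation, the dictionary would be replaced by a pretrained embedding model and the question list by analogy sets built from a thesaurus or a general-domain benchmark. Comparing the resulting accuracy across models gives one intrinsic quality score per model.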
