Cross-lingual semantic annotation of biomedical literature: experiments in Spanish and English

MOTIVATION Biomedical literature is one of the most relevant sources of information for knowledge mining in the field of bioinformatics. Although English is the most widely addressed language in the field, the natural language processing community has shown growing interest in recent years in languages other than English. However, the availability of language resources and tools for the appropriate treatment of non-English texts is lagging behind. Our research is concerned with the semantic annotation of biomedical texts in Spanish, which can be considered an under-resourced language where biomedical text processing is concerned.

RESULTS We have carried out experiments to assess the effectiveness of several methods for the automatic annotation of biomedical texts in Spanish. The first method is based on the linguistic analysis of Spanish texts and their annotation using an information retrieval and concept disambiguation approach. The second method uses Spanish-English machine translation to annotate the translated English documents and transfer the annotations back to Spanish. The third method combines both procedures. Our evaluation shows that the combined system has competitive advantages over the two individual procedures.

AVAILABILITY UMLSmapper (https://snlt.vicomtech.org/umlsmapper) and the annotation transfer tool (http://scientmin.taln.upf.edu/anntransfer) are freely available for research purposes as web services and/or demos.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
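
As an illustration of what combining the two annotation procedures could look like, the following is a minimal sketch only. The abstract does not state how the combined system merges its inputs, so the union-with-deduplication strategy, the `Annotation` type, the `combine` function, and the example concept identifiers are all assumptions made for this example, not the authors' method.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record: a character span mapped to a concept ID."""
    start: int   # inclusive character offset in the Spanish text
    end: int     # exclusive character offset
    cui: str     # concept identifier (illustrative values below)
    source: str  # which procedure produced the annotation


def combine(native: Iterable[Annotation],
            transferred: Iterable[Annotation]) -> List[Annotation]:
    """Union of both annotation sets (an assumed strategy).

    Annotations that agree on span and concept are kept once; the copy
    from the native Spanish annotator wins because it is seen first.
    """
    merged = {}
    for ann in list(native) + list(transferred):
        key = (ann.start, ann.end, ann.cui)
        merged.setdefault(key, ann)  # keep the first annotation seen per key
    return sorted(merged.values(), key=lambda a: (a.start, a.end, a.cui))


text = "dolor de cabeza y fiebre"

# Output of a (hypothetical) native Spanish annotator.
native = [Annotation(0, 15, "C0018681", "native")]

# Output of a (hypothetical) translate-annotate-transfer pipeline.
transferred = [
    Annotation(0, 15, "C0018681", "transfer"),
    Annotation(18, 24, "C0015967", "transfer"),
]

combined = combine(native, transferred)
for ann in combined:
    print(text[ann.start:ann.end], ann.cui, ann.source)
```

A union favours recall, since a concept found by either procedure is kept; an intersection would favour precision instead. Which trade-off the evaluated system makes is not stated in the abstract.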
