SIFR annotator: ontology-based semantic annotation of French biomedical text and clinical notes

Background: Despite the wide adoption of English in science, a significant amount of biomedical data is produced in other languages, such as French. Yet most natural language processing and semantic tools, as well as domain terminologies and ontologies, are available only in English and cannot be readily applied to other languages because of fundamental linguistic differences. However, semantic resources are required to design semantic indexes and to transform biomedical (text) data into knowledge for better information mining and retrieval.

Results: We present the SIFR Annotator (http://bioportal.lirmm.fr/annotator), a publicly accessible ontology-based annotation web service for processing biomedical text data in French. The service, developed during the Semantic Indexing of French Biomedical Data Resources (2013–2019) project, is included in the SIFR BioPortal, an open platform that hosts French biomedical ontologies and terminologies and is based on the technology developed by the US National Center for Biomedical Ontology. The portal facilitates the use and fostering of ontologies by offering a set of services (search, mappings, metadata, versioning, visualization, recommendation), including for annotation purposes. We introduce the adaptations and improvements made in applying the technology to French, as well as a number of language-independent additional features, implemented by means of a proxy architecture, in particular annotation scoring and clinical context detection. We evaluate the performance of the SIFR Annotator on different biomedical data using available French corpora, namely Quaero (titles from French MEDLINE abstracts and EMEA drug labels) and CépiDC (ICD-10 coding of death certificates), and discuss our results with respect to the CLEF eHealth information extraction tasks.

Conclusions: We show that the web service performs comparably to other knowledge-based annotation approaches in recognizing entities in biomedical text and reaches state-of-the-art levels in clinical context detection (negation, experiencer, temporality). Additionally, the SIFR Annotator is the first openly web-accessible tool to annotate and contextualize French biomedical text with ontology concepts, leveraging a dictionary currently made of 28 terminologies and ontologies and 333 K concepts. The code is openly available, and we also provide a Docker packaging for easy local deployment to process sensitive (e.g., clinical) data in-house (https://github.com/sifrproject).
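
To make the idea of dictionary-based annotation with negation detection concrete, here is a minimal toy sketch in Python. It is not the SIFR Annotator's implementation or API: the dictionary entries, the `EX:` concept identifiers, the trigger list, and the function names are all hypothetical and chosen only for illustration. Offsets refer to the normalized (lowercased, accent-stripped) text.

```python
import re
import unicodedata

# Hypothetical toy dictionary: surface term -> made-up concept identifier.
# A real dictionary would be built from ontology labels and synonyms.
DICTIONARY = {
    "infarctus du myocarde": "EX:0001",
    "diabete": "EX:0002",
    "fievre": "EX:0003",
}

# Hypothetical negation triggers (a tiny illustrative subset).
NEGATION_TRIGGERS = ("pas de", "absence de", "sans")


def normalize(text):
    """Lowercase the text and strip diacritics so 'Diabète' matches 'diabete'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def annotate(text, window=20):
    """Return dictionary matches, flagging those directly preceded by a trigger."""
    norm = normalize(text)
    annotations = []
    for term, concept in DICTIONARY.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, norm):
            # Look only at a short window of text to the left of the match.
            left = norm[max(0, match.start() - window):match.start()]
            negated = any(
                re.search(r"\b" + re.escape(trigger) + r"\s*$", left)
                for trigger in NEGATION_TRIGGERS
            )
            annotations.append(
                {
                    "concept": concept,
                    "start": match.start(),
                    "end": match.end(),
                    "negated": negated,
                }
            )
    return sorted(annotations, key=lambda a: a["start"])


if __name__ == "__main__":
    for annotation in annotate("Diabète connu, pas de fièvre."):
        print(annotation)
```

This sketch only treats a mention as negated when a trigger immediately precedes it; handling of trigger scope, experiencer, and temporality, as well as annotation scoring, is deliberately left out.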