Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents

In this paper, we provide an overview of the Cheminformatics Elsevier Melbourne University (ChEMU) evaluation lab 2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF2020). The ChEMU evaluation lab focuses on information extraction over chemical reactions from patent texts. Using the ChEMU corpus of 1500 “snippets” (text segments) sampled from 170 patent documents and annotated by chemical experts, we defined two key information extraction tasks. Task 1 addresses chemical named entity recognition: the identification of chemical compounds and their specific roles in chemical reactions. Task 2 focuses on event extraction: the identification of reaction steps and how they relate the chemical compounds involved in a chemical reaction. Herein, we describe the resources created for these tasks and the evaluation methodology adopted. We also provide a brief summary of the participants in this lab and the results obtained across 46 runs from 11 teams, finding that several submissions achieve substantially better results than our baseline methods.
