Detecting Chemical Reactions in Patents

Extracting chemical reactions from patents is a crucial task for chemists working on chemical exploration. In this paper we introduce the novel task of detecting the textual spans that describe or refer to chemical reactions within patents. We formulate this task as a paragraph-level sequence tagging problem, where the system is required to return a sequence of paragraphs that contain a description of a reaction. To address this new task, we construct an annotated dataset from an existing proprietary database of chemical reactions manually extracted from patents. We introduce several baseline methods for the task and evaluate them over our dataset. Through error analysis, we discuss what makes the task complex and challenging, and suggest possible directions for future research.
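To make the paragraph-level formulation concrete, the sketch below shows one way a per-paragraph tag sequence could be decoded into reaction spans. It assumes a B/I/O tagging scheme (B opens a reaction description, I continues it, O is outside any reaction). The abstract does not specify the label set, so the scheme, the function name `tags_to_spans`, and the example tags are illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple


def tags_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Decode per-paragraph B/I/O tags into inclusive (start, end) paragraph spans.

    Assumed scheme (hypothetical, not taken from the paper):
      "B" = first paragraph of a reaction description
      "I" = continuation of the current reaction description
      "O" = paragraph outside any reaction description
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index of the first paragraph of the currently open span
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new reaction begins; close any span that is still open.
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == "I":
            # Tolerate an "I" with no preceding "B" by opening a span here.
            if start is None:
                start = i
        else:
            # "O" closes the open span, if there is one.
            if start is not None:
                spans.append((start, i - 1))
                start = None
    # Close a span that runs to the end of the document.
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans


# One tag per paragraph of a hypothetical patent.
paragraph_tags = ["O", "B", "I", "I", "O", "B", "B", "I"]
print(tags_to_spans(paragraph_tags))
```

Under this assumed scheme, two adjacent reaction descriptions stay separable because each one opens with its own B tag, which a plain binary in/out label per paragraph could not express.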
