The Materials Science Procedural Text Corpus: Annotating Materials Synthesis Procedures with Shallow Semantic Structures

Materials science literature contains millions of materials synthesis procedures described in unstructured natural language text. Large-scale analysis of these synthesis procedures would facilitate deeper scientific understanding of materials synthesis and enable automated synthesis planning. As a first step, such analysis requires extracting structured representations of synthesis procedures from the raw text. To facilitate the training and evaluation of synthesis extraction models, we introduce a dataset of 230 synthesis procedures annotated by domain experts with labeled graphs that express the semantics of the synthesis sentences. The nodes in each graph are synthesis operations and their typed arguments, and labeled edges specify relations between the nodes. We describe this new resource in detail and highlight specific challenges in annotating scientific text with shallow semantic structure. We make the corpus available to the community to promote further research and development of scientific information extraction systems.
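The annotation structure described above (nodes for operations and their typed arguments, joined by labeled edges) can be pictured with a small sketch. This is only an illustration under stated assumptions: the class names, the node types ("Operation", "Material", "Condition"), the edge labels ("acts_on", "has_condition"), and the example sentence are all hypothetical and are not taken from the corpus or its annotation schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A text span: a synthesis operation or one of its typed arguments."""
    node_id: str
    text: str
    node_type: str  # hypothetical type label, e.g. "Operation" or "Material"


@dataclass(frozen=True)
class Edge:
    """A labeled, directed relation between two nodes."""
    source: str
    target: str
    label: str  # hypothetical relation label


@dataclass
class SynthesisGraph:
    """A labeled graph over the spans of one or more synthesis sentences."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        # Reject edges whose endpoints have not been added yet.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append(edge)

    def arguments_of(self, operation_id):
        """Return (edge label, argument node) pairs leaving an operation node."""
        return [
            (e.label, self.nodes[e.target])
            for e in self.edges
            if e.source == operation_id
        ]


def build_example():
    # Invented sentence for illustration: "The powder was heated at 500 C."
    graph = SynthesisGraph()
    graph.add_node(Node("n1", "heated", "Operation"))
    graph.add_node(Node("n2", "powder", "Material"))
    graph.add_node(Node("n3", "500 C", "Condition"))
    graph.add_edge(Edge("n1", "n2", "acts_on"))
    graph.add_edge(Edge("n1", "n3", "has_condition"))
    return graph
```

Calling `build_example().arguments_of("n1")` returns the two labeled arguments attached to the operation "heated". A real loader for the corpus would need to follow the label inventory and file format documented with the released data.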
