SC-CoMIcs: A Superconductivity Corpus for Materials Informatics

This paper describes a novel corpus tailored for the text mining of superconducting materials in Materials Informatics (MI), named the SuperConductivity Corpus for Materials Informatics (SC-CoMIcs). Unlike in biomedical informatics, very few corpora target Materials Science and Engineering (MSE). In particular, there is no sizable corpus that can be used to assist the search for superconducting materials. A team of materials scientists and natural language processing experts jointly designed the annotation scheme and constructed a corpus of 1,000 manually annotated MSE abstracts related to superconductivity. We conducted experiments on the corpus with a neural Named Entity Recognition (NER) tool. The experimental results show that NER performance on the corpus is around 77% in terms of micro-F1, which is comparable to human inter-annotator agreement rates. Using the trained NER model, we automatically annotated 9,000 abstracts and created a term retrieval tool based on term similarity. This tool finds superconductivity terms relevant to a query term within a specified Named Entity category. It demonstrates how SC-CoMIcs can efficiently provide knowledge for Materials Informatics applications from a rapidly expanding body of publications.
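The category-restricted term retrieval described above might be sketched as follows. This is a minimal illustration only, not the authors' implementation: it assumes that each term has an embedding vector and that similarity is measured by cosine similarity, neither of which the abstract specifies. The term list, the category names (`Material`, `Property`), the vectors, and the function names are all hypothetical.

```python
import math

# Hypothetical toy data: term -> (Named Entity category, embedding vector).
# A real system would take terms from the NER-annotated abstracts and
# vectors from a trained embedding model.
TERMS = {
    "YBa2Cu3O7": ("Material", [0.9, 0.1, 0.0]),
    "MgB2": ("Material", [0.7, 0.3, 0.1]),
    "critical temperature": ("Property", [0.1, 0.9, 0.2]),
    "upper critical field": ("Property", [0.2, 0.8, 0.4]),
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query, category, terms=TERMS, top_k=5):
    """Return up to top_k terms of the given category, most similar first."""
    _, query_vec = terms[query]
    scored = [
        (cosine(query_vec, vec), term)
        for term, (cat, vec) in terms.items()
        if cat == category and term != query  # restrict to the requested category
    ]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_k]]


if __name__ == "__main__":
    print(retrieve("YBa2Cu3O7", "Property"))
```

Filtering by category before ranking means a query for a material can return only the properties closest to it, rather than other materials that would otherwise dominate the ranking.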
