Named Entity Recognition and Normalization Applied to Large-Scale Information Extraction from the Materials Science Literature

The number of published materials science articles has increased manyfold over the past few decades. A major bottleneck in the materials discovery pipeline now arises in connecting new results with the previously established literature. A potential solution to this problem is to map the unstructured raw text of published articles onto structured database entries that allow for programmatic querying. To this end, we apply text mining with named entity recognition (NER) for large-scale information extraction from the published materials science literature. The NER model is trained to extract summary-level information from materials science documents, including inorganic material mentions, sample descriptors, phase labels, material properties and applications, as well as any synthesis and characterization methods used. Our classifier achieves an accuracy (F1 score) of 87% and is applied to information extraction from 3.27 million materials science abstracts. We extract more than 80 million materials-science-related named entities, and the content of each abstract is represented as a database entry in a structured format. We demonstrate that simple database queries can be used to answer complex "meta-questions" about the published literature that would previously have required laborious, manual literature searches. All of our data and functionality have been made freely available (https://github.com/materialsintelligence/matscholar), and we expect these results to accelerate the pace of future materials science discovery.
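To make the idea of answering a "meta-question" with a simple query more concrete, the following is a minimal sketch in Python. It assumes each abstract has already been reduced to a record with one field per entity type named in the abstract. The `AbstractEntry` class, the `materials_for` function, the field names, and the toy records are all hypothetical illustrations; they are not the matscholar API or the authors' database schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class AbstractEntry:
    """Hypothetical structured record for one abstract (not the matscholar schema)."""
    doc_id: str
    materials: set = field(default_factory=set)
    descriptors: set = field(default_factory=set)
    phase_labels: set = field(default_factory=set)
    properties: set = field(default_factory=set)
    applications: set = field(default_factory=set)
    synthesis_methods: set = field(default_factory=set)
    characterization_methods: set = field(default_factory=set)


def materials_for(entries, application=None, prop=None):
    """Count materials in entries that mention the given application and/or property."""
    counts = Counter()
    for entry in entries:
        if application is not None and application not in entry.applications:
            continue
        if prop is not None and prop not in entry.properties:
            continue
        counts.update(entry.materials)
    # Most frequently co-mentioned materials first.
    return counts.most_common()


# Toy records, invented purely for illustration (not extracted data).
entries = [
    AbstractEntry("toy-1", materials={"PbTe"}, applications={"thermoelectric"}),
    AbstractEntry("toy-2", materials={"PbTe", "SnTe"}, applications={"thermoelectric"}),
    AbstractEntry("toy-3", materials={"BiFeO3"}, properties={"ferroelectric"}),
]

# "Which materials are most often mentioned alongside thermoelectric applications?"
print(materials_for(entries, application="thermoelectric"))
# [('PbTe', 2), ('SnTe', 1)]
```

In this sketch the query is a plain filter-and-count over records. At the scale described in the abstract, the same question would more plausibly be posed to a database with indexed entity fields, but the shape of the query would be similar.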
