MatScIE: An automated tool for the generation of databases of methods and parameters used in the computational materials science literature

The number of articles published in materials science grows rapidly every year. This literature is a large but comparatively unstructured source of information, and its reusability is limited because the information needed to carry out further calculations must be extracted manually. Obtaining valid, contextually correct information from these articles, whether they are available online or offline, is important: it can be used both to generate inputs for further calculations and to populate a querying framework. With the preservation of this context as a priority, we have developed an automated tool, MatScIE (Material Science Information Extractor), that extracts relevant information from the materials science literature and builds a structured database that is much easier to use for materials simulations. Specifically, we extract the material details, methods, code, parameters, and structure from research articles. Finally, we created a web application where users can upload published articles, view and download the information extracted by the tool, and create their own databases for personal use.
