Using natural language processing techniques to extract information on the properties and functionalities of energetic materials from large text corpora

The number of scientific journal articles and reports published about energetic materials each year is growing exponentially, so extracting relevant information and actionable insights from the latest research is becoming a considerable challenge. In this work, we explore how techniques from natural language processing (NLP) and machine learning can be used to automatically extract chemical insights from large collections of documents. We first describe how to download and process documents from a variety of sources: journal articles, conference proceedings (including NTREM), the US Patent & Trademark Office, and the Defense Technical Information Center archive. We present a custom NLP pipeline that uses open-source NLP tools to identify the names of chemical compounds and relate them to function words ("underwater", "rocket", "pyrotechnic") and property words ("elastomer", "non-toxic"). After explaining how word embeddings work, we compare the utility of two popular word embeddings, word2vec and GloVe. Chemical-chemical and chemical-application relationships are obtained by doing computations with word vectors. We show that word embeddings capture latent information about energetic materials, so that related materials appear close together in the word embedding space.
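
As a rough illustration of what "computations with word vectors" can mean, the sketch below ranks words by cosine similarity to a query word. It is a minimal sketch, not the pipeline described in this work: the three-dimensional vectors, the vocabulary, and the helper names `cosine_similarity` and `nearest_neighbors` are all hypothetical, and real word2vec or GloVe vectors would be learned from a corpus rather than written by hand.

```python
import math

# Hypothetical, hand-written 3-dimensional vectors for illustration only.
# Learned word2vec or GloVe embeddings typically have tens to hundreds of
# dimensions and come from training on a large text corpus.
TOY_VECTORS = {
    "rdx": [0.9, 0.1, 0.2],
    "hmx": [0.8, 0.2, 0.1],
    "tnt": [0.7, 0.1, 0.4],
    "rocket": [0.1, 0.9, 0.2],
    "elastomer": [0.1, 0.2, 0.9],
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(word, vectors, k=2):
    """Return the k words most similar to `word`, best match first."""
    query = vectors[word]
    scores = [
        (other, cosine_similarity(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


if __name__ == "__main__":
    # Print the two closest words to "rdx" with their similarity scores.
    for name, score in nearest_neighbors("rdx", TOY_VECTORS):
        print(f"{name}: {score:.3f}")
```

With these toy vectors the compound names cluster together while the function word and the property word score much lower, which mirrors the idea that related materials lie close together in the embedding space.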
