Using word embeddings in abstracts to accelerate metallocene catalysis polymerization research

Abstract Natural language processing (NLP) and neural networks trained to produce word embeddings were investigated as a more efficient way to extract useful information about catalytic polymerizations. Thousands of abstracts on metallocene-catalyzed polymerizations were retrieved through journal Application Programming Interfaces. These abstracts were then used to train a group of related word-embedding models with the word2vec algorithm, which maps vocabulary terms to high-dimensional vectors through unsupervised training. The resulting vectors can be used to reveal relationships between chemicals, suggest catalyst and activator combinations, resolve acronyms, and categorize chemical compounds by reagent class. We hypothesize that understudied areas of metallocene catalysis can be identified by comparing model-predicted catalyst combinations against those reported in existing abstracts, thereby guiding research toward major breakthroughs as the scientific literature continues to grow.
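The workflow described in the abstract can be sketched with an open-source word2vec implementation such as gensim. The snippet below is a minimal, illustrative example only: the abstracts, tokenization, hyperparameters, and query terms are hypothetical stand-ins, not the authors' actual corpus or settings, and it assumes gensim >= 4.0.

```python
# Minimal sketch: train word2vec on polymerization abstracts and query the
# resulting vectors for related reagents and analogous catalyst/activator pairs.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

# Hypothetical corpus: each item stands in for one abstract retrieved via a journal API.
abstracts = [
    "Ethylene polymerization with zirconocene dichloride activated by MAO ...",
    "Propylene polymerization using ansa-metallocene catalysts and borate activators ...",
]

# Tokenize each abstract into lowercase word tokens.
sentences = [simple_preprocess(text) for text in abstracts]

# Train a skip-gram word2vec model; hyperparameters here are illustrative,
# and min_count is kept at 1 only because this toy corpus is tiny.
model = Word2Vec(
    sentences=sentences,
    vector_size=200,   # dimensionality of the word vectors
    window=8,          # context window size
    min_count=1,       # raise this for a real corpus of thousands of abstracts
    sg=1,              # skip-gram rather than CBOW
    workers=4,
)

# Nearest neighbours in vector space can surface related reagents or expand acronyms.
print(model.wv.most_similar("mao", topn=10))

# Vector arithmetic can suggest analogous catalyst/activator pairings,
# e.g. "zirconocene is to MAO as ? is to borate" (terms are hypothetical).
print(model.wv.most_similar(positive=["zirconocene", "borate"], negative=["mao"], topn=5))
```

In practice the same queries run over a model trained on thousands of metallocene abstracts would be compared against combinations already reported in the literature to flag understudied pairings.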
