Data-driven Chemical Reaction Classification with Attention-Based Neural Networks

Organic reactions are typically grouped into classes that collect entities undergoing similar structural rearrangements. The classification process is a tedious task, requiring first an accurate mapping of the rearrangement (atom mapping) followed by the identification of the corresponding reaction class template. In this work, we present two transformer-based models that infer reaction classes from the SMILES representation of chemical reactions. The first model, a sequence-to-sequence model, reaches an accuracy of 93.8% on a multi-class classification task involving several hundred classes. Initial results show that the second model, a BERT classifier, also achieves high accuracy (95.3%) on this task. The attention weights provided by the sequence-to-sequence model give insight into which parts of the SMILES strings the model takes into account for classification, based solely on data. We study the incorrect predictions of this model and show that it uncovers different biases and mistakes in the underlying data set.
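Both models consume reaction SMILES as token sequences. As a minimal illustration (not the authors' exact pipeline), the sketch below tokenizes a reaction SMILES with the atom-level regular expression commonly used for SMILES language models; the example esterification reaction is hypothetical and chosen only for demonstration.

```python
import re

# Atom-level SMILES tokenizer: bracket atoms ([OH-], [nH], ...) stay whole,
# two-letter elements Br/Cl stay whole, everything else is one character.
SMILES_REGEX = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize(reaction_smiles: str) -> list[str]:
    """Split a (reaction) SMILES string into model-ready tokens."""
    tokens = SMILES_REGEX.findall(reaction_smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == reaction_smiles, "tokenizer dropped characters"
    return tokens

# Hypothetical Fischer esterification: ethanol + acetic acid >> ethyl acetate
print(tokenize("CCO.CC(=O)O>>CC(=O)OCC"))
```

A classifier then maps this token sequence to a reaction-class label, either by decoding the label with a sequence-to-sequence model or by feeding the sequence to a BERT-style encoder with a classification head.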
