Transformer Performance for Chemical Reactions: Analysis of Different Predictive and Evaluation Scenarios

The prediction of chemical reaction pathways has been accelerated by the development of novel deep learning architectures. In this context, neural networks originally designed for language translation have been used to accurately predict a wide range of chemical reactions. Among such sequence-to-sequence models, the recently introduced Molecular Transformer has achieved impressive performance in forward-synthesis and retrosynthesis prediction. In this study, we first analyze the performance of transformer models for product, reactant, and reagent prediction tasks under different scenarios of data availability and data augmentation. We find that the impact of data augmentation depends on the prediction task and on the metric used to evaluate model performance. Second, we probe the contribution of different combinations of input formats, tokenization schemes, and embedding strategies to model performance. We find that less stable input settings generally lead to better performance. Lastly, we validate the superiority of round-trip accuracy over simpler evaluation metrics, such as top-k accuracy, using a committee of human experts, and we observe strong agreement for predictions that pass the round-trip test. This demonstrates the usefulness of more elaborate metrics in complex predictive scenarios and highlights the limitations of comparing predictions directly against a predefined database, which may cover only a limited number of valid reaction pathways.
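To make the data-augmentation and tokenization scenarios concrete, the sketch below illustrates one common way of producing such inputs at the SMILES level: enumerating non-canonical (randomized) SMILES of the same molecule as augmented training examples, and splitting strings into atom-level tokens with the regex tokenizer popularized by the Molecular Transformer line of work. This is a minimal illustration assuming RDKit is available; the function names and the number of variants are chosen here for illustration and do not reproduce the exact pipeline used in the study.

```python
# Minimal sketch of SMILES-level data augmentation and regex tokenization.
# Assumes RDKit is installed; illustrative only, not the study's exact pipeline.
import re

from rdkit import Chem

# Atom-level SMILES tokenization pattern (regex scheme widely used with the Molecular Transformer).
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def randomized_smiles(smiles: str, n_variants: int = 5) -> list:
    """Generate non-canonical SMILES of the same molecule (simple data augmentation)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    variants = {Chem.MolToSmiles(mol, canonical=False, doRandom=True) for _ in range(n_variants)}
    return sorted(variants)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into atom-level tokens for the transformer vocabulary."""
    return SMILES_TOKEN_PATTERN.findall(smiles)


if __name__ == "__main__":
    smi = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
    for variant in randomized_smiles(smi):
        print(variant, tokenize(variant))
```

The evaluation metrics compared in the study can be summarized in a similar way. Under top-k accuracy, a retrosynthesis prediction counts as correct only if one of the k candidate reactant sets exactly matches the reference entry in the database; under round-trip accuracy, it counts as correct if a forward-prediction model maps the proposed reactants back to the target product. The sketch below assumes a hypothetical forward_model callable standing in for a trained forward transformer and uses RDKit canonicalization so that string comparisons are notation-invariant; it illustrates the metric definitions rather than the authors' evaluation code.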
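```python
# Sketch contrasting top-k accuracy with round-trip accuracy for retrosynthesis evaluation.
# `forward_model` is a hypothetical stand-in for a trained forward-prediction model.
from typing import Callable, Optional, Sequence

from rdkit import Chem


def canonical(smiles: str) -> Optional[str]:
    """Return a canonical SMILES string, or None if the input cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None


def top_k_hit(predicted_reactants: Sequence[str], reference_reactants: str) -> bool:
    """Top-k criterion: some candidate exactly matches the reactant set recorded in the database."""
    reference = canonical(reference_reactants)
    return reference is not None and any(canonical(p) == reference for p in predicted_reactants)


def round_trip_hit(
    predicted_reactants: Sequence[str],
    target_product: str,
    forward_model: Callable[[str], str],  # hypothetical trained forward-prediction model
) -> bool:
    """Round-trip criterion: some candidate is mapped back to the target product by the forward model."""
    target = canonical(target_product)
    return target is not None and any(
        canonical(forward_model(p)) == target for p in predicted_reactants
    )
```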
