Transfer learning enables the molecular transformer to predict regio- and stereoselective reactions on carbohydrates

Organic synthesis methodology enables the synthesis of complex molecules and materials used in all fields of science and technology, and represents a vast body of accumulated knowledge well suited to deep learning. While most organic reactions involve distinct functional groups and can readily be learned by deep learning models and chemists alike, regio- and stereoselective transformations are more challenging because their outcome also depends on the surroundings of the functional group. Here, we challenge the Molecular Transformer model to predict reactions on carbohydrates, where regio- and stereoselectivity are notoriously difficult to predict. We show that transfer learning, in which the general patent reaction model is further trained on a small set of carbohydrate reactions, produces a specialized model that predicts carbohydrate reactions with remarkable accuracy. We validate these predictions experimentally with the synthesis of a lipid-linked oligosaccharide involving regioselective protections and stereoselective glycosylations. The transfer learning approach should be applicable to any reaction class of interest.
