Augmenting Molecular Deep Generative Models with Topological Data Analysis Representations

Deep generative models have emerged as a powerful tool for learning informative molecular representations and designing novel molecules with desired properties, with applications in drug discovery and materials design. Deep generative auto-encoders defined over molecular SMILES strings have been a popular choice for that purpose. However, capturing salient molecular properties such as quantum-chemical energies remains challenging and requires sophisticated neural network models of molecular graphs or geometry-based information. As a simpler and more efficient alternative, we present a SMILES Variational Auto-Encoder (VAE) augmented with topological data analysis (TDA) representations of molecules, known as persistence images. Our experiments show that this TDA augmentation enables a SMILES VAE to capture the complex relation between 3D geometry and electronic properties. It also allows generation of novel, diverse, and valid molecules whose geometric features are consistent with the training data and which exhibit a range of global electronic structural properties, such as a small HOMO-LUMO gap, a critical property for designing organic solar cells. We demonstrate that our TDA augmentation yields better success in downstream tasks compared to models trained without these representations and can assist in targeted molecule discovery.
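To make the notion of a persistence image concrete, the following is a minimal sketch of how a persistence diagram might be turned into a fixed-size vector. It is not the pipeline used in this work: the function name `persistence_image`, the linear persistence weighting, the Gaussian width `sigma`, the grid `resolution`, and the `extent` are all illustrative assumptions, and the Gaussian surface is evaluated at pixel centers rather than integrated over each pixel.

```python
import numpy as np


def persistence_image(diagram, resolution=20, sigma=0.1,
                      extent=(0.0, 1.0, 0.0, 1.0)):
    """Approximate persistence image for (birth, death) pairs.

    Hypothetical helper for illustration only. Each point is mapped to
    (birth, persistence), weighted linearly by its persistence, and
    smoothed with an isotropic Gaussian sampled at pixel centers.
    """
    diagram = np.asarray(diagram, dtype=float).reshape(-1, 2)
    births = diagram[:, 0]
    persistences = diagram[:, 1] - diagram[:, 0]

    b_min, b_max, p_min, p_max = extent
    # Pixel centers along the birth (columns) and persistence (rows) axes.
    b_centers = b_min + (np.arange(resolution) + 0.5) * (b_max - b_min) / resolution
    p_centers = p_min + (np.arange(resolution) + 0.5) * (p_max - p_min) / resolution
    grid_b, grid_p = np.meshgrid(b_centers, p_centers)

    image = np.zeros((resolution, resolution))
    norm = 1.0 / (2.0 * np.pi * sigma ** 2)
    for b, p in zip(births, persistences):
        # Assumed weighting: points near the diagonal (small p) count less.
        gaussian = norm * np.exp(
            -((grid_b - b) ** 2 + (grid_p - p) ** 2) / (2.0 * sigma ** 2)
        )
        image += p * gaussian
    return image


# Toy diagram with one feature born at 0.225 that dies at 0.75.
toy_image = persistence_image([[0.225, 0.75]])
feature_vector = toy_image.ravel()  # fixed-length vector of size resolution**2
```

The flattened `feature_vector` has the same length for every molecule regardless of how many topological features its diagram contains, which is what makes such a representation convenient as an extra input to a neural model.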
