A Systematic Survey of Molecular Pre-trained Models

Deep learning has achieved remarkable success in learning representations for molecules, which is crucial for various biochemical applications, ranging from property prediction to drug design. However, training Deep Neural Networks (DNNs) from scratch often requires abundant labeled molecules, which are expensive to acquire in the real world. To alleviate this issue, tremendous efforts have been devoted to Molecular Pre-trained Models (MPMs), where DNNs are pre-trained on large-scale unlabeled molecular databases and then fine-tuned on specific downstream tasks. Despite this progress, a systematic review of this fast-growing field is still lacking. In this paper, we present the first survey that summarizes the current progress of MPMs. We first highlight the limitations of training molecular representation models from scratch to motivate MPM studies. Next, we systematically review recent advances on this topic from several key perspectives, including molecular descriptors, encoder architectures, pre-training strategies, and applications. We also highlight the challenges and promising avenues for future research, providing a useful resource for both the machine learning and scientific communities.
