Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network

Because safety is one of the most important properties of a drug, chemical toxicity prediction has received increasing attention in drug discovery research. Traditionally, researchers rely on in vitro and in vivo experiments to test the toxicity of chemical compounds. However, these experiments are time-consuming and costly, and those that involve animal testing are increasingly subject to ethical concerns. While traditional machine learning (ML) methods have been used in the field with some success, the limited availability of annotated toxicity data is the major hurdle to further improving model performance. Inspired by the success of semi-supervised learning (SSL) algorithms, we propose a graph convolutional neural network (GCN) to predict chemical toxicity and train the network with the Mean Teacher (MT) SSL algorithm. Using the Tox21 data, our optimal SSL-GCN models for predicting the twelve toxicological endpoints achieve an average ROC-AUC score of 0.757 on the test set, a 6% improvement over GCN models trained by supervised learning and over conventional ML methods. Our SSL-GCN models also outperform models constructed using the built-in DeepChem ML methods. These results show that SSL can increase the predictive power of models by learning from unannotated data; the optimal ratio of unannotated to annotated data ranges between 1:1 and 4:1. The same technique is expected to benefit other chemical property prediction tasks by utilizing existing large chemical databases. Our optimal SSL-GCN model is hosted on an online server accessible at https://app.cbbio.online/ssl-gcn/home.
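As a rough illustration of the Mean Teacher idea named above, the following minimal sketch shows its two generic ingredients: a teacher whose weights are an exponential moving average of the student's weights, and a loss that adds a consistency term over all compounds to a supervised term over the annotated ones. This is not the authors' implementation. The function names, the flat weight lists, and the parameters `alpha` and `consistency_weight` are assumptions made for illustration, and the GCN itself is omitted.

```python
import math


def ema_update(teacher, student, alpha=0.99):
    """Return teacher weights moved toward the student by an exponential moving average."""
    return [alpha * t + (1.0 - alpha) * s for t, s in zip(teacher, student)]


def binary_cross_entropy(p, y, eps=1e-7):
    """Cross-entropy for one predicted probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def mean_teacher_loss(student_probs, teacher_probs, labels, consistency_weight=1.0):
    """Supervised loss on annotated samples plus consistency loss on all samples.

    labels holds 0 or 1 for annotated compounds and None for unannotated ones
    (a hypothetical encoding chosen for this sketch).
    """
    labeled = [(p, y) for p, y in zip(student_probs, labels) if y is not None]
    supervised = sum(binary_cross_entropy(p, y) for p, y in labeled) / max(len(labeled), 1)
    # Consistency: mean squared difference between student and teacher predictions.
    consistency = sum(
        (s - t) ** 2 for s, t in zip(student_probs, teacher_probs)
    ) / len(student_probs)
    return supervised + consistency_weight * consistency
```

In a training loop under these assumptions, the student would be updated by gradient descent on `mean_teacher_loss`, after which `ema_update` would refresh the teacher. Unannotated compounds contribute only through the consistency term, which is how they can influence the model without labels.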
