Evaluation of deep and shallow learning methods in chemogenomics for the prediction of drugs specificity

Chemogenomics, also called proteochemometrics, covers a range of computational methods that can be used to predict protein–ligand interactions at large scales in the protein and chemical spaces. They differ from more classical ligand-based methods (also called QSAR) that predict ligands for a given protein receptor. In the context of drug discovery process, chemogenomics allows to tackle the question of predicting off-target proteins for drug candidates, one of the main causes of undesirable side-effects and failure within drugs development processes. The present study compares shallow and deep machine-learning approaches for chemogenomics, and explores data augmentation techniques for deep learning algorithms in chemogenomics. Shallow machine-learning algorithms rely on expert-based chemical and protein descriptors, while recent developments in deep learning algorithms enable to learn abstract numerical representations of molecular graphs and protein sequences, in order to optimise the performance of the prediction task. We first propose a formulation of chemogenomics with deep learning, called the chemogenomic neural network (CN), as a feed-forward neural network taking as input the combination of molecule and protein representations learnt by molecular graph and protein sequence encoders. We show that, on large datasets, the deep learning CN model outperforms state-of-the-art shallow methods, and competes with deep methods with expert-based descriptors. However, on small datasets, shallow methods present better prediction performance than deep learning methods. Then, we evaluate data augmentation techniques, namely multi-view and transfer learning, to improve the prediction performance of the chemogenomic neural network. We conclude that a promising research direction is to integrate heterogeneous sources of data such as auxiliary tasks for which large datasets are available, or independently, multiple molecule and protein attribute views.

[1]  Lukasz Kurgan,et al.  DeepCNF-D: Predicting Protein Order/Disorder Regions by Weighted Deep Convolutional Neural Fields , 2015, International journal of molecular sciences.

[2]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[3]  J. Drews Drug discovery: a historical perspective. , 2000, Science.

[4]  Ping Zhang,et al.  Interpretable Drug Target Prediction Using Deep Neural Representation , 2018, IJCAI.

[5]  Qiang Yang,et al.  An Overview of Multi-task Learning , 2018 .

[6]  Tatsuya Akutsu,et al.  Protein homology detection using string alignment kernels , 2004, Bioinform..

[7]  Chunyan Miao,et al.  Neighborhood Regularized Logistic Matrix Factorization for Drug-Target Interaction Prediction , 2016, PLoS Comput. Biol..

[8]  Jure Leskovec,et al.  Representation Learning on Graphs: Methods and Applications , 2017, IEEE Data Eng. Bull..

[9]  Neil Swainston,et al.  Towards a genome-scale kinetic model of cellular metabolism , 2010, BMC Systems Biology.

[10]  Tapio Pahikkala,et al.  Toward more realistic drug^target interaction predictions , 2014 .

[11]  Amit Sethi,et al.  Some New Layer Architectures for Graph CNN , 2018, ArXiv.

[12]  Elisa Michelini,et al.  Protein ligand interaction prediction , 2012 .

[13]  Yoshihiro Yamanishi,et al.  Supervised prediction of drug–target interactions using bipartite local models , 2009, Bioinform..

[14]  Yuan Yu,et al.  TensorFlow: A system for large-scale machine learning , 2016, OSDI.

[15]  Jean-Philippe Vert,et al.  The optimal assignment kernel is not positive definite , 2008, ArXiv.

[16]  E. Marchiori,et al.  Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile , 2013, PloS one.

[17]  S. K. Riis,et al.  Improving prediction of protein secondary structure using structured neural networks and multiple sequence alignments. , 1996, Journal of computational biology : a journal of computational molecular cell biology.

[18]  Jun Sese,et al.  Compound‐protein interaction prediction with end‐to‐end learning of neural networks for graphs and sequences , 2018, Bioinform..

[19]  Yanli Wang,et al.  PubChem: Integrated Platform of Small Molecules and Biological Activities , 2008 .

[20]  Wei-keng Liao,et al.  Transfer Learning Using Ensemble Neural Networks for Organic Solar Cell Screening , 2019, 2019 International Joint Conference on Neural Networks (IJCNN).

[21]  Jean-Philippe Vert,et al.  Virtual screening of GPCRs: An in silico chemogenomics approach , 2008, BMC Bioinformatics.

[22]  Yoshihiro Yamanishi,et al.  Inferring Chemogenomic Features from Drug‐Target Interaction Networks , 2013, Molecular informatics.

[23]  Jean-Philippe Vert,et al.  Machine Learning for In Silico Virtual Screening and Chemical Genomics: New Strategies , 2008, Combinatorial chemistry & high throughput screening.

[24]  Shi-Hua Zhang,et al.  DrugE-Rank: improving drug–target interaction prediction of new candidate drugs or targets by ensemble learning to rank , 2016, Bioinform..

[25]  Pierre Baldi,et al.  Statistical machine learning and data mining for chemoinformatics and drug discovery , 2010 .

[26]  Xiaobo Zhou,et al.  Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces , 2010, BMC Systems Biology.

[27]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[28]  Yoshihiro Yamanishi,et al.  Drug target prediction using adverse event report systems: a pharmacogenomic approach , 2012, Bioinform..

[29]  Yoshihiro Yamanishi,et al.  Drug-target interaction prediction from chemical, genomic and pharmacological data in an integrated framework , 2010, Bioinform..

[30]  Arzucan Özgür,et al.  DeepDTA: deep drug–target binding affinity prediction , 2018, Bioinform..

[31]  Mehmet Gönen,et al.  Predicting drug-target interactions from chemical and genomic kernels using Bayesian matrix factorization , 2012, Bioinform..

[32]  Chee Keong Kwoh,et al.  Drug-target interaction prediction by learning from local information and neighbors , 2013, Bioinform..

[33]  Artem Cherkasov,et al.  SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines , 2017, Journal of Cheminformatics.

[34]  David Weininger,et al.  SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules , 1988, J. Chem. Inf. Comput. Sci..

[35]  Alexios Koutsoukas,et al.  Deep-learning: investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data , 2017, Journal of Cheminformatics.

[36]  Sebastian Ruder,et al.  An Overview of Multi-Task Learning in Deep Neural Networks , 2017, ArXiv.

[37]  Ole Winther,et al.  Protein Secondary Structure Prediction with Long Short Term Memory Networks , 2014, ArXiv.

[38]  Le Song,et al.  Discriminative Embeddings of Latent Variable Models for Structured Data , 2016, ICML.

[39]  Jure Leskovec,et al.  Inductive Representation Learning on Large Graphs , 2017, NIPS.

[40]  Daniela Trisciuzzi,et al.  A New Approach for Drug Target and Bioactivity Prediction: The Multifingerprint Similarity Search Algorithm (MuSSeL) , 2018, J. Chem. Inf. Model..

[41]  Pierre Baldi,et al.  Deep Architectures and Deep Learning in Chemoinformatics: The Prediction of Aqueous Solubility for Drug-Like Molecules , 2013, J. Chem. Inf. Model..

[42]  Alán Aspuru-Guzik,et al.  Convolutional Networks on Graphs for Learning Molecular Fingerprints , 2015, NIPS.

[43]  Junzhou Huang,et al.  Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery , 2017, BCB.

[44]  Woody Sherman,et al.  Analysis and comparison of 2D fingerprints: insights into database screening performance using eight fingerprint methods , 2010, J. Cheminformatics.

[45]  Samuel Kaski,et al.  Kernelized Bayesian Matrix Factorization , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[46]  Vikas Bansal,et al.  Fast individual ancestry inference from DNA sequence data leveraging allele frequencies for multiple populations , 2014, BMC Bioinformatics.

[47]  Elena Marchiori,et al.  Gaussian interaction profile kernels for predicting drug-target interaction , 2011, Bioinform..

[48]  Sungroh Yoon,et al.  DeepCCI: End-to-end Deep Learning for Chemical-Chemical Interaction Prediction , 2017, BCB.

[49]  Véronique Stoven,et al.  Efficient multi-task chemogenomics for drug specificity prediction , 2017, bioRxiv.

[50]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.

[51]  Gerard J. P. van Westen,et al.  Benchmarking of protein descriptor sets in proteochemometric modeling (part 1): comparative study of 13 amino acid descriptor sets , 2013, Journal of Cheminformatics.

[52]  Yvonne C. Martin,et al.  The Information Content of 2D and 3D Structural Descriptors Relevant to Ligand-Receptor Binding , 1997, J. Chem. Inf. Comput. Sci..

[53]  Vasilis J. Promponas,et al.  Protein Secondary Structure Prediction with Bidirectional Recurrent Neural Nets: Can Weight Updating for Each Residue Enhance Performance? , 2010, AIAI.

[54]  Jinfeng Yi,et al.  Edge Attention-based Multi-Relational Graph Convolutional Networks , 2018, ArXiv.

[55]  David S. Wishart,et al.  DrugBank: a comprehensive resource for in silico drug discovery and exploration , 2005, Nucleic Acids Res..

[56]  Yoshihiro Yamanishi,et al.  Prediction of drug–target interaction networks from the integration of chemical and genomic spaces , 2008, ISMB.

[57]  Ole Winther,et al.  An introduction to deep learning on biological sequence data: examples and solutions , 2017, Bioinform..

[58]  Isidro Cortes-Ciriano,et al.  Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets , 2013, Journal of Cheminformatics.

[59]  van den Berg,et al.  UvA-DARE (Digital Academic Modeling Relational Data with Graph Convolutional Networks Modeling Relational Data with Graph Convolutional Networks , 2017 .

[60]  Zhi-Wei Cao,et al.  Efficacy of different protein descriptors in predicting protein functional families , 2007, BMC Bioinformatics.

[61]  Pierre Baldi,et al.  Kernels for small molecules and the prediction of mutagenicity, toxicity and anti-cancer activity , 2005, ISMB.

[62]  Kuldip K. Paliwal,et al.  Predicting backbone Cα angles and dihedrals from protein sequences by stacked sparse auto‐encoder deep neural network , 2014, J. Comput. Chem..

[63]  Regina Barzilay,et al.  Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction , 2017, J. Chem. Inf. Model..

[64]  Sereina Riniker,et al.  Open-source platform to benchmark fingerprints for ligand-based virtual screening , 2013, Journal of Cheminformatics.

[65]  Hans-Joachim Böhm,et al.  A guide to drug discovery: Hit and lead generation: beyond high-throughput screening , 2003, Nature Reviews Drug Discovery.

[66]  Richard S. Zemel,et al.  Gated Graph Sequence Neural Networks , 2015, ICLR.

[67]  John P. Overington,et al.  ChEMBL: a large-scale bioactivity database for drug discovery , 2011, Nucleic Acids Res..

[68]  Isidro Cortes-Ciriano,et al.  Applications of proteochemometrics - from species extrapolation to cell line sensitivity modelling , 2015, BMC Bioinformatics.

[69]  Joseph Gomes,et al.  MoleculeNet: a benchmark for molecular machine learning† †Electronic supplementary information (ESI) available. See DOI: 10.1039/c7sc02664a , 2017, Chemical science.

[70]  Max Welling,et al.  Modeling Relational Data with Graph Convolutional Networks , 2017, ESWC.

[71]  Hilde van der Togt,et al.  Publisher's Note , 2003, J. Netw. Comput. Appl..

[72]  Vijay S. Pande,et al.  Massively Multitask Networks for Drug Discovery , 2015, ArXiv.

[73]  Gerard J. P. van Westen,et al.  Proteochemometric modeling as a tool to design selective compounds and for extrapolating to novel targets , 2011 .

[74]  Hao Ding,et al.  Collaborative matrix factorization with multiple similarities for predicting drug-target interactions , 2013, KDD.

[75]  Yoshihiro Yamanishi,et al.  Predicting target proteins for drug candidate compounds based on drug-induced gene expression data in a chemical structure-independent manner , 2015, BMC Medical Genomics.

[76]  Vijay V. Raghavan,et al.  A critical investigation of recall and precision as measures of retrieval system performance , 1989, TOIS.

[77]  Jean-Philippe Vert,et al.  Protein-ligand interaction prediction: an improved chemogenomics approach , 2008, Bioinform..

[78]  Vijay S. Pande,et al.  Low Data Drug Discovery with One-Shot Learning , 2016, ACS central science.

[79]  Andreas Bender,et al.  How Similar Are Similarity Searching Methods? A Principal Component Analysis of Molecular Descriptor Space , 2009, J. Chem. Inf. Model..