Industry-scale application and evaluation of deep learning for drug target prediction

Artificial intelligence (AI) is undergoing a revolution thanks to breakthroughs of machine learning algorithms in computer vision, speech recognition, natural language processing and generative modelling. Recent work on publicly available pharmaceutical data has shown that AI methods are highly promising for drug target prediction. However, the quality of public data may differ from that of industry data owing to different labs reporting measurements, different measurement techniques, fewer samples, and less diverse and specialized assays. As part of a European-funded project (ExCAPE) that brought together expertise from the pharmaceutical industry, machine learning, and high-performance computing, we investigated how well machine learning models obtained from public data can be transferred to internal pharmaceutical industry data. Our results show that machine learning models trained on public data can indeed retain their predictive power to a large degree when applied to industry data. Moreover, we observed that models derived from deep learning outperformed comparable models trained with other machine learning algorithms when applied to internal pharmaceutical company datasets. To our knowledge, this is the first large-scale study that evaluates the potential of machine learning, and especially deep learning, directly in industry-scale settings, and that investigates the transferability of target prediction models learned from public data to industrial bioactivity prediction pipelines.

[1]  Jason Weston,et al.  A unified architecture for natural language processing: deep neural networks with multitask learning , 2008, ICML '08.

[2]  Lior Rokach,et al.  Introduction to Recommender Systems Handbook , 2011, Recommender Systems Handbook.

[3]  Jacob Cohen A Coefficient of Agreement for Nominal Scales , 1960 .

[4]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[5]  Lars Carlsson,et al.  ExCAPE-DB: an integrated large scale dataset facilitating Big Data analysis in chemogenomics , 2017, Journal of Cheminformatics.

[6]  Andreas Bender,et al.  Target prediction utilising negative bioactivity data covering large chemical space , 2015, Journal of Cheminformatics.

[7]  Yadi Zhou,et al.  Exploring Tunable Hyperparameters for Deep Neural Networks with Industrial ADME Data Sets. , 2018, Journal of chemical information and modeling.

[8]  Sean Ekins The Next Era: Deep Learning in Pharmaceutical Research , 2016, Pharmaceutical Research.

[9]  K. Friedemann Schmidt,et al.  Predictive Multitask Deep Neural Network Models for ADME-Tox Properties: Learning from Large Data Sets , 2019, J. Chem. Inf. Model..

[10]  Chaoyang Zhang,et al.  Deep Learning Based Analysis of Histopathological Images of Breast Cancer , 2019, Front. Genet..

[11]  Navdeep Jaitly,et al.  Multi-task Neural Networks for QSAR Predictions , 2014, ArXiv.

[12]  Günter Klambauer,et al.  DeepTox: Toxicity Prediction using Deep Learning , 2016, Front. Environ. Sci..

[13]  Andrew R. Leach,et al.  Large scale comparison of QSAR and conformal prediction methods and their applications in drug discovery , 2019, Journal of Cheminformatics.

[14]  Nicolò Cesa-Bianchi,et al.  Advances in Neural Information Processing Systems 31 , 2018, NIPS 2018.

[15]  Geoffrey E. Hinton,et al.  Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[16]  A. Vulpetti,et al.  Comparability of Mixed IC50 Data – A Statistical Analysis , 2013, PloS one.

[17]  George Papadatos,et al.  The ChEMBL database in 2017 , 2016, Nucleic Acids Res..

[18]  Andreas Bender,et al.  From in silico target prediction to multi-target drug design: current databases, methods and applications. , 2011, Journal of proteomics.

[19]  Hilde van der Togt,et al.  Publisher's Note , 2003, J. Netw. Comput. Appl..

[20]  Sean Ekins,et al.  Exploiting machine learning for end-to-end drug discovery and development , 2019, Nature Materials.

[21]  Thomas Blaschke,et al.  The rise of deep learning in drug discovery. , 2018, Drug discovery today.

[22]  Scott Boyer,et al.  CHEMOINFORMATICS AND BEYOND , 2013 .

[23]  Nomenclature committee of the international union of biochemistry and molecular biology (NC-IUBMB), Enzyme Supplement 5 (1999). , 1999, European journal of biochemistry.

[24]  Herman van Vlijmen,et al.  Computational chemistry at Janssen , 2017, Journal of Computer-Aided Molecular Design.

[25]  Robert P. Sheridan,et al.  Deep Neural Nets as a Method for Quantitative Structure-Activity Relationships , 2015, J. Chem. Inf. Model..

[26]  Andy Liaw,et al.  Demystifying Multitask Deep Neural Networks for Quantitative Structure-Activity Relationships , 2017, J. Chem. Inf. Model..

[27]  Antonio Lavecchia,et al.  Machine-learning approaches in drug discovery: methods and applications. , 2015, Drug discovery today.

[28]  R. Glen,et al.  Molecular similarity: a key technique in molecular informatics. , 2004, Organic & biomolecular chemistry.

[29]  Ron Artstein,et al.  Survey Article: Inter-Coder Agreement for Computational Linguistics , 2008, CL.

[30]  Andreas Verras,et al.  Is Multitask Deep Learning Practical for Pharma? , 2017, J. Chem. Inf. Model..

[31]  David M. W. Powers,et al.  Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation , 2011, ArXiv.

[32]  Evan Bolton,et al.  PubChem 2019 update: improved access to chemical data , 2018, Nucleic Acids Res..

[33]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Tom Vander Aa,et al.  SMURFF: a High-Performance Framework for Matrix Factorization , 2019, 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS).

[35]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[36]  Dong-Sheng Cao,et al.  Artificial intelligence facilitates drug design in the big data era , 2019, Chemometrics and Intelligent Laboratory Systems.

[37]  Sepp Hochreiter,et al.  Self-Normalizing Neural Networks , 2017, NIPS.

[38]  Bin Li,et al.  Applications of machine learning in drug discovery and development , 2019, Nature Reviews Drug Discovery.

[39]  Jan Martinovic,et al.  HyperLoom: A Platform for Defining and Executing Scientific Pipelines in Distributed Environments , 2018, PARMA-DITAM '18.

[40]  Lior Rokach,et al.  Recommender Systems Handbook , 2010 .

[41]  Marc A. Martí-Renom,et al.  Target Prediction for an Open Access Set of Compounds Active against Mycobacterium tuberculosis , 2013, PLoS Comput. Biol..

[42]  Hugo Ceulemans,et al.  Large-scale comparison of machine learning methods for drug target prediction on ChEMBL , 2018, Chemical science.

[43]  Yves Moreau,et al.  Macau: Scalable Bayesian factorization with high-dimensional side information using MCMC , 2017, 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP).

[44]  Andrea Volkamer,et al.  Advances and Challenges in Computational Target Prediction , 2019, J. Chem. Inf. Model..

[45]  Andreas Bender,et al.  In Silico Target Predictions: Defining a Benchmarking Data Set and Comparison of Performance of the Multiclass Naïve Bayes and Parzen-Rosenblatt Window , 2013, J. Chem. Inf. Model..

[46]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[47]  David Rogers,et al.  Extended-Connectivity Fingerprints , 2010, J. Chem. Inf. Model..

[48]  Tipton Kf,et al.  Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB). Enzyme nomenclature. Recommendations 1992. Supplement: corrections and additions. , 1994 .

[49]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[50]  Egon L. Willighagen,et al.  The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching , 2017, Journal of Cheminformatics.

[51]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[52]  Knut Baumann,et al.  Reliable estimation of prediction errors for QSAR models under model uncertainty using double cross-validation , 2014, Journal of Cheminformatics.

[53]  Brian Kingsbury,et al.  New types of deep neural network learning for speech recognition and related applications: an overview , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[54]  Nina Jeliazkova,et al.  Ambit‐Tautomer: An Open Source Tool for Tautomer Generation , 2013, Molecular informatics.

[55]  Darko Butina,et al.  Unsupervised Data Base Clustering Based on Daylight's Fingerprint and Tanimoto Similarity: A Fast and Automated Way To Cluster Small and Large Data Sets , 1999, J. Chem. Inf. Comput. Sci..

[56]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..