All-Assay-Max2 pQSAR: Activity predictions as accurate as 4-concentration IC50s for 8,558 Novartis assays

Profile-QSAR (pQSAR) is a massively multi-task, 2-step machine learning method with unprecedented scope, accuracy and applicability domain. In step one, a “profile” of conventional single-assay random forest regression (RFR) models is trained on a very large number of biochemical and cellular pIC50 assays, using Morgan 2 substructural fingerprints as compound descriptors. In step two, a panel of PLS models is built using the profile of pIC50 predictions from those RFR models as compound descriptors; hence the name. The method was previously described for a panel of 728 biochemical and cellular kinase assays. We have now built an enormous pQSAR from 11,805 diverse Novartis IC50 and EC50 assays. This large number of assays, and hence of compound descriptors for PLS, required reducing the profile for each assay to only those RFR models whose predictions correlate with the assay being modeled. The RFR and pQSAR models were evaluated with our “realistically novel” held-out test set, for which the average similarity to the nearest training set member had a median across the 11,805 assays of only 0.34, thus testing a realistically large applicability domain. For the 11,805 single-assay RFR models, the median correlation of prediction with experiment was only R²ext = 0.05, virtually random, and only 8% of the models achieved our standard success threshold of R²ext > 0.30. For pQSAR, the median correlation was R²ext = 0.53, comparable to 4-concentration experimental IC50s, and 72% of the models met our R²ext > 0.30 standard, totaling 8,558 successful models. The successful models included assays from all 51 annotated target sub-classes, as well as 4,196 phenotypic assays, indicating that pQSAR can be applied to virtually any disease area. Every month, all models are updated to include new measurements, and predictions are made for 5.5 million Novartis compounds, totaling 50 billion predictions. Common uses have included virtual screening, selectivity design, toxicity and promiscuity prediction, and mechanism-of-action prediction, among others.
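
The profile-reduction step described above can be illustrated with a minimal sketch. It assumes that "correlate" means a squared Pearson correlation, computed on the training compounds, between each RFR model's predictions and the measured pIC50s of the assay being modeled. The function names, the toy data and the 0.1 cutoff are hypothetical and chosen only for illustration; the abstract does not state the actual selection criterion.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy * sxy / (sxx * syy)


def select_profile(profile, measured, min_r2=0.1):
    """Return names of RFR models whose predictions correlate with the target assay.

    profile  : dict mapping assay name -> predicted pIC50s for the training compounds
    measured : experimental pIC50s of the same compounds in the assay being modeled
    min_r2   : hypothetical cutoff, not taken from the paper
    """
    return [name for name, preds in profile.items()
            if pearson_r2(preds, measured) >= min_r2]


# Toy data: four training compounds, three candidate profile columns.
measured = [5.0, 6.0, 7.0, 8.0]
profile = {
    "assay_A": [5.1, 6.2, 6.9, 7.8],  # strongly correlated
    "assay_B": [6.1, 5.9, 5.9, 6.1],  # essentially uncorrelated
    "assay_C": [8.0, 7.1, 6.0, 5.2],  # anti-correlated, still informative
}
selected = select_profile(profile, measured)
print(selected)
```

Using the squared correlation keeps anti-correlated assays in the reduced profile, since a linear second-step model such as PLS can exploit them with a negative coefficient. Only the selected columns would then be passed on as descriptors for the assay's PLS model.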
