A value-based approach for training of classifiers with high-throughput small molecule screening data

In many practical applications of machine learning, models are built from experimental data that are noisy, biased, and of low quality. Binary classifiers trained on such data perform poorly in independent and prospective tests. This work builds upon techniques for estimating the value of training data and evaluates a batch-based approach to data valuation. Comparative experiments with seven challenging benchmarks demonstrate that value-based training improves classification performance by 10% to 25% in independent tests. Additionally, between 97% and 100% of class labels can be detected among low-valued training samples. Finally, the results show that simpler and faster learning methods, such as generalized linear models, perform as well as complex gradient boosting trees when the training data comprise only the high-valued samples extracted from high-throughput small molecule screens.
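To make the idea of batch-based data valuation concrete, the following is a minimal sketch and not the procedure used in this work. It assumes a Monte Carlo permutation estimate of Shapley values computed over batches rather than single samples, a toy one-dimensional nearest-centroid classifier as a stand-in for the real learner, and a chance-level score of 0.5 for an empty training set. All names (`centroid_accuracy`, `batch_shapley`) and all data points are hypothetical.

```python
import random

def centroid_accuracy(train, valid):
    """Validation accuracy of a 1-D nearest-centroid classifier.

    Returns 0.5 (assumed chance level) if either class is missing from train.
    """
    total = {0: 0.0, 1: 0.0}
    count = {0: 0, 1: 0}
    for x, y in train:
        total[y] += x
        count[y] += 1
    if count[0] == 0 or count[1] == 0:
        return 0.5
    c0 = total[0] / count[0]
    c1 = total[1] / count[1]
    hits = sum((abs(x - c1) < abs(x - c0)) == (y == 1) for x, y in valid)
    return hits / len(valid)

def batch_shapley(batches, valid, n_perm=200, seed=0):
    """Estimate one value per batch by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    values = [0.0] * len(batches)
    order = list(range(len(batches)))
    for _ in range(n_perm):
        rng.shuffle(order)
        train, prev = [], 0.5  # empty training set scores at chance (assumption)
        for b in order:
            train = train + batches[b]
            score = centroid_accuracy(train, valid)
            values[b] += (score - prev) / n_perm
            prev = score
    return values

# Hypothetical toy data: class 0 lies near 0, class 1 lies near 4.
clean = [
    [(0.0 + d, 0), (0.4 + d, 0), (0.8 + d, 0),
     (4.0 + d, 1), (4.4 + d, 1), (4.8 + d, 1)]
    for d in (0.0, 0.1, 0.2)
]
noisy = [(0.2, 1), (4.6, 0)]  # both labels deliberately flipped
batches = clean + [noisy]
valid = [(0.1, 0), (0.5, 0), (0.9, 0), (3.9, 1), (4.3, 1), (4.7, 1)]

values = batch_shapley(batches, valid)
lowest = min(range(len(values)), key=values.__getitem__)

# Value-based training: keep only the batches with a positive estimated value.
kept = [batch for batch, value in zip(batches, values) if value > 0]
print("batch values:", values)
print("lowest-valued batch index:", lowest)
print("batches kept for training:", len(kept))
```

In this toy setting the batch with flipped labels is the only one that can lower the validation score, so it receives the lowest value and is dropped before training. A real application would replace the classifier, the performance metric, and the threshold with choices suited to the screening data.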
