Improving the drug discovery process by using multiple classifier systems

Abstract: Machine learning methods have become indispensable tools for exploiting large knowledge and data repositories in science and technology. In the pharmaceutical domain, the amount of acquired knowledge about the design and synthesis of pharmaceutical agents and bioactive molecules (drugs) is enormous. The primary challenge in automatically discovering new drugs from molecular screening information is the high dimensionality of the datasets, which include a wide range of features for each candidate drug. Improved techniques for handling and interpreting such data are therefore required. To mitigate this problem, our tool (called D2-MCS) splits the dataset homogeneously into several groups (subsets of features) and then determines the most suitable classifier for each group. Finally, the tool determines the biological activity of each molecule through a voting scheme. D2-MCS was tested on a standardized, high-quality dataset gathered from ChEMBL and outperformed well-known single classification models.
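The three steps named in the abstract (split the features into groups, pick the most suitable classifier per group, combine the per-group predictions by voting) can be illustrated with a minimal Python sketch. This is not the D2-MCS implementation: the round-robin feature split, the two toy candidate classifiers (`NearestCentroid`, `MajorityClass`), selection by validation accuracy, plain majority voting, and the tiny dataset are all assumptions made here for illustration only.

```python
from collections import Counter

def split_features(n_features, n_groups):
    """Assumed stand-in for the homogeneous split: round-robin feature indices."""
    return [list(range(g, n_features, n_groups)) for g in range(n_groups)]

def project(X, cols):
    """Keep only the columns belonging to one feature group."""
    return [[row[c] for c in cols] for row in X]

class NearestCentroid:
    """Toy candidate: predict the class whose mean vector is closest."""
    def fit(self, X, y):
        self.centroids = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(self.centroids, key=lambda l: dist(x, self.centroids[l]))
                for x in X]

class MajorityClass:
    """Toy baseline candidate: always predict the most frequent training class."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label] * len(X)

def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)

def fit_mcs(X_train, y_train, X_val, y_val, n_groups, candidates):
    """For each feature group, keep the candidate with the best validation accuracy."""
    ensemble = []
    for cols in split_features(len(X_train[0]), n_groups):
        fitted = [c().fit(project(X_train, cols), y_train) for c in candidates]
        best = max(fitted,
                   key=lambda m: accuracy(y_val, m.predict(project(X_val, cols))))
        ensemble.append((cols, best))
    return ensemble

def predict_mcs(ensemble, X):
    """Majority vote over the per-group classifiers."""
    votes = [model.predict(project(X, cols)) for cols, model in ensemble]
    return [Counter(v).most_common(1)[0][0] for v in zip(*votes)]

# Hypothetical toy data: 6 descriptors per molecule, label 1 = active, 0 = inactive.
X_train = [[0.9, 0.8, 0.1, 0.5, 0.4, 0.6],
           [0.8, 0.9, 0.2, 0.4, 0.5, 0.5],
           [0.1, 0.2, 0.9, 0.5, 0.5, 0.4],
           [0.2, 0.1, 0.8, 0.4, 0.4, 0.5]]
y_train = [1, 1, 0, 0]
X_val = [[0.7, 0.9, 0.3, 0.5, 0.5, 0.5],
         [0.3, 0.2, 0.7, 0.5, 0.5, 0.5]]
y_val = [1, 0]
X_new = [[0.95, 0.85, 0.05, 0.5, 0.5, 0.5],
         [0.05, 0.10, 0.95, 0.5, 0.5, 0.5]]

ensemble = fit_mcs(X_train, y_train, X_val, y_val,
                   n_groups=3, candidates=[NearestCentroid, MajorityClass])
predictions = predict_mcs(ensemble, X_new)
print(predictions)
```

An odd number of groups is used in the example so that a two-class majority vote cannot tie. A real system would replace the toy candidates with full learners and would likely use a more robust selection metric than accuracy on a single validation split.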
