Machine Learning Methods in Chemoinformatics for Drug Discovery

It is well known that the structure of a molecule determines its biological activity and physicochemical properties. Here, we describe the role of machine learning (ML) and statistical methods in building reliable, predictive models in chemoinformatics. ML methods are broadly divided into clustering, classification and regression techniques. In particular, the statistical and mathematical techniques that form part of the ML toolkit, such as artificial neural networks, hidden Markov models, support vector machines, decision tree learning, Random Forest, Naive Bayes and belief networks, are best suited for drug discovery and play an important role in the lead identification and lead optimization steps. This chapter provides stepwise procedures for building ML-based classification and regression models using state-of-the-art open-source and proprietary tools. A few case studies on benchmark data sets demonstrate the efficacy of ML-based classification for drug design.
