Machine learning methods in chemoinformatics

Machine learning algorithms are generally developed in computer science or adjacent disciplines and find their way into chemical modeling by diffusion. Although particular machine learning methods are popular in chemoinformatics and quantitative structure–activity relationship (QSAR) modeling, many others exist in the technical literature. This discussion is methods‐based and focuses on algorithms that chemoinformatics researchers frequently use; it makes no claim to be exhaustive. We concentrate on methods for supervised learning, that is, predicting the unknown property values of a test set of instances, usually molecules, from the known values for a training set. Particularly relevant approaches include artificial neural networks, random forest, support vector machines, k‐nearest neighbors, and naïve Bayes classifiers. WIREs Comput Mol Sci 2014, 4:468–481.
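As a rough illustration of the supervised learning setting described in the abstract, the following minimal sketch uses a k‐nearest neighbors classifier to predict labels for test instances from a labeled training set. The function name `knn_predict`, the two-dimensional descriptor vectors, and the "active"/"inactive" labels are all invented for illustration; they are not taken from the article, and real studies would use computed molecular descriptors and a validated protocol.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+


def knn_predict(train_X, train_y, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest
    training instances (Euclidean distance in descriptor space)."""
    # Rank training instances by their distance to the query instance.
    ranked = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], query))
    # Count the labels of the k closest instances and return the most common.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training set: each molecule is a made-up pair of descriptors
# with a known class label.
train_X = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.25),
           (0.90, 0.80), (0.80, 0.90), (0.85, 0.75)]
train_y = ["inactive", "inactive", "inactive",
           "active", "active", "active"]

# Hypothetical test set: molecules whose labels are treated as unknown.
test_X = [(0.12, 0.18), (0.88, 0.82)]

predictions = [knn_predict(train_X, train_y, x) for x in test_X]
print(predictions)
```

The same train-then-predict pattern applies to the other methods named above; only the model that maps descriptors to property values changes.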
