A machine learning approach towards the prediction of protein–ligand binding affinity based on fundamental molecular properties

There is a pressing need to transform the enormous amount of biological data available in various forms into meaningful knowledge. We applied machine learning (ML) models to existing protein–ligand binding affinity data in order to predict binding affinities that have not yet been measured. ML methods are appreciably faster and cheaper than traditional experimental methods or computational scoring approaches. Such prediction requires training data with sufficient, unbiased features and a model that fits the data well. In our study, we applied the Random forest and Gaussian process regression algorithms from the Weka package to protein–ligand binding affinity data drawn from the PDBbind database. The models are trained on selected fundamental properties of both proteins and ligands, which can be readily fetched from online databases or calculated when the structure is available. The models were assessed on the basis of correlation coefficient (R2) and root mean square error (RMSE). The Random forest model gave an R2 of 0.76 and an RMSE of 1.31. We also applied our features and prediction models to the dataset used by others and found that our model with our features outperformed the existing ones.
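The two assessment metrics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' Weka workflow: the function names and the affinity values are hypothetical, and it assumes that R2 denotes the squared Pearson correlation between measured and predicted affinities, which the abstract does not state explicitly.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted values."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


if __name__ == "__main__":
    # Hypothetical binding affinities, for illustration only;
    # these are not values from the study.
    measured = [4.2, 6.8, 5.1, 7.9, 3.3]
    predicted = [4.9, 6.1, 5.6, 7.2, 4.0]

    r = pearson_r(measured, predicted)
    print("R2 (squared Pearson r, assumed definition):", round(r * r, 3))
    print("RMSE:", round(rmse(measured, predicted), 3))
```

A higher R2 together with a lower RMSE indicates closer agreement between predicted and measured affinities, which is the sense in which the abstract compares models.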
