Low-Quality Structural and Interaction Data Improves Binding Affinity Prediction via Random Forest

Docking scoring functions can be used to predict the strength of protein-ligand binding. It is widely believed that training a scoring function with low-quality data is detrimental to its predictive performance. Nevertheless, there is a surprising lack of systematic validation experiments in support of this hypothesis. In this study, we investigated to what extent training a scoring function on a data set that includes low-quality structural and binding data is detrimental to predictive performance. We found that low-quality data is not only non-detrimental but actually beneficial for the predictive performance of machine-learning scoring functions, though the improvement is smaller than that obtained from high-quality data. Furthermore, we observed that classical scoring functions are not able to effectively exploit data beyond an early threshold, regardless of its quality. This demonstrates that, for machine-learning scoring functions, exploiting a larger volume of data is more important for performance than restricting training to a smaller set of higher-quality data.

Kwong-Sak Leung | Pedro J. Ballester | Hongjian Li | Man-Hon Wong
