Benchmark of four popular virtual screening programs: construction of the active/decoy dataset remains a major determinant of measured performance

Abstract

Background: In structure-based virtual screening, the choice of docking program is essential to the success of hit identification. Benchmarks are meant to guide this choice, especially when they cover a large variety of protein targets. Here, the performance of four popular virtual screening programs, Gold, Glide, Surflex and FlexX, is compared using the Directory of Useful Decoys-Enhanced (DUD-E) database, which includes 102 targets with an average of 224 ligands per target and 50 decoys per ligand, generated to avoid biases in benchmarking. The relationship between program performance and the properties of the targets or of the small molecules was then investigated.

Results: The comparison was based on two metrics, each computed with three different parameters. The BEDROC scores with α = 80.5 indicated that, on the overall database, Glide succeeded (score > 0.5) for 30 targets, Gold for 27, FlexX for 14 and Surflex for 11. Performance depended neither on the hydrophobicity nor on the openness of the protein cavities, nor on the families to which the proteins belong. However, despite the care taken in constructing the DUD-E database, the small differences that remain between the actives and the decoys likely explain the successes of Gold, Surflex and FlexX. Moreover, the similarity between the actives of a target and the ligand of its crystal structure seems to underlie the good performance of Glide. When all targets with significant biases are removed from the benchmark, a subset of 47 targets remains, for which Glide succeeded for only 5 targets, Gold for 4, and FlexX and Surflex for 2 each.

Conclusion: The dramatic drop in performance of all four programs when the biases are removed shows that virtual screening benchmarks should be treated with caution, because good performance may be obtained for the wrong reasons. Benchmarking therefore hardly provides guidelines for virtual screening experiments, even though the overall tendency is maintained, i.e., Glide and Gold perform better than FlexX and Surflex. We recommend always using several programs and combining their results.

Graphical Abstract: Summary of the results obtained by virtual screening with the four programs, Glide, Gold, Surflex and FlexX, on the 102 targets of the DUD-E database. The percentage of targets with successful results, i.e., with BEDROC(α = 80.5) > 0.5, is shown in blue when the entire database is considered, and in red when targets with biased chemical libraries are removed.
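The success criterion used throughout is the BEDROC metric of Truchon and Bayly, an exponentially weighted analogue of the ROC AUC in which the parameter α controls how strongly early-ranked actives dominate the score; with α = 80.5, roughly 80% of the score comes from the top ~2% of the ranked list. As a minimal sketch of how such a score can be computed from a ranked screening result (the function name `bedroc` and its interface are illustrative, not taken from the paper):

```python
import numpy as np

def bedroc(active_ranks, n_total, alpha=80.5):
    """Sketch of the BEDROC score (Truchon & Bayly, 2007).

    active_ranks : 1-based ranks of the active compounds in the sorted hit list
    n_total      : total number of compounds screened (actives + decoys)
    alpha        : early-recognition parameter (80.5 weights ~ the top 2%)
    """
    ranks = np.asarray(active_ranks, dtype=float)
    r_a = len(ranks) / n_total  # ratio of actives in the library

    # Relative Initial Enhancement (RIE): exponentially weighted sum over the
    # active ranks, normalized by its expected value for uniformly spread actives:
    #   RIE = sum_i exp(-alpha * r_i / N) / [ (n/N) (1 - e^-alpha) / (e^(alpha/N) - 1) ]
    weighted_sum = np.exp(-alpha * ranks / n_total).sum()
    rie = weighted_sum / (r_a * (1.0 - np.exp(-alpha)) / (np.exp(alpha / n_total) - 1.0))

    # Rescale RIE onto [0, 1] so that 1 corresponds to a perfect early ranking:
    #   BEDROC = RIE * Ra sinh(alpha/2) / [cosh(alpha/2) - cosh(alpha/2 - alpha Ra)]
    #            + 1 / (1 - e^(alpha (1 - Ra)))
    num = r_a * np.sinh(alpha / 2.0)
    den = np.cosh(alpha / 2.0) - np.cosh(alpha / 2.0 - alpha * r_a)
    return rie * num / den + 1.0 / (1.0 - np.exp(alpha * (1.0 - r_a)))
```

Under this definition, the 0.5 threshold adopted in the paper marks a screen that recovers about half of the achievable early enrichment for the chosen α; raising α tightens the success criterion toward the very top of the ranked list.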
