Considerations in Compound Database Preparation: "Hidden" Impact on Virtual Screening Results

Structure-based virtual screening (SBVS) utilizing docking algorithms has become an essential tool in the drug discovery process, and significant progress has been made in successfully applying the technique to a wide range of receptor targets. In silico validation of a virtual screening protocol, before it is applied to a receptor target using a corporate or commercially available compound collection, is key to establishing a successful process. Ultimately, a set of active compounds must be retrieved from a database of inactives, and the enrichment metric (E) is routinely used to gauge how well the two are separated. Numerous reports have addressed the performance of docking algorithms with regard to the quality of binding mode prediction and the postprocessing of "hit lists" of docked ligands. However, the impact of ligand database preprocessing has yet to be examined in the context of virtual screening and prioritization of compounds for biological evaluation. We provide insight into the implications of cheminformatic preprocessing of a validation database of compounds in which multiple protonation, tautomeric, stereochemical, and conformational states have been enumerated. Several commonly used methods for generating ligand conformations and conformational ensembles are examined, paired with an exhaustive rigid-body algorithm for docking different "multimeric" compound representations to the ligand binding site of the human estrogen receptor alpha. Chemgauss, a shape-Gaussian scoring function with intrinsic chemical knowledge, was combined with PLP in a consensus-scoring scheme to rank output from the docking protocol, and enrichment rates were calculated for each screen. The overheads of CPU consumption and the effect on relative database size (disk requirement) for each of the protocols employed are also considered.
Assessment of these parameters indicates that SBVS enrichments are highly dependent on the initial cheminformatic treatment(s) used in database construction. The interplay of SMILES representations, stereochemical information, protonation state enumeration, and ligand conformation ensembles is critical in achieving optimum enrichment rates in such screening.
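
To make the enrichment calculation concrete, the following minimal Python sketch shows one way a screen of this kind could be evaluated. It is an illustration under stated assumptions, not the protocol used here: the abstract does not give the formula for E or the details of the consensus scheme, so the sketch uses the common enrichment-factor definition (active rate in the top-ranked fraction divided by the active rate in the whole database) and a simple rank-sum consensus. It also assumes that lower scores are better for both scoring functions and that each compound keeps the best score over its enumerated states. All function names are hypothetical.

```python
def best_score_per_compound(state_scores):
    """Collapse (compound_id, score) pairs, one per enumerated state
    (protomer, tautomer, stereoisomer, conformer), to the best score
    per compound. Assumption: lower score means better."""
    best = {}
    for compound_id, score in state_scores:
        if compound_id not in best or score < best[compound_id]:
            best[compound_id] = score
    return best


def rank_positions(scores):
    """Map compound_id -> 1-based rank, best (lowest) score first.
    Ties are broken by compound id to keep the result deterministic."""
    ordered = sorted(scores, key=lambda c: (scores[c], c))
    return {c: i + 1 for i, c in enumerate(ordered)}


def consensus_ranking(scores_a, scores_b):
    """Rank-sum consensus of two scoring functions (assumed scheme).
    Only compounds scored by both functions are ranked."""
    ranks_a = rank_positions(scores_a)
    ranks_b = rank_positions(scores_b)
    common = set(ranks_a) & set(ranks_b)
    return sorted(common, key=lambda c: (ranks_a[c] + ranks_b[c], c))


def enrichment_factor(ranked_ids, actives, fraction):
    """Enrichment factor at the given top fraction of the ranked list:
    (actives in top / size of top) / (actives overall / database size)."""
    n_total = len(ranked_ids)
    if n_total == 0:
        raise ValueError("ranked list is empty")
    n_top = max(1, round(n_total * fraction))
    total_actives = sum(1 for c in ranked_ids if c in actives)
    if total_actives == 0:
        raise ValueError("no actives in the ranked list")
    top_actives = sum(1 for c in ranked_ids[:n_top] if c in actives)
    return (top_actives / n_top) / (total_actives / n_total)
```

Because the enrichment factor is computed per compound rather than per state, the way enumerated states are collapsed (here, best score wins) directly affects the ranking, which is one route by which database preparation can influence the reported enrichment.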