Similarity to Molecules in the Training Set Is a Good Discriminator for Prediction Accuracy in QSAR

How well can a QSAR model predict the activity of a molecule that is not in the training set used to create the model? We ran a set of retrospective cross-validation experiments on 20 diverse in-house activity sets to find a good discriminator of prediction accuracy, measured as the root-mean-square difference between observed and predicted activity. Among the measures we tested, two seem useful: the similarity of the molecule to be predicted to the nearest molecule in the training set, and the number of neighbors in the training set, where neighbors are those molecules more similar than a user-chosen cutoff. The molecules with the highest similarity, the most neighbors, or both are the best predicted. This trend holds for narrow training sets and, to a lesser degree, for many diverse training sets, and it does not depend on which QSAR method or descriptor is used. The similarity may be defined with a descriptor different from the one used for the QSAR model. The similarity dependence for diverse training sets is somewhat unexpected. It appears to be greater for those data sets where the association between similar activities and similar structures (as encoded in the Patterson plot) is stronger. We propose a way to estimate the reliability of the prediction for an arbitrary chemical structure by a given QSAR model, given the training set from which the model was derived.
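The two discriminators described above can be illustrated with a minimal sketch. It assumes each molecule is represented as a set of integer feature identifiers and that similarity is the Tanimoto coefficient on those sets; the abstract specifies neither the descriptor nor the similarity index, so both are illustrative choices. The names `tanimoto` and `similarity_discriminators` and the default cutoff of 0.7 are hypothetical.

```python
from typing import FrozenSet, Sequence, Tuple

Fingerprint = FrozenSet[int]  # assumed representation: a set of feature ids


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient between two feature sets (assumed similarity index)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_discriminators(
    query: Fingerprint,
    training: Sequence[Fingerprint],
    cutoff: float = 0.7,  # hypothetical user-chosen cutoff
) -> Tuple[float, int]:
    """Return (similarity to nearest training molecule, number of neighbors).

    A neighbor is counted here as a training molecule whose similarity to
    the query is at or above the cutoff.
    """
    sims = [tanimoto(query, t) for t in training]
    nearest = max(sims, default=0.0)
    n_neighbors = sum(1 for s in sims if s >= cutoff)
    return nearest, n_neighbors


# Toy training set: two close analogs and one unrelated molecule.
training_set = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({7, 8, 9}),
]
nearest, n_neighbors = similarity_discriminators(
    frozenset({1, 2, 3, 4}), training_set, cutoff=0.5
)
```

Under the trend reported in the abstract, a query with a high nearest-neighbor similarity and many neighbors would be expected to fall among the better-predicted molecules, while a query with a nearest similarity near zero would fall among the worse-predicted ones.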
