OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets

Compound selection methods currently available to chemists are based on maximum or minimum dissimilarity selection or on hierarchical clustering. Optimizable K-Dissimilarity Selection (OptiSim) is a novel and efficient stochastic selection algorithm which includes maximum and minimum dissimilarity-based selection as special cases. By adjusting the subsample size parameter K, it is possible to adjust the balance between representativeness and diversity in the compounds selected. The OptiSim algorithm is described, along with some analytical tools for comparing it to other selection methods. Such comparisons indicate that OptiSim can mimic the representativeness of selections based on hierarchical clustering and, at least in some cases, improve upon them.
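The abstract does not spell out the selection steps, so the following is only a minimal sketch of how a K-subsample dissimilarity selection of this kind is commonly described: draw up to K candidates that are not too close to anything already chosen, keep the one farthest from the current selection, and return the rest to a recycling pool. The function name `optisim_sketch`, the `radius` exclusion threshold, the use of Euclidean distance on coordinate tuples, and the handling of an exhausted candidate pool are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def optisim_sketch(points, n_select, k, radius=0.0, seed=0):
    """Illustrative K-subsample dissimilarity selection (hypothetical helper).

    points   : list of coordinate tuples (Euclidean distance is an assumption)
    n_select : number of items to pick
    k        : subsample size; k=1 behaves like minimum dissimilarity selection,
               k >= len(points) like maximum dissimilarity selection
    radius   : assumed exclusion radius; candidates closer than this to any
               selected item are discarded
    """
    rng = random.Random(seed)
    candidates = list(range(len(points)))
    rng.shuffle(candidates)
    selected = [candidates.pop()]  # random starting item
    recycle = []                   # subsample members that were not chosen

    def min_dist(i):
        # Distance from item i to its nearest already-selected item.
        return min(math.dist(points[i], points[j]) for j in selected)

    while len(selected) < n_select:
        # Build a subsample of up to k candidates outside the exclusion radius.
        subsample = []
        while len(subsample) < k and candidates:
            i = candidates.pop()
            if min_dist(i) >= radius:
                subsample.append(i)
            # Candidates inside the radius are dropped for good.

        if not subsample:
            if not recycle:
                break              # nothing left that can be selected
            candidates, recycle = recycle, []
            rng.shuffle(candidates)
            continue

        # Keep the subsample member most dissimilar to the current selection.
        best = max(subsample, key=min_dist)
        selected.append(best)
        subsample.remove(best)
        recycle.extend(subsample)

        # Refill the candidate pool from the recycling pool when it runs dry.
        if not candidates:
            candidates, recycle = recycle, []
            rng.shuffle(candidates)

    return selected
```

Under these assumptions, small values of `k` make the picks follow the density of the data (more representative), while large values push each pick toward the point farthest from the current selection (more diverse), which is the trade-off the abstract attributes to the subsample size K.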
