Comparison of methods based on diversity and similarity for molecule selection and the analysis of drug discovery data.

The concepts of molecular diversity and similarity are widely used in quantitative methods for designing (selecting) a representative set of molecules and for analyzing the relationship between chemical structure and biological activity. We review methods and algorithms for designing a diverse set of molecules in chemical space using clustering, cell-based partitioning, or other distance-based approaches. Analogous cell-based and clustering methods are described for analyzing drug-discovery data to predict activity in virtual screening. We compare the performance of these methods, and the comparison also covers the choice of descriptor variables used to characterize chemical structure. We find that the diversity of a selected set is quite sensitive to both the statistical selection method and the choice of molecular descriptors and that, for the dataset used in this study, random selection works surprisingly well in providing a set of data for analysis.
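To make the idea of a distance-based design concrete, the following is a minimal sketch of one such approach: a greedy (farthest-point) heuristic for a maximin-style selection, compared with a random selection of the same size. It is an illustration only, not the procedure used in the study. The two-dimensional uniform "descriptors", the Euclidean distance, the subset size of 10, and the function names `greedy_maximin` and `min_pairwise_distance` are all assumptions introduced here.

```python
import math
import random


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two points in the set."""
    return min(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )


def greedy_maximin(candidates, k, start=0):
    """Pick k indices greedily: each step adds the candidate that is
    farthest from its nearest already-selected point."""
    selected = [start]
    # nearest[i] = distance from candidate i to its closest selected point
    nearest = [math.dist(c, candidates[start]) for c in candidates]
    while len(selected) < k:
        nxt = max(range(len(candidates)), key=nearest.__getitem__)
        selected.append(nxt)
        nearest = [
            min(d, math.dist(c, candidates[nxt]))
            for d, c in zip(nearest, candidates)
        ]
    return selected


# Hypothetical data: 200 "molecules" described by two made-up descriptors.
rng = random.Random(0)
descriptors = [(rng.random(), rng.random()) for _ in range(200)]

diverse_idx = greedy_maximin(descriptors, k=10)
random_idx = rng.sample(range(len(descriptors)), 10)

# Larger minimum pairwise distance means a more spread-out selection.
diverse_score = min_pairwise_distance([descriptors[i] for i in diverse_idx])
random_score = min_pairwise_distance([descriptors[i] for i in random_idx])
print(f"greedy maximin: {diverse_score:.3f}, random: {random_score:.3f}")
```

The minimum pairwise distance measures only geometric spread in the chosen descriptor space. It says nothing about how useful the selected set is for later activity analysis, which is the sense in which random selection is reported above to perform well.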
