A Fractal Approach for Selecting an Appropriate Bin Size for Cell-Based Diversity Estimation

A novel approach for selecting an appropriate bin size for cell-based diversity assessment is presented. The method measures the sensitivity of the diversity index as a function of grid resolution, using a box-counting algorithm that is reminiscent of those used in fractal analysis. It is shown that the relative variance of the diversity score (sum of squared cell occupancies) of several commonly used molecular descriptor sets exhibits a bell-shaped distribution, whose exact characteristics depend on the distribution of the data set, the number of points considered, and the dimensionality of the feature space. The peak of this distribution represents the optimal bin size for a given data set and sample size. Although box counting can be performed in an algorithmically efficient manner, the ability of cell-based methods to distinguish between subsets of different spread falls sharply with dimensionality, and the method becomes useless beyond a few dimensions.
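The procedure summarized above can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions the abstract does not state: descriptors are pre-scaled to the unit hypercube, every axis is divided into the same number of bins, "relative variance" is taken to be the variance of the diversity score over random subsets divided by the squared mean, and the function names (`diversity_score`, `relative_variance`, `select_bin_count`) are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pvariance


def diversity_score(points, bins_per_axis):
    """Sum of squared cell occupancies (box counting on a uniform grid).

    Assumes every coordinate lies in [0, 1]; a value of exactly 1.0 is
    clamped into the last bin.
    """
    cells = Counter(
        tuple(min(int(x * bins_per_axis), bins_per_axis - 1) for x in p)
        for p in points
    )
    return sum(count * count for count in cells.values())


def relative_variance(data, sample_size, bins_per_axis, n_samples, rng):
    """Variance of the score over random subsets, divided by the squared mean.

    This normalization is an assumption made for illustration only.
    """
    scores = [
        diversity_score(rng.sample(data, sample_size), bins_per_axis)
        for _ in range(n_samples)
    ]
    m = mean(scores)
    return pvariance(scores) / (m * m)


def select_bin_count(data, sample_size, candidates, n_samples=200, seed=0):
    """Return the candidate bin count at the peak of the relative variance,
    together with the full profile over all candidates."""
    rng = random.Random(seed)
    profile = {
        b: relative_variance(data, sample_size, b, n_samples, rng)
        for b in candidates
    }
    best = max(profile, key=profile.get)
    return best, profile


if __name__ == "__main__":
    # Hypothetical 2-D data set: two Gaussian clusters clipped to [0, 1].
    gen = random.Random(42)
    clip = lambda v: min(max(v, 0.0), 1.0)
    data = [
        (clip(gen.gauss(cx, 0.05)), clip(gen.gauss(cy, 0.05)))
        for cx, cy in [(0.3, 0.3), (0.7, 0.6)]
        for _ in range(500)
    ]
    best, profile = select_bin_count(data, sample_size=50,
                                     candidates=[1, 2, 4, 8, 16, 32, 64])
    print("selected bins per axis:", best)
```

With a single bin per axis every subset receives the same score, so the relative variance is zero. With very fine grids almost every point occupies its own cell and the score again barely varies between subsets. The resolution between these extremes at which the score is most sensitive to the choice of subset is the one the sketch selects.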
