Maximum-Score Diversity Selection for Early Drug Discovery

Diversity selection is a common task in early drug discovery. One drawback of current approaches is that they usually consider only structural diversity and ignore activity information. In this article, we present a modified version of diversity selection, which we term Maximum-Score Diversity Selection, that additionally takes the estimated or predicted activities of the molecules into account. We show that finding an optimal solution to this problem is NP-hard and therefore computationally very expensive, so heuristic approaches are needed. After a discussion of existing approaches, we present our new method, which is computationally far more efficient while producing comparable results. We conclude by validating these theoretical differences on several data sets.
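To make the problem statement concrete, the following is a minimal sketch of a generic greedy heuristic that trades off activity score against structural diversity. It is not the method presented in the article; the function name `greedy_select`, the `weight` parameter, the use of Euclidean distance on descriptor vectors, and the weighted-sum objective are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def greedy_select(points, scores, k, weight=0.5):
    """Pick k items, balancing score against distance to items already chosen.

    Hypothetical illustration only: `points` are descriptor vectors,
    `scores` are estimated activities, and `weight` in [0, 1] sets how
    much the score counts relative to diversity.
    """
    # Start from the single best-scored item.
    chosen = [max(range(len(points)), key=lambda i: scores[i])]

    while len(chosen) < k:
        def gain(i):
            # Diversity contribution: distance to the nearest chosen item.
            nearest = min(dist(points[i], points[j]) for j in chosen)
            return weight * scores[i] + (1 - weight) * nearest

        candidates = [i for i in range(len(points)) if i not in chosen]
        chosen.append(max(candidates, key=gain))

    return chosen


# Toy data: items 0 and 1 are near-duplicates with the highest scores.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.0, 1.0)]
scores = [0.9, 0.8, 0.5, 0.4]

print(greedy_select(points, scores, k=2))              # [0, 2]
print(greedy_select(points, scores, k=2, weight=1.0))  # [0, 1]
```

With `weight=1.0` the selection reduces to picking the top-scored items and returns the two near-duplicates, which is the behaviour a combined score-and-diversity objective is meant to avoid. A greedy pass like this gives no optimality guarantee; an exact search would have to compare all k-subsets.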
