Optimization of Molecular Representativeness

Representative subsets selected from larger data sets are useful in many chemoinformatics applications, including the design of information-rich compound libraries, the selection of compounds for biological evaluation, and the development of reliable quantitative structure-activity relationship (QSAR) models. Such subsets can overcome many of the problems typical of diverse subsets, most notably the tendency of diverse subsets to focus on outliers. Yet only a few algorithms for the selection of representative subsets have been reported in the literature. Here we report the development of two algorithms that select representative subsets from parent data sets by optimizing a newly devised representativeness function, either alone or simultaneously with the MaxMin function. The performance of the new algorithms was evaluated using several measures of their ability to produce (1) subsets which are, on average, close to data set compounds; (2) subsets which, on average, span the same space as the entire data set; (3) subsets mirroring the distribution of biological indications in a parent data set; and (4) test sets which are well predicted by qualitative QSAR models built on data set compounds. We demonstrate that for three data sets (containing biological indication data, logBBB permeation data, and Plasmodium falciparum inhibition data), subsets obtained using the new algorithms are more representative than subsets obtained by hierarchical clustering, k-means clustering, or MaxMin optimization in at least three of these measures.
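The contrast between diverse and representative subsets can be illustrated with a minimal sketch. It assumes Euclidean distances in a toy two-dimensional descriptor space. The function `maxmin_score` follows the usual definition of the MaxMin objective (the smallest pairwise distance within the subset), while `mean_nearest_distance` is only a hypothetical proxy inspired by measure (1) above; it is not the representativeness function devised in this work, which the abstract does not define. All names and data points are illustrative assumptions.

```python
import math
from itertools import combinations


def maxmin_score(points, subset):
    """MaxMin diversity objective: smallest pairwise distance within the subset.

    Larger values mean the selected compounds are more spread out.
    """
    return min(math.dist(points[i], points[j]) for i, j in combinations(subset, 2))


def mean_nearest_distance(points, subset):
    """Hypothetical proxy for measure (1): the average distance from every
    data set compound to its nearest subset member. Smaller values mean the
    subset lies, on average, closer to the data set compounds.
    """
    total = sum(min(math.dist(p, points[s]) for s in subset) for p in points)
    return total / len(points)


# Toy data: a dense cluster of four compounds plus two outliers.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),  # cluster
    (10.0, 10.0), (-10.0, -10.0),                    # outliers
]

diverse = [4, 5]  # indices of the two outliers
central = [0, 3]  # indices of two cluster members

for name, subset in (("diverse", diverse), ("central", central)):
    print(name, maxmin_score(points, subset), mean_nearest_distance(points, subset))
```

On this toy data the outlier pair scores higher on MaxMin, whereas the pair drawn from the cluster lies closer, on average, to the data set compounds, which is the outlier tendency of diverse subsets described above.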
