Comparison of Combinatorial Clustering Methods on Pharmacological Data Sets Represented by Machine Learning-Selected Real Molecular Descriptors

Clustering algorithms play an important role in diversity-related tasks of modern chemoinformatics, with the widest applications being in drug discovery programs in the pharmaceutical industry. The performance of these grouping strategies depends on various factors such as molecular representation, mathematical method, algorithmic technique, and statistical distribution of the data. For this reason, the introduction and comparison of new methods are necessary in order to find the model that best fits the problem at hand. Earlier comparative studies report Ward's algorithm, using fingerprints for molecular description, as generally superior in this field. However, problems remain: other types of numerical descriptions have been little exploited, the current descriptor selection strategy is driven by trial and error, and no previous comparative study has considered a broader domain of the combinatorial methods for grouping chemoinformatic data sets. In this work, a comparison between combinatorial methods is performed, with five of them being novel in chemoinformatics. The experiments are carried out using eight data sets that are well established and validated in the medicinal chemistry literature. Each drug data set was represented by real molecular descriptors selected by machine learning techniques, which are consistent with the neighborhood principle. Statistical analysis of the results demonstrates that the pharmacological activities of the eight data sets can be modeled with a few families of 2D and 3D molecular descriptors, avoiding classification problems associated with the presence of nonrelevant features. Three of the five proposed clustering algorithms show superior performance over most classical algorithms and are similar (or slightly superior in the most optimistic sense) to Ward's algorithm. The usefulness of these algorithms is also assessed in a comparative experiment against potent QSAR and machine learning classifiers, where they perform similarly in some cases.
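
For readers unfamiliar with the term, "combinatorial" clustering methods are commonly understood as agglomerative methods whose inter-cluster distances are updated after each merge by the Lance-Williams recurrence, d(k, i∪j) = α_i d(k,i) + α_j d(k,j) + β d(i,j) + γ |d(k,i) − d(k,j)|. The following is a minimal sketch of that recurrence, not the implementation used in the study: the function names, the toy descriptor vectors, and the choice of squared Euclidean distance are all assumptions made for illustration. Only the textbook coefficients for Ward's method and single linkage are shown; the five novel methods compared in the paper are not reproduced here.

```python
from itertools import combinations


def ward_coefficients(n_i, n_j, n_k):
    """Textbook Lance-Williams coefficients for Ward's method."""
    total = n_i + n_j + n_k
    return (n_i + n_k) / total, (n_j + n_k) / total, -n_k / total, 0.0


def single_linkage_coefficients(n_i, n_j, n_k):
    """Textbook Lance-Williams coefficients for single linkage."""
    return 0.5, 0.5, 0.0, -0.5


def lance_williams(points, coefficients, n_clusters):
    """Agglomerate `points` (hypothetical descriptor vectors) down to n_clusters."""
    clusters = {i: [i] for i in range(len(points))}
    # Initial dissimilarities: squared Euclidean distance (an assumption).
    dist = {
        (i, j): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(range(len(points)), 2)
    }

    def d(a, b):
        return dist[(min(a, b), max(a, b))]

    next_id = len(points)
    while len(clusters) > n_clusters:
        # Merge the closest pair of active clusters.
        i, j = min(combinations(sorted(clusters), 2), key=lambda p: d(*p))
        n_i, n_j, d_ij = len(clusters[i]), len(clusters[j]), d(i, j)
        # Update distances from every other cluster k to the new cluster.
        for k in clusters:
            if k in (i, j):
                continue
            a_i, a_j, beta, gamma = coefficients(n_i, n_j, len(clusters[k]))
            d_ki, d_kj = d(k, i), d(k, j)
            dist[(k, next_id)] = (
                a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)
            )
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Toy example: five compounds described by two made-up real-valued descriptors.
toy = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
ward_result = lance_williams(toy, ward_coefficients, 3)
single_result = lance_williams(toy, single_linkage_coefficients, 3)
```

Because only the coefficient function changes between methods, a new member of the family can be tried by supplying a different function with the same signature, which is what makes this formulation convenient for comparative studies.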
