Paper information - Comparison of Combinatorial Clustering Methods on Pharmacological Data Sets Represented by Machine Learning-Selected Real Molecular Descriptors

Clustering algorithms play an important role in diversity-related tasks of modern chemoinformatics, with the widest applications being in pharmaceutical industry drug discovery programs. The performance of these grouping strategies depends on various factors such as molecular representation, mathematical method, algorithmic technique, and statistical distribution of the data. For this reason, the introduction and comparison of new methods are necessary in order to find the model that best fits the problem at hand. Earlier comparative studies report Ward's algorithm, using fingerprints for molecular description, as generally superior in this field. However, problems remain: other types of numerical descriptions have been little exploited, the current descriptor selection strategy is driven by trial and error, and no previous comparative study has considered a broader domain of the combinatorial methods for grouping chemoinformatic data sets. In this work, a comparison between combinatorial methods is performed, with five of them being novel in chemoinformatics. The experiments are carried out using eight data sets that are well established and validated in the medicinal chemistry literature. Each drug data set was represented by real molecular descriptors selected by machine learning techniques, which are consistent with the neighborhood principle. Statistical analysis of the results demonstrates that the pharmacological activities of the eight data sets can be modeled with a few families of 2D and 3D molecular descriptors, avoiding the classification problems associated with the presence of nonrelevant features. Three of the five proposed clustering algorithms show superior performance over most classical algorithms and are similar (or slightly superior in the most optimistic sense) to Ward's algorithm. The usefulness of these algorithms is also assessed in a comparative experiment against potent QSAR and machine learning classifiers, where they perform similarly in some cases.
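The abstract does not spell out what makes a clustering method "combinatorial". In the usual Lance-Williams sense, it is an agglomerative method in which the dissimilarity between a newly merged cluster and every other cluster is computed from the pre-merge dissimilarities alone, through a recurrence with coefficients alpha_i, alpha_j, beta and gamma. The sketch below illustrates that recurrence with the standard coefficients for Ward's method, assuming squared Euclidean dissimilarities. The function names `ward_coefficients` and `lance_williams_cluster` are hypothetical, and the code is not the authors' implementation, nor does it reproduce any of the five methods proposed in the paper.

```python
from itertools import combinations


def ward_coefficients(n_i, n_j, n_k):
    """Lance-Williams coefficients (alpha_i, alpha_j, beta, gamma) for Ward's method."""
    total = n_i + n_j + n_k
    return (n_i + n_k) / total, (n_j + n_k) / total, -n_k / total, 0.0


def lance_williams_cluster(points, n_clusters, coefficients=ward_coefficients):
    """Merge clusters until n_clusters remain; returns sorted lists of point indices.

    Assumes 1 <= n_clusters <= len(points). Ward's coefficients assume
    squared Euclidean dissimilarities, which is what is computed here.
    """
    members = {i: [i] for i in range(len(points))}
    dist = {
        frozenset((i, j)): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(range(len(points)), 2)
    }
    next_id = len(points)
    while len(members) > n_clusters:
        # Merge the closest pair of current clusters.
        pair = min(dist, key=dist.get)
        i, j = sorted(pair)
        d_ij = dist.pop(pair)
        n_i, n_j = len(members[i]), len(members[j])
        merged = members.pop(i) + members.pop(j)
        # Recurrence: d(k, i+j) = a_i*d(k,i) + a_j*d(k,j) + beta*d(i,j)
        #                         + gamma*|d(k,i) - d(k,j)|
        for k in members:
            d_ki = dist.pop(frozenset((k, i)))
            d_kj = dist.pop(frozenset((k, j)))
            a_i, a_j, beta, gamma = coefficients(n_i, n_j, len(members[k]))
            dist[frozenset((k, next_id))] = (
                a_i * d_ki + a_j * d_kj + beta * d_ij + gamma * abs(d_ki - d_kj)
            )
        members[next_id] = merged
        next_id += 1
    return sorted(sorted(m) for m in members.values())
```

For example, `lance_williams_cluster([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], 3)` returns `[[0, 1], [2, 3], [4]]`. Passing a different `coefficients` function yields other members of the same family; returning `(0.5, 0.5, 0.0, -0.5)` regardless of cluster sizes gives single linkage.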
