Comparison of chemical clustering methods using graph- and fingerprint-based similarity measures.

This paper compares several published methods for clustering chemical structures, using both graph- and fingerprint-based similarity measures. The clusterings produced by each method were compared to determine the degree of cluster overlap. Each method was also evaluated on how well it grouped structures into clusters sharing a non-trivial common substructure. Methods with adjustable parameters were tested to determine how stable each parameter is across datasets of varying size and composition. Our experiments suggest that both graph- and fingerprint-based similarity measures can be used effectively to generate chemical clusterings. They also suggest that the CAST and Yin-Chen methods, recently proposed for clustering gene expression patterns, may prove effective for clustering 2D chemical structures.
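As a rough illustration of the two quantities the abstract refers to, the sketch below computes a fingerprint-based similarity and a cluster-overlap score. Both choices are assumptions made for illustration: the abstract does not name the similarity coefficient or the overlap measure used in the paper. The Tanimoto coefficient is a common choice for 2D fingerprints, and the Rand index stands in here for "degree of cluster overlap". The function names and the set-of-bits fingerprint representation are hypothetical.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 1.0  # convention assumed here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rand_index(labels_a, labels_b):
    """Fraction of structure pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings place the two
    structures in the same cluster, or both place them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same structures")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two structures: nothing to disagree on
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    # Toy fingerprints: the sets hold indices of bits that are switched on.
    print(tanimoto({1, 2, 3}, {2, 3, 4}))
    # Two clusterings of four structures that differ only in label names.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The Rand index depends only on which structures are grouped together, not on the cluster labels themselves, so it can compare the output of methods that number their clusters differently.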
