Hierarchical Clustering of Large Databases and Classification of Antibiotics at High Noise Levels

Abstract: A new algorithm for divisive hierarchical clustering of chemical compounds based on 2D structural fragments is proposed. The algorithm is deterministic: it always produces the same clustering regardless of the order of the input records, and it can process a database of up to 2 million records on a standard PC. The algorithm was used to classify 1,183 antibiotics mixed with 999,994 random chemical structures. The similarity threshold at which active and inactive compounds were best separated was estimated as 0.6. At this threshold, 85.7% of the antibiotics were successfully classified, with 0.4% of compounds classified inaccurately. An .sdf file containing the probe molecules was created for clustering of external databases.

Keywords: Molecular structure, hierarchical clustering, algorithm, classification of antibiotics

1. Introduction

The clustering problem can be defined as follows: N given data points in a D-dimensional space are to be organized into K clusters such that data points within one cluster are more similar to each other than to data points in different clusters. Clustering algorithms can be classified as partition algorithms and hierarchical algorithms [1]. Partition algorithms are fast and require little memory. K-means clustering is an example of a partition algorithm [2,3]. Hierarchical algorithms are subdivided into agglomerative and divisive algorithms. Generally, the results of hierarchical algorithms are illustrative and easy to interpret. Agglomerative algorithms are

[1] Robert D. Clark, et al. OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets, 1997, J. Chem. Inf. Comput. Sci.

[2] Dimitris K. Agrafiotis, et al. A Cluster-Based Strategy for Assessing the Overlap between Large Chemical Libraries and Its Application to a Recent Acquisition, 2006, J. Chem. Inf. Model.

[3] Weizhong Li. A Fast Clustering Algorithm for Analyzing Highly Similar Compounds of Very Large Libraries, 2006, J. Chem. Inf. Model.

[4] Gisbert Schneider, et al. A Hierarchical Clustering Approach for Large Compound Libraries, 2005, J. Chem. Inf. Model.

[5] Sergei V. Trepalin, et al. New Diversity Calculations Algorithms Used for Compound Selection, 2002, J. Chem. Inf. Comput. Sci.

[6] Yao Wang, et al. A robust and scalable clustering algorithm for mixed type attributes in large database environment, 2001, KDD '01.

[7] Peter Willett, et al. A comparison of some hierarchal monothetic divisive clustering algorithms for structure-property correlation, 1983.

[8] L. Kelley, et al. An automated approach for clustering an ensemble of NMR-derived protein structures into conformationally related subfamilies, 1996, Protein engineering.

[9] P. Willett, et al. Implementation of nonhierarchic cluster analysis methods in chemical information structure search, 1986.

[10] Ray A. Jarvis, et al. Clustering Using a Similarity Measure Based on Shared Near Neighbors, 1973, IEEE Transactions on Computers.

[11] M S Lajiness, et al. Implementing drug screening programs using molecular similarity methods, 1989, Progress in clinical and biological research.

[12] Peter Willett, et al. Recent trends in hierarchic document clustering: A critical review, 1988, Inf. Process. Manag.

[13] S. Heller, et al. An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier, 2003.

[14] Peter Willett, et al. Similarity-based virtual screening using 2D fingerprints, 2006, Drug discovery today.

[15] Dimitris K. Agrafiotis, et al. Radial Clustergrams: Visualizing the Aggregate Properties of Hierarchical Clusters, 2007, J. Chem. Inf. Model.

[16] Sergei V. Trepalin, et al. The centroidal algorithm in molecular similarity and diversity calculations on confidential datasets, 2005, J. Comput. Aided Mol. Des.

[17] David Weininger, et al. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, 1988, J. Chem. Inf. Comput. Sci.

[18] P. Willett, et al. Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures, 2004, Organic & biomolecular chemistry.

[19] Tian Zhang, et al. BIRCH: an efficient data clustering method for very large databases, 1996, SIGMOD '96.

[20] Lori B. Pfahler, et al. Lead Discovery Using Stochastic Cluster Analysis (SCA): A New Method for Clustering Structurally Similar Compounds, 1998, J. Chem. Inf. Comput. Sci.

[21] Peter Willett, et al. Implementation of nonhierarchic cluster analysis methods in chemical information systems: selection of compounds for biological testing and clustering of substructure search output, 1986, J. Chem. Inf. Comput. Sci.

[22] Peter Willett. Similarity and Clustering in Chemical Information Systems, 1987.

[23] Gisbert Schneider, et al. NIPALSTREE: A New Hierarchical Clustering Approach for Large Compound Libraries and Its Application to Virtual Screening, 2006, J. Chem. Inf. Model.

[24] William Lingran Chen, et al. MCSS: a new algorithm for perception of maximal common substructures and its application to NMR spectral studies. 1. The algorithm, 1992, J. Chem. Inf. Comput. Sci.

[26] Christos A. Nicolaou, et al. Ties in Proximity and Clustering Compounds, 2001, J. Chem. Inf. Comput. Sci.

[27] P. Gács, et al. Algorithms, 1992.

[28] P. Willett. A comparison of some hierarchal agglomerative clustering algorithms for structure-property correlation, 1982.

[29] George Karypis, et al. A Comparison of Document Clustering Techniques, 2000.

[30] Anil K. Jain, et al. Algorithms for Clustering Data, 1988.

[31] David Bawden, et al. Comparison of hierarchical cluster analysis techniques for automatic classification of chemical structures, 1981, J. Chem. Inf. Comput. Sci.

[32] S. Wold, et al. Fuzzy clustering of 627 alcohols, guided by a strategy for cluster analysis of chemical compounds for combinatorial chemistry, 1998.

[33] Sergei V. Trepalin, et al. Advanced Exact Structure Searching in Large Databases of Chemical Compounds, 2003, J. Chem. Inf. Comput. Sci.

[34] Sergei V. Trepalin, et al. CheD: Chemical Database Compilation Tool, Internet Server, and Client for SQL Servers, 2001, J. Chem. Inf. Comput. Sci.

[35] P. Willett, et al. A Fast Algorithm For Selecting Sets Of Dissimilar Molecules From Large Chemical Databases, 1995.

[36] A. Bender, et al. Circular fingerprints: flexible molecular descriptors with applications from physical chemistry to ADME, 2006, IDrugs: the investigational drugs journal.

[37] R. Mojena, et al. Hierarchical Grouping Methods and Stopping Rules: An Evaluation, 1977, Comput. J.

[38] John M. Barnard, et al. Chemical Similarity Searching, 1998, J. Chem. Inf. Comput. Sci.

[39] W. Bremser. HOSE - a novel substructure code, 1978.

[40] Jennifer R. Krumrine, et al. Statistical tools for virtual screening, 2005, Journal of medicinal chemistry.

[41] John M. Barnard, et al. Clustering Methods and Their Uses in Computational Chemistry, 2003.