How frequently do clusters occur in hierarchical clustering analysis? A graph theoretical approach to studying ties in proximity

Background: Hierarchical cluster analysis (HCA) is a widely used classificatory technique in many areas of scientific knowledge. Applications usually yield a single dendrogram from an HCA run over a given data set, using a grouping algorithm and a similarity measure. However, even when these parameters are fixed, ties in proximity (i.e. two clusters equidistant from a third one) may produce several different dendrograms with different clustering patterns (different classifications). This situation is usually disregarded and conclusions are based on a single result, which raises questions about whether clusters persist across all the resulting dendrograms. This happens, for example, when HCA is used to group molecular descriptors in order to select the least similar ones in QSAR studies.

Results: Representing dendrograms in graph theoretical terms allowed us to introduce four measures of cluster frequency in a canonical way, and to use them to calculate cluster frequencies over the set of all possible dendrograms, taking all ties in proximity into account. A toy example of well separated clusters and a set of 1666 molecular descriptors, calculated for a group of molecules having hepatotoxic activity, were used to show how our functions may be applied to study the effect of ties in HCA. These functions are not restricted to the tie case; we discuss the possibility of using them to derive cluster stability measurements on arbitrary sets of dendrograms having the same leaves, e.g. dendrograms obtained by varying HCA parameters. Ties were found to occur frequently, some yielding tens of thousands of dendrograms, even for small data sets.

Conclusions: Our approach was able to detect trends in clustering patterns by offering a simple way of measuring their frequency, which is often very low. This implies that inferences and models based on descriptor classifications (e.g. QSAR) are likely to be biased and therefore require an assessment of their reliability. Moreover, any classification of molecular descriptors is likely to be far from unique. Our results highlight the need to evaluate the effect of ties on clustering patterns before classification results can be used reliably.

Graphical abstract: Four cluster contrast functions identifying statistically sound clusters within dendrograms, taking ties in proximity into account.
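
The effect of ties described above can be illustrated with a minimal sketch. It is not the authors' implementation and does not reproduce their four cluster frequency measures; it assumes complete linkage, exact equality as the tie criterion, and a naive frequency defined as the fraction of distinct dendrograms in which a cluster appears. The function names (`enumerate_dendrograms`, `cluster_frequencies`) and the three-point data set are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def linkage_distance(c1, c2, dist):
    """Complete linkage (an assumption): largest pairwise distance."""
    return max(dist[frozenset((x, y))] for x in c1 for y in c2)


def enumerate_dendrograms(clusters, dist, merged=frozenset()):
    """Yield each dendrogram reachable by following every tied merge.

    A dendrogram is represented only by the set of its non-singleton
    clusters; merge heights are ignored in this sketch.
    """
    if len(clusters) == 1:
        yield merged
        return
    pairs = list(combinations(clusters, 2))
    dists = [linkage_distance(a, b, dist) for a, b in pairs]
    best = min(dists)
    for (a, b), d in zip(pairs, dists):
        if d == best:  # branch on every pair tied at the minimum distance
            new = a | b
            rest = [c for c in clusters if c not in (a, b)]
            yield from enumerate_dendrograms(rest + [new], dist, merged | {new})


def cluster_frequencies(labels, dist):
    """Return (naive cluster frequencies, number of distinct dendrograms)."""
    start = [frozenset([x]) for x in labels]
    # Different merge orders giving the same cluster set count once.
    dendrograms = set(enumerate_dendrograms(start, dist))
    counts = {}
    for dendrogram in dendrograms:
        for cluster in dendrogram:
            counts[cluster] = counts.get(cluster, 0) + 1
    freqs = {c: n / len(dendrograms) for c, n in counts.items()}
    return freqs, len(dendrograms)


# Hypothetical toy data: three points on a line at positions 0, 1 and 2,
# so that d(a, b) = d(b, c) and the first merge is tied.
toy_dist = {frozenset("ab"): 1, frozenset("bc"): 1, frozenset("ac"): 2}
freqs, n_dendrograms = cluster_frequencies("abc", toy_dist)
for cluster, f in sorted(freqs.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(cluster), f)
```

In this toy case the single tie yields two dendrograms, so the clusters {a, b} and {b, c} each appear in only half of them, while the root cluster appears in all. The exhaustive enumeration grows quickly with the number of ties, so the sketch is suitable only for very small data sets; with real-valued distances a tolerance would likely be needed in place of exact equality.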
