Algorithms for hierarchical clustering: an overview

We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations available in R and other software environments. We then consider hierarchical self‐organizing maps and mixture models, and review grid‐based clustering with a focus on hierarchical density‐based approaches. Finally, we describe a recently developed, very efficient (linear‐time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid‐based algorithm. © 2011 Wiley Periodicals, Inc.
