Scalable Hierarchical Clustering: Twister Tries with a Posteriori Trie Elimination

Exact methods for Agglomerative Hierarchical Clustering (AHC) with average linkage do not scale well when the number of items to be clustered is large. The best known algorithms have quadratic complexity, a bound that is generally accepted and cannot be improved without exploiting the specifics of certain metric spaces. Twister tries is an algorithm that produces a dendrogram (i.e., the outcome of a hierarchical clustering) resembling the one produced by AHC, while needing only linear space and time. However, twister tries are sensitive to rare, but still possible, hash evaluations, which can have a disastrous effect on the final outcome. We propose a metaheuristic algorithm to overcome this sensitivity and show how approximate computations of dendrogram quality make it possible to evaluate the heuristic within reasonable time. The proposed metaheuristic is based on an evolutionary framework and integrates a surrogate model of the fitness to reduce computational time.
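To make the scaling problem concrete, the following is a minimal sketch of exact average-linkage AHC. It is not the twister tries algorithm, nor one of the optimal quadratic algorithms the abstract refers to. It is a deliberately naive illustration, and the function name `average_linkage`, its arguments, and the merge-tuple layout are assumptions chosen for this example. It shows the table of all pairwise distances, which already takes quadratic space, and the merge sequence that defines the dendrogram.

```python
from itertools import combinations


def average_linkage(points, dist):
    """Naive exact AHC with average linkage (illustrative sketch only).

    Returns the merges as tuples (cluster_a, cluster_b, height, size),
    where new clusters receive ids n, n + 1, ... in merge order.
    """
    n = len(points)
    # Active clusters: id -> list of member item indices.
    clusters = {i: [i] for i in range(n)}
    # All pairwise item distances: this table alone is quadratic in n.
    d = {(i, j): dist(points[i], points[j])
         for i, j in combinations(range(n), 2)}

    def avg(a, b):
        # Average linkage: mean distance over all cross-cluster item pairs.
        total = sum(d[(min(i, j), max(i, j))]
                    for i in clusters[a] for j in clusters[b])
        return total / (len(clusters[a]) * len(clusters[b]))

    merges = []
    next_id = n
    while len(clusters) > 1:
        # Pick the closest pair of active clusters (ties: lowest ids first).
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: avg(*p))
        height = avg(a, b)
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, height, len(clusters[next_id])))
        next_id += 1
    return merges
```

For example, `average_linkage([0.0, 1.0, 5.0, 6.0], lambda x, y: abs(x - y))` first merges the two tight pairs and then joins the resulting clusters at their average cross distance. Because this sketch recomputes averages from scratch at every step, its running time is worse than quadratic; it is meant only to show what an approximate linear-time method has to avoid.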
