Fast hierarchical clustering and other applications of dynamic closest pairs

We develop data structures for dynamic closest pair problems with arbitrary distance functions that do not necessarily come from any geometric structure on the objects. Based on a technique previously used by the author for Euclidean closest pairs, we show how to insert objects into and delete objects from an <i>n</i>-object set, maintaining the closest pair, in <i>O</i>(<i>n</i> log<sup>2</sup> <i>n</i>) time per update and <i>O</i>(<i>n</i>) space. With quadratic space, we can instead use a quadtree-like structure to achieve an optimal time bound, <i>O</i>(<i>n</i>) per update. We apply these data structures to hierarchical clustering, greedy matching, and TSP heuristics, and discuss other potential applications in machine learning, Gröbner bases, and local improvement algorithms for partition and placement problems. Experiments show our new methods to be faster in practice than previously used heuristics.
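To make the problem setting concrete, the following is a minimal Python sketch of a dynamic closest pair interface driving agglomerative clustering. It is not the data structure developed in the paper: it is a simple baseline in which each object stores a pointer to its nearest neighbor, so a deletion can trigger many rescans and cost quadratic time in the worst case. The names `NaiveClosestPair`, `agglomerate`, and the single-linkage cluster distance are illustrative assumptions, not taken from the paper.

```python
class NaiveClosestPair:
    """Baseline dynamic closest pair (assumed illustration, not the paper's method).

    Each object keeps (distance, nearest other object). `dist` may be any
    symmetric function on hashable objects; no geometry is assumed.
    """

    def __init__(self, dist):
        self.dist = dist
        self.nn = {}  # object -> (distance, neighbor), or None if it is alone

    def _rescan(self, x):
        # Recompute the nearest neighbor of x by scanning every other object.
        best = None
        for y in self.nn:
            if y != x:
                d = self.dist(x, y)
                if best is None or d < best[0]:
                    best = (d, y)
        self.nn[x] = best

    def insert(self, x):
        # O(n) distance evaluations: find x's neighbor, then let x replace
        # the stored neighbor of any object that is now closer to x.
        self.nn[x] = None
        self._rescan(x)
        for y in self.nn:
            if y != x:
                d = self.dist(x, y)
                if self.nn[y] is None or d < self.nn[y][0]:
                    self.nn[y] = (d, x)

    def delete(self, x):
        # Every object that pointed at x must be rescanned; this is the
        # step that can cost O(n^2) distance evaluations in the worst case.
        del self.nn[x]
        for y in list(self.nn):
            if self.nn[y] is not None and self.nn[y][1] == x:
                self._rescan(y)

    def closest_pair(self):
        # Returns (distance, a, b), or None if fewer than two objects remain.
        best = None
        for x, entry in self.nn.items():
            if entry is not None and (best is None or entry[0] < best[0]):
                best = (entry[0], x, entry[1])
        return best


def agglomerate(points, point_dist):
    """Hierarchical clustering by repeatedly merging the closest two clusters.

    Single linkage is used here purely as an example cluster distance.
    """
    def cluster_dist(a, b):
        return min(point_dist(p, q) for p in a for q in b)

    cp = NaiveClosestPair(cluster_dist)
    for p in points:
        cp.insert(frozenset([p]))

    merges = []
    while len(cp.nn) > 1:
        d, a, b = cp.closest_pair()
        cp.delete(a)
        cp.delete(b)
        cp.insert(a | b)  # the merged cluster is a new object
        merges.append((d, a, b))
    return merges


merges = agglomerate([0, 1, 5, 6, 20], lambda p, q: abs(p - q))
```

Each clustering step is two deletions and one insertion, which is why the per-update cost of the closest pair structure determines the overall running time of the clustering algorithm.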
