Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms

Clustering non-Euclidean data is difficult, and one of the most widely used algorithms besides hierarchical clustering is Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean (as used in k-means) is a good estimator for the cluster center, but the mean does not exist for arbitrary dissimilarities. PAM uses the medoid instead: the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains and applications. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm while still finding the same results as the original PAM algorithm. If we relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by eagerly performing additional swaps in each iteration. With the substantially faster SWAP, we can now explore faster initialization strategies, because (i) the classic ("BUILD") initialization now becomes the bottleneck, and (ii) our swap is fast enough to compensate for worse starting conditions. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications. While we do not study the parallelization of our approach in this work, it can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine and hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100 and k=200, we observed speedups of 458x and 1191x, respectively, compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.
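
To make the medoid definition and the SWAP phase concrete, the following is a minimal Python sketch. It is an illustration only, not the method proposed here: the function names (`medoid`, `total_deviation`, `brute_force_swap`) are hypothetical, the swap search simply recomputes the full cost for every candidate, and it contains neither the bookkeeping of the original PAM nor the O(k)-fold speedup described above. The only input it assumes is a precomputed dissimilarity matrix.

```python
def medoid(dissim, members):
    """Return the member with the smallest summed dissimilarity
    to all members of the cluster (the medoid)."""
    return min(members, key=lambda i: sum(dissim[i][j] for j in members))


def total_deviation(dissim, medoids):
    """Sum, over all objects, of the dissimilarity to the nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dissim)


def brute_force_swap(dissim, medoids):
    """Repeatedly apply the best (medoid, non-medoid) swap until no swap
    lowers the total deviation. Every candidate is scored by recomputing
    the full cost, which is deliberately naive (assumption: illustration
    only, not the accelerated SWAP of the paper)."""
    medoids = list(medoids)
    best_cost = total_deviation(dissim, medoids)
    improved = True
    while improved:
        improved = False
        best_swap = None
        for pos in range(len(medoids)):
            for cand in range(len(dissim)):
                if cand in medoids:
                    continue  # only non-medoids may be swapped in
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                cost = total_deviation(dissim, trial)
                if cost < best_cost:
                    best_cost, best_swap = cost, trial
        if best_swap is not None:
            medoids, improved = best_swap, True
    return medoids, best_cost


# Toy data: five points on a line, with absolute difference as dissimilarity.
xs = [0, 1, 2, 10, 11]
D = [[abs(a - b) for b in xs] for a in xs]

center = medoid(D, [0, 1, 2])                  # index of the medoid of {0, 1, 2}
final_medoids, final_cost = brute_force_swap(D, [0, 2])
```

On this toy input the medoid of the first three points is the middle one (index 1), and starting from the poor initial medoids at indices 0 and 2 the swap loop ends with one medoid in each group. Because only the dissimilarity matrix is used, the same sketch applies to any (dis-)similarity, which is the property the abstract highlights.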
