Near-optimal large-scale k-medoids clustering

Abstract The k-medoids (k-median) problem is one of the best-known unsupervised clustering problems. Because of its computational complexity, finding high-quality solutions for huge-scale datasets remains extremely challenging. Many approaches that find optimal or high-quality solutions are applicable only to small and medium-size instances. On the other hand, many parallel and distributed algorithms that can handle huge-scale datasets usually provide very poor solutions. In this paper, we develop the first parallel, distributed primal–dual heuristic algorithm for the k-medoids problem. Its main component is a very efficient parallel subgradient column generation procedure that solves a Lagrangian dual problem and yields a tight bound on solution quality. High-quality solutions are then produced by a parallel core selection technique. We considerably reduce the computational burden and memory load by employing a nearest neighbor strategy to approximate the dissimilarity matrix. We demonstrate that our algorithm finds near-optimal solutions, as confirmed by the tightness of the dual bounds, for instances that are much larger than those considered in the literature to date. Our experiments include clustering large-scale collections of face images into several thousand clusters. We show that our approach outperforms improved parallel versions of the most popular k-medoids clustering algorithms while achieving nearly linear parallel speedup.
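To make the primal–dual idea concrete, the following is a minimal serial sketch of a textbook Lagrangian relaxation of k-medoids solved by a subgradient method. It is not the algorithm of this paper: it uses a dense dissimilarity matrix and omits column generation, core selection, the nearest neighbor approximation, and all parallelism. The function name `lagrangian_k_medoids`, its parameters `iters` and `theta`, and the step-size rule are assumptions made for illustration only.

```python
import random


def lagrangian_k_medoids(d, k, iters=200, theta=1.0):
    """Illustrative subgradient method on the Lagrangian dual of k-medoids.

    d is a square list-of-lists dissimilarity matrix with at least two rows.
    Returns (medoids, upper_bound, lower_bound).
    """
    n = len(d)
    # One multiplier per point, started at the distance to its nearest neighbor.
    lam = [min(d[i][j] for j in range(n) if j != i) for i in range(n)]
    best_lb, best_ub, best_medoids = float("-inf"), float("inf"), []
    for _ in range(iters):
        # rho[j]: reduced cost of choosing j as a medoid once the
        # "each point is assigned exactly once" constraints are relaxed.
        rho = [sum(min(d[i][j] - lam[i], 0.0) for i in range(n)) for j in range(n)]
        medoids = sorted(range(n), key=rho.__getitem__)[:k]
        # Dual (lower) bound given by the current multipliers.
        lb = sum(lam) + sum(rho[j] for j in medoids)
        best_lb = max(best_lb, lb)
        # Primal (upper) bound: assign every point to its closest chosen medoid.
        ub = sum(min(d[i][j] for j in medoids) for i in range(n))
        if ub < best_ub:
            best_ub, best_medoids = ub, medoids
        # Subgradient: 1 minus the number of relaxed assignments of point i.
        g = [1 - sum(1 for j in medoids if d[i][j] < lam[i]) for i in range(n)]
        norm2 = sum(gi * gi for gi in g)
        if norm2 == 0 or best_ub - best_lb < 1e-9:
            break
        step = theta * (best_ub - lb) / norm2
        lam = [l + step * gi for l, gi in zip(lam, g)]
    return best_medoids, best_ub, best_lb


# Usage on synthetic 2-D points (hypothetical data, Euclidean dissimilarities).
random.seed(0)
pts = [(random.random(), random.random()) for _ in range(30)]
d = [[((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 for (bx, by) in pts] for (ax, ay) in pts]
medoids, upper, lower = lagrangian_k_medoids(d, k=4)
print(medoids, upper, lower)
```

The gap between `upper` and `lower` bounds how far the returned medoids can be from optimal, which is the sense in which a dual bound certifies solution quality.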
