A Computational Comparison of Parallel and Distributed K-median Clustering Algorithms on Large-Scale Image Data

Among the most commonly used clustering algorithms are those that solve the well-known k-median problem. Their main advantages are that they are simple to implement and use, and that they are flexible in the choice of dissimilarity measure (which need not be a metric). K-median algorithms are also known to be more robust to noise and outliers than k-means algorithms. Despite this, they have seen limited use for large-scale clustering problems because of their high computational and space complexity. This work presents a computational comparison of k-median clustering algorithms in a specific large-scale setting: clustering large image collections. We implement distributed versions of the most common k-median clustering algorithms and compare them with our parallel heuristic for solving large-scale k-median problem instances. We analyze the clustering results with respect to external evaluation measures and run time.
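To make the problem statement concrete, the following is a minimal sketch of the k-median objective and of a simple alternating (k-medoids style) heuristic over a precomputed dissimilarity matrix. It is an illustration only, not the distributed algorithms or the parallel heuristic compared in this work. The function names `kmedian_cost` and `alternate_kmedian` are hypothetical, and the full n-by-n matrix it assumes is exactly the kind of space cost that limits such methods on large instances.

```python
import random


def kmedian_cost(dissim, medians):
    """Sum, over all points, of the dissimilarity to the nearest chosen median.

    dissim[i][j] is the dissimilarity from point i to candidate median j;
    it need not be symmetric or satisfy the triangle inequality.
    """
    return sum(min(row[m] for m in medians) for row in dissim)


def alternate_kmedian(dissim, k, seed=0, max_iter=100):
    """Illustrative alternating heuristic (assumed, not the paper's method)."""
    rng = random.Random(seed)
    n = len(dissim)
    medians = rng.sample(range(n), k)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest median.
        # A median always stays in its own cluster, so no cluster is empty.
        clusters = {m: [] for m in medians}
        for i in range(n):
            if i in clusters:
                nearest = i
            else:
                nearest = min(medians, key=lambda m: dissim[i][m])
            clusters[nearest].append(i)
        # Update step: inside each cluster, pick the member that minimizes
        # the total dissimilarity from the cluster's points to it.
        new_medians = [
            min(members, key=lambda c: sum(dissim[i][c] for i in members))
            for members in clusters.values()
        ]
        if set(new_medians) == set(medians):
            break  # no median changed: a local optimum was reached
        medians = new_medians
    return sorted(medians), kmedian_cost(dissim, medians)


if __name__ == "__main__":
    # Toy data: two well-separated groups on a line, absolute difference
    # used as the dissimilarity.
    points = [0, 1, 2, 10, 11, 12]
    dissim = [[abs(a - b) for b in points] for a in points]
    print(alternate_kmedian(dissim, k=2))
```

Because the heuristic only reads entries of `dissim`, any dissimilarity can be plugged in, which reflects the flexibility noted above. It offers no optimality guarantee and stops at a local optimum.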
