PAMAE: Parallel k-Medoids Clustering with High Accuracy and Efficiency

The k-medoids algorithm is one of the best-known clustering algorithms. However, it is not as widely used for big data analytics as the k-means algorithm, mainly because of its high computational complexity. Many studies have attempted to solve the efficiency problem of the k-medoids algorithm, but all such studies have improved efficiency at the expense of accuracy. In this paper, we propose a novel parallel k-medoids algorithm, which we call PAMAE, that achieves both high accuracy and high efficiency. We identify two factors, "global search" and "entire data", that are essential to achieving high accuracy but are very time-consuming if considered simultaneously. Thus, our key idea is to apply them individually through two phases, parallel seeding and parallel refinement, neither of which is costly. The first phase performs global search over sampled data, and the second phase performs local search over the entire data. Our theoretical analysis proves that this serial execution of the two phases leads to an accurate solution that would be achieved by global search over the entire data. To validate the merit of our approach, we implement PAMAE on Spark as well as Hadoop and conduct extensive experiments using various real-world data sets on 12 Microsoft Azure machines (48 cores). The results show that PAMAE significantly outperforms most of the recent parallel algorithms and, at the same time, produces clustering quality comparable to that of the previous most accurate algorithm. The source code and data are available at https://github.com/jaegil/k-Medoid.
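
The two-phase idea described above can be illustrated with a small single-machine sketch. This is not the authors' Spark or Hadoop implementation: it assumes Euclidean distance, uses a plain PAM-style swap search as the "global search" on each sample, and runs the samples in a sequential loop where a parallel version would distribute them. All function and parameter names (`parallel_seeding`, `parallel_refinement`, `num_samples`, `sample_size`) are hypothetical.

```python
import random


def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam_swap(sample, k, rng):
    """Global search on a small sample: accept any medoid/non-medoid
    swap that lowers the cost, until no improving swap remains."""
    medoids = rng.sample(sample, k)
    best = total_cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                cost = total_cost(sample, trial)
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids


def parallel_seeding(points, k, num_samples, sample_size, rng):
    """Phase 1 (sketch): global search over sampled data. Each sample
    is independent, so this loop is what a parallel version would
    distribute. The best seed set is chosen by its cost on all points."""
    best_medoids, best_cost = None, float("inf")
    for _ in range(num_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam_swap(sample, k, rng)
        cost = total_cost(points, medoids)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids


def parallel_refinement(points, medoids):
    """Phase 2 (sketch): local search over the entire data. Assign every
    point to its nearest seed, then move each medoid to the member that
    minimizes the summed distance within its own cluster."""
    clusters = [[] for _ in medoids]
    for p in points:
        idx = min(range(len(medoids)), key=lambda i: dist(p, medoids[i]))
        clusters[idx].append(p)
    refined = []
    for members, old in zip(clusters, medoids):
        if not members:
            refined.append(old)  # keep the seed if its cluster is empty
            continue
        refined.append(
            min(members, key=lambda c: sum(dist(c, q) for q in members))
        )
    return refined
```

A call such as `parallel_refinement(points, parallel_seeding(points, k, 5, 40, random.Random(0)))` chains the two phases. The refinement step here is applied once; each cluster's update is independent of the others, which is what makes this phase a natural fit for data-parallel execution.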
