AGORAS: A Fast Algorithm for Estimating Medoids in Large Datasets

Abstract

The k-medoids methods for modeling clustered data have many desirable properties, such as robustness to noise and the ability to use non-numerical values. However, they are typically not applied to large datasets because of their computational complexity. In this paper, we present AGORAS, a novel heuristic algorithm for the k-medoids problem whose algorithmic complexity is driven by k, the number of clusters, rather than n, the number of data points. Our algorithm attempts to isolate a sample from each individual cluster within a sequence of uniformly drawn samples taken from the complete data. As a result, computing the k-medoids solution with our method involves solving only k trivial sub-problems of centrality. This allows our algorithm to run in comparable time for arbitrarily large datasets that share the same underlying density distribution. We evaluate AGORAS experimentally against PAM and CLARANS, two of the best-known existing algorithms for the k-medoids problem, across a variety of published and synthetic datasets. We find that AGORAS outperforms PAM in running time by up to four orders of magnitude for datasets with fewer than 10,000 points, and that it outperforms CLARANS by two orders of magnitude on a dataset of just 64,000 points. Moreover, in some cases AGORAS also outperforms both in terms of cluster quality.
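To make the phrase "sub-problem of centrality" concrete, the following minimal Python sketch computes the medoid of one small sample, that is, the member with the smallest total distance to the other members. It is not the AGORAS algorithm: the abstract does not specify how samples are drawn or isolated per cluster, so the function names `medoid` and `total_cost`, the use of Euclidean distance, and the toy data are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, ...]


def medoid(points: Sequence[Point]) -> Point:
    """Return the member of `points` with the smallest summed distance to all members.

    Assumption: Euclidean distance. Any dissimilarity function could be
    substituted, which is what lets k-medoids handle non-numerical data.
    """
    if not points:
        raise ValueError("points must be non-empty")
    # Brute force: O(m^2) distance evaluations for a sample of size m.
    # This is cheap when m is small and independent of the dataset size n.
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


def total_cost(data: Sequence[Point], medoids: Sequence[Point]) -> float:
    """Sum over all data points of the distance to the nearest medoid."""
    return sum(min(math.dist(x, m) for m in medoids) for x in data)


# Hypothetical sample assumed to come from a single cluster, with one outlier.
sample = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
center = medoid(sample)
```

If each of the k samples really does come from a single cluster, running `medoid` once per sample would give k candidate medoids, and `total_cost` could then be used to compare candidate sets on the full data. Because the medoid is always an actual data point, the outlier at (10, 0) shifts the result far less than it would shift a mean.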
