Using Pivots to Speed-Up k-Medoids Clustering

Clustering is a key technique within the knowledge discovery in databases (KDD) process, with k-means, and the more general k-medoids, being well-known incremental partition-based clustering algorithms. A fundamental issue within this class of algorithms is finding an initial set of medians (or medoids) that improves both the efficiency of the algorithms (e.g., by accelerating their convergence to a solution) and their effectiveness (e.g., by finding more meaningful clusters). In this article we therefore aim to provide a technique that, given a set of elements, quickly finds a very small number of elements as medoid candidates for this set, making it possible to improve both the efficiency and the effectiveness of existing clustering algorithms. We target the class of k-medoids algorithms in general, and propose a technique that selects a well-positioned subset of central elements to serve as the initial set of medoids for the clustering process. Our technique requires substantially fewer distance calculations than existing methods, thus improving the algorithm's efficiency without sacrificing effectiveness. A salient feature of the proposed technique is that it is not a new k-medoids clustering algorithm per se; rather, it can be used in conjunction with any existing clustering algorithm that is based on the k-medoids paradigm. Experimental results, using both synthetic and real datasets, confirm the efficiency, effectiveness and scalability of the proposed technique.
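To make the role of the initial medoid set concrete, the following is a minimal sketch of a generic alternating (Voronoi-iteration) k-medoids loop that accepts an arbitrary initial medoid set and counts distance computations. It does not implement the pivot-based selection proposed in the article, whose details are not given in this abstract; the names `CountingDistance`, `k_medoids`, and `euclidean`, as well as the choice of Euclidean distance, are assumptions made purely for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


class CountingDistance:
    """Hypothetical helper: wraps a distance function and counts its calls."""

    def __init__(self, dist: Callable[[Point, Point], float]) -> None:
        self.dist = dist
        self.calls = 0

    def __call__(self, a: Point, b: Point) -> float:
        self.calls += 1
        return self.dist(a, b)


def euclidean(a: Point, b: Point) -> float:
    # Assumption: Euclidean distance; any metric could be substituted.
    return math.dist(a, b)


def _assign(points, medoids, dist) -> List[List[int]]:
    """Assign every point index to the cluster of its nearest medoid."""
    clusters: List[List[int]] = [[] for _ in medoids]
    for i, p in enumerate(points):
        nearest = min(range(len(medoids)),
                      key=lambda c: dist(p, points[medoids[c]]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(points: Sequence[Point],
              initial_medoids: Sequence[int],
              dist: Callable[[Point, Point], float],
              max_iter: int = 100) -> Tuple[List[int], List[List[int]]]:
    """Generic alternating k-medoids started from the given medoid indices."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        clusters = _assign(points, medoids, dist)
        new_medoids = []
        for c, members in enumerate(clusters):
            if not members:
                # Keep the old medoid if its cluster became empty.
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes the summed distance to its cluster.
            best = min(members,
                       key=lambda m: sum(dist(points[m], points[o])
                                         for o in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            return medoids, clusters  # converged
        medoids = new_medoids
    return medoids, _assign(points, medoids, dist)


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    for start in ([0, 1], [0, 3]):
        d = CountingDistance(euclidean)
        medoids, _ = k_medoids(data, start, d)
        print(start, "->", medoids, "distance calls:", d.calls)
```

In this toy setting, starting from one medoid per group lets the loop converge in fewer iterations, and therefore with fewer distance computations, than starting from two medoids in the same group. Any initialization strategy can be plugged in through the `initial_medoids` argument without changing the clustering loop itself.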
