The use of k-means++ for approximate spectral clustering of large datasets

Spectral clustering (SC) has become widely used in recent years thanks to its nonparametric model, its ability to extract clusters lying on different manifolds, and its ease of application. However, SC is infeasible for large datasets because of its high computational cost and memory requirements. To address this challenge, approximate spectral clustering (ASC) has been proposed. ASC involves two steps: first, a limited number of data representatives (also known as prototypes) are selected by sampling or quantization methods; then SC is applied to these representatives using various similarity criteria. In this study, several quantization and sampling methods are compared for ASC. Among them, k-means++, a recently popular clustering algorithm, is used for the first time to select prototypes in ASC. Experiments on different datasets indicate that k-means++ is a suitable alternative to neural gas and selective sampling in terms of both accuracy and computational cost.
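The two-step ASC pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: it uses scikit-learn's KMeans (which applies k-means++ seeding by default) for prototype selection and a plain RBF affinity for the spectral step, whereas the paper compares several prototype-selection methods and similarity criteria. The dataset, prototype count, and cluster count are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_blobs

# Illustrative "large" dataset: 5000 points in 4 Gaussian blobs.
X, _ = make_blobs(n_samples=5000, centers=4, random_state=0)

# Step 1: quantize the data into a small set of prototypes using
# k-means++ seeding (scikit-learn's default initialization).
n_prototypes = 100
km = KMeans(n_clusters=n_prototypes, init="k-means++",
            n_init=10, random_state=0).fit(X)
prototypes = km.cluster_centers_

# Step 2: run spectral clustering on the prototypes only, which is
# cheap because the affinity matrix is n_prototypes x n_prototypes.
sc = SpectralClustering(n_clusters=4, affinity="rbf", random_state=0)
proto_labels = sc.fit_predict(prototypes)

# Propagate each prototype's cluster label back to the original points.
labels = proto_labels[km.labels_]
```

Because the eigendecomposition is performed on a 100 x 100 affinity matrix rather than a 5000 x 5000 one, the dominant costs of SC (memory and cubic-time eigensolving) shrink dramatically, which is the point of ASC.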