Compressed K-Means for Large-Scale Clustering

Large-scale clustering underpins many applications and has therefore received considerable attention. Most existing clustering methods incur high computation and memory costs when applied to large-scale datasets. In this paper, we propose a novel clustering method, dubbed compressed k-means (CKM), for fast large-scale clustering. Specifically, high-dimensional data are compressed into short binary codes, which are well suited to fast clustering. CKM offers two key benefits: 1) storage is significantly reduced by representing data points as binary codes, and 2) distance computation is highly efficient using the Hamming metric between binary codes. We propose to jointly learn the binary codes and the clusters within a single framework. Extensive experiments on four large-scale datasets, including two million-scale datasets, demonstrate that CKM outperforms state-of-the-art large-scale clustering methods in both computation and memory cost while achieving comparable clustering accuracy.
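For intuition, the sketch below shows the clustering step once binary codes are available: Hamming distances are computed by XOR-ing codes against cluster centers, and each center is refit by a per-bit majority vote over its assigned points. This is a minimal illustrative sketch under assumed conventions, not the paper's method; CKM learns codes and clusters jointly, whereas here the codes are taken as given, and the function name and majority-vote update are hypothetical choices.

```python
# Minimal k-means in Hamming space over binary codes (illustrative sketch).
# Assumes the binary codes were already produced by some binary embedding;
# the joint code/cluster learning of CKM is not reproduced here.
import numpy as np

def hamming_kmeans(codes, k, n_iter=20, seed=0):
    """Cluster n binary codes of b bits, given as an (n, b) 0/1 uint8 array."""
    rng = np.random.default_rng(seed)
    n, b = codes.shape
    # Initialize centers with k distinct codes chosen at random.
    centers = codes[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Hamming distance = number of differing bits (XOR, then sum over bits).
        dists = (codes[:, None, :] ^ centers[None, :, :]).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Refit each center by a bitwise majority vote of its members
        # (an assumed update rule; it minimizes total Hamming distance per bit).
        for j in range(k):
            members = codes[assign == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(np.uint8)
    return assign, centers

# Usage example: cluster 10,000 random 64-bit codes into 10 groups.
codes = (np.random.default_rng(1).random((10000, 64)) < 0.5).astype(np.uint8)
labels, centers = hamming_kmeans(codes, k=10)
```

A production implementation would pack the bits into machine words and compute distances with XOR plus popcount, which is where the speed and memory savings described in the abstract come from; the unpacked 0/1 representation above is used only for clarity.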
