Compressed K-Means for Large-Scale Clustering

Large-scale clustering underpins many applications and has therefore received considerable attention. Most existing clustering methods incur high computation and memory costs when applied to large-scale datasets. In this paper, we propose a novel clustering method, dubbed compressed k-means (CKM), for fast large-scale clustering. Specifically, high-dimensional data are compressed into short binary codes, which are well suited to fast clustering. CKM offers two key benefits: 1) storage is significantly reduced by representing data points as binary codes, and 2) distance computation is highly efficient using the Hamming metric between binary codes. We propose to jointly learn the binary codes and the clusters within a single framework. Extensive experiments on four large-scale datasets, including two million-scale datasets, demonstrate that CKM outperforms state-of-the-art large-scale clustering methods in both computation and memory cost while achieving comparable clustering accuracy.
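For intuition, the sketch below shows the clustering step once binary codes are available: Hamming distances are computed by XOR-ing codes against cluster centers, and each center is refit by a per-bit majority vote over its assigned points. This is a minimal illustrative sketch under assumed conventions, not the paper's method; CKM learns codes and clusters jointly, whereas here the codes are taken as given, and the function name and majority-vote update are hypothetical choices.

```python
# Minimal k-means in Hamming space over binary codes (illustrative sketch).
# Assumes the binary codes were already produced by some binary embedding;
# the joint code/cluster learning of CKM is not reproduced here.
import numpy as np

def hamming_kmeans(codes, k, n_iter=20, seed=0):
    """Cluster n binary codes of b bits, given as an (n, b) 0/1 uint8 array."""
    rng = np.random.default_rng(seed)
    n, b = codes.shape
    # Initialize centers with k distinct codes chosen at random.
    centers = codes[rng.choice(n, size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Hamming distance = number of differing bits (XOR, then sum over bits).
        dists = (codes[:, None, :] ^ centers[None, :, :]).sum(axis=2)
        assign = dists.argmin(axis=1)
        # Refit each center by a bitwise majority vote of its members
        # (an assumed update rule; it minimizes total Hamming distance per bit).
        for j in range(k):
            members = codes[assign == j]
            if len(members):
                centers[j] = (members.mean(axis=0) >= 0.5).astype(np.uint8)
    return assign, centers

# Usage example: cluster 10,000 random 64-bit codes into 10 groups.
codes = (np.random.default_rng(1).random((10000, 64)) < 0.5).astype(np.uint8)
labels, centers = hamming_kmeans(codes, k=10)
```

A production implementation would pack the bits into machine words and compute distances with XOR plus popcount, which is where the speed and memory savings described in the abstract come from; the unpacked 0/1 representation above is used only for clarity.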
