Too Much Information Kills Information: A Clustering Perspective

Clustering is one of the most fundamental tools in artificial intelligence, particularly in pattern recognition and learning theory. In this paper, we propose a simple but novel approach to variance-based k-clustering tasks, a family that includes the widely known k-means clustering. The proposed approach picks a sampling subset of the given dataset and makes its decisions based only on the information in that subset. Under certain assumptions, the resulting clustering provably estimates the optimum of the variance-based objective with high probability. Extensive experiments on synthetic and real-world datasets show that, to obtain results competitive with the k-means method (Lloyd 1982) and the k-means++ method (Arthur and Vassilvitskii 2007), we need only 7% of the dataset. With up to 15% of the dataset, our algorithm outperforms both the k-means method and the k-means++ method in at least 80% of the clustering tasks, in terms of clustering quality. In addition, an extended algorithm based on the same idea guarantees a balanced k-clustering result.
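
As a rough illustration of the sampling idea described above, the following Python sketch clusters a uniform random subset of the data and then evaluates the variance-based (sum-of-squared-distances) objective on the full dataset. This is a minimal sketch under our own assumptions, not the authors' algorithm: the function name `sample_based_kmeans`, the parameter `sample_frac`, the uniform sampling scheme, and the use of scikit-learn's KMeans on the sample are all illustrative choices.

```python
# Hedged sketch (not the paper's method): make all clustering decisions on a
# small uniform sample, then score the resulting centers on the full dataset.
import numpy as np
from sklearn.cluster import KMeans

def sample_based_kmeans(X, k, sample_frac=0.07, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Sample size, e.g. 7% of the data, but never fewer than k points.
    m = max(k, int(sample_frac * n))
    idx = rng.choice(n, size=m, replace=False)
    # Decisions are made from the sample only.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    centers = km.cluster_centers_
    # Assign every point to its nearest sample-derived center and compute the
    # variance-based objective: total squared distance to assigned centers.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    cost = d2[np.arange(n), labels].sum()
    return labels, centers, cost
```

On a dataset X of shape (n, d), the returned labels, centers, and cost mirror what running k-means on all of X would produce, except that every center was chosen from the sample alone; the dense pairwise-distance matrix is fine for a sketch but would be computed in chunks for very large n.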

[1] Vijay V. Vazirani et al. Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation. JACM, 2001.

[2] Silvio Lattanzi et al. A Better k-means++ Algorithm via Local Search. ICML, 2019.

[3] Vincent Cohen-Addad et al. Approximation Schemes for Capacitated Clustering in Doubling Metrics. SODA, 2018.

[4] Václav Rozhoň et al. K-means++: Few More Steps Yield Constant Approximation. ICML, 2020.

[5] Yi Yang et al. Balanced Clustering via Exclusive Lasso: A Pragmatic Approach. AAAI, 2018.

[6] Xuelong Li et al. Balanced Clustering with Least Square Regression. AAAI, 2017.

[7] Ravindra K. Ahuja et al. Network Flows: Theory, Algorithms, and Applications. 1993.

[8] Jason Li et al. On the Fixed-Parameter Tractability of Capacitated Clustering. ICALP, 2022.

[9] Sergei Vassilvitskii et al. Scalable K-Means++. Proc. VLDB Endow., 2012.

[10] Andreas Krause et al. Approximate K-Means++ in Sublinear Time. AAAI, 2016.

[11] Zhu He et al. Balanced Clustering: A Uniform Model and Fast Algorithm. IJCAI, 2019.

[12] S. P. Lloyd. Least squares quantization in PCM. IEEE Trans. Inf. Theory, 1982.

[13] Rolf H. Möhring et al. A constant FPT approximation algorithm for hard-capacitated k-means. 2019.

[14] M. Inaba et al. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. SoCG, 1994.

[15] Sergei Vassilvitskii et al. k-means++: the advantages of careful seeding. SODA, 2007.

[16] Philip S. Yu et al. Top 10 algorithms in data mining. Knowledge and Information Systems, 2007.

[17] Ravishankar Krishnaswamy et al. The Hardness of Approximation of Euclidean k-Means. SoCG, 2015.