Robust k-means: A Theoretical Revisit

In recent years, many variants of the quadratic k-means clustering procedure have been proposed, all aiming to robustify the algorithm's performance in the presence of outliers. Broadly, two main approaches have been developed: one based on penalized regularization methods, and one based on trimming functions. In this work, we present a theoretical analysis of the robustness and consistency properties of a variant of the classical quadratic k-means algorithm, robust k-means, which borrows ideas from outlier detection in regression. We show that two outliers in a dataset are enough to break down this clustering procedure. However, if we restrict attention to "well-structured" datasets, then robust k-means can recover the underlying cluster structure in spite of the outliers. Finally, we show that, with slight modifications, the most general non-asymptotic consistency results for quadratic k-means remain valid for this robust variant.
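To make the penalized approach concrete, the following is a minimal sketch (not the paper's exact algorithm) of a penalized robust k-means in the spirit of outlier detection in regression: each point gets an outlier vector o_i, penalized by a group-lasso term, and the objective is minimized by alternating assignment, group soft-thresholding, and center updates. The function name, the deterministic initialization, and the choice of the group-lasso penalty are illustrative assumptions.

```python
import numpy as np

def robust_kmeans(X, k, lam, n_iter=50):
    """Illustrative sketch of penalized robust k-means (hypothetical helper).

    Alternately minimizes
        sum_i ||x_i - o_i - c_{a(i)}||^2 + lam * sum_i ||o_i||_2
    over assignments a, centers c, and per-point outlier vectors o_i.
    A point with o_i != 0 is flagged as an outlier.
    """
    centers = X[:k].astype(float).copy()  # naive deterministic init, for the sketch only
    O = np.zeros_like(X, dtype=float)
    for _ in range(n_iter):
        # 1) Assign each "cleaned" point x_i - o_i to its nearest center.
        R = X - O
        d2 = ((R[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # 2) Update outlier vectors by group soft-thresholding of the residuals:
        #    o_i = max(1 - lam / (2 ||e_i||), 0) * e_i,  with  e_i = x_i - c_{a(i)}.
        E = X - centers[labels]
        norms = np.linalg.norm(E, axis=1, keepdims=True)
        O = np.maximum(1.0 - lam / (2.0 * np.maximum(norms, 1e-12)), 0.0) * E
        # 3) Update each center as the mean of its cleaned points.
        R = X - O
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = R[mask].mean(0)
    return centers, labels, O
```

Points whose residual norm falls below lam/2 get o_i = 0 and are treated as inliers; a gross outlier absorbs most of its own residual into o_i, so it barely drags the centers, which is the robustification mechanism the abstract refers to.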
