Distributed k-Clustering for Data with Heavy Noise

In this paper, we consider the $k$-center/median/means clustering problems with outliers (the $(k, z)$-center/median/means problems) in the distributed setting. In most previous distributed algorithms, the communication cost depends linearly on $z$, the number of outliers. Recently, Guha et al. [10] overcame this dependence by considering bi-criteria approximation algorithms that output solutions with $2z$ outliers. When $z$ is large, however, discarding an extra $z$ points may be too wasteful, since gathering the data can be costly. We improve the number of outliers to the best possible $(1+\epsilon)z$, while maintaining the $O(1)$-approximation ratio and keeping the communication cost independent of $z$. The problems we consider include the $(k, z)$-center problem and the $(k, z)$-median/means problems in Euclidean metrics. An implementation of our algorithm for $(k, z)$-center shows that it outperforms many previous algorithms in terms of both communication cost and quality of the output solution.
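To make the objective concrete, the following is a minimal sketch of how the $(k, z)$-center cost of a given set of centers could be evaluated, together with the bi-criteria relaxation that allows $(1+\epsilon)z$ outliers. It is not the paper's distributed algorithm. The function names `kz_center_cost` and `bicriteria_cost`, the sample points, and the choice to round $(1+\epsilon)z$ down are assumptions made purely for illustration.

```python
import math

def kz_center_cost(points, centers, z):
    """Largest distance from a point to its nearest center,
    after discarding the z points farthest from the centers."""
    dists = sorted(min(math.dist(p, c) for c in centers) for p in points)
    z = max(0, min(z, len(dists)))      # clamp so the slice below is well defined
    kept = dists[:len(dists) - z]       # drop the z largest distances
    return kept[-1] if kept else 0.0

def bicriteria_cost(points, centers, z, eps):
    """Same cost, but with the relaxed outlier budget (1 + eps) * z.
    Rounding the budget down is an illustrative assumption."""
    allowed = math.floor((1 + eps) * z)
    return kz_center_cost(points, centers, allowed)

# Hypothetical toy data: two tight groups and one far-away noisy point.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (100.0, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]

cost_no_outliers = kz_center_cost(points, centers, 0)   # noisy point dominates
cost_one_outlier = kz_center_cost(points, centers, 1)   # noisy point discarded
```

In this toy instance, allowing a single outlier removes the far-away point from the objective, which is why the outlier budget matters so much for heavily noisy data.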

[1] Samir Khuller, et al. Algorithms for facility location problems with outliers, 2001, SODA '01.

[2] S. P. Lloyd, et al. Least squares quantization in PCM, 1982, IEEE Trans. Inf. Theory.

[3] Sudipto Guha, et al. Clustering Data Streams: Theory and Practice, 2003, IEEE Trans. Knowl. Data Eng.

[4] Yu Liu, et al. K-Means Clustering with Distributed Dimensions, 2016, ICML.

[5] David B. Shmoys, et al. A Best Possible Heuristic for the k-Center Problem, 1985, Math. Oper. Res.

[6] Qin Zhang, et al. A Practical Algorithm for Distributed Clustering and Outlier Detection, 2018, NeurIPS.

[7] Ola Svensson, et al. Better Guarantees for k-Means and Euclidean k-Median by Primal-Dual Algorithms, 2016, 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS).

[8] Max A. Little, et al. Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson's disease progression, 2010, 2010 IEEE International Conference on Acoustics, Speech and Signal Processing.

[9] David P. Woodruff, et al. Communication-Optimal Distributed Clustering, 2016, NIPS.

[10] Gustavo Malkomes, et al. Fast Distributed k-Center Clustering with Outliers on Massive Data, 2015, NIPS.

[11] Benjamin Moseley, et al. Fast and Better Distributed MapReduce Algorithms for k-Center Clustering, 2015, SPAA.

[12] Aravind Srinivasan, et al. An Improved Approximation for k-Median and Positive Correlation in Budgeted Optimization, 2014, SODA.

[13] Sudipto Guha, et al. Distributed Partial Clustering, 2017, SPAA.

[14] Ke Chen, et al. A constant factor approximation algorithm for k-median clustering with outliers, 2008, SODA '08.

[15] Aristides Gionis, et al. k-means-: A Unified Approach to Clustering and Outlier Detection, 2013, SDM.

[16] Barbara M. Anthony, et al. A Plant Location Guide for the Unsure: Approximation Algorithms for Min-Max Location Problems, 2009.

[17] Benjamin Moseley, et al. Fast clustering using MapReduce, 2011, KDD.

[18] Avi Wigderson, et al. The Randomized Communication Complexity of Set Disjointness, 2007, Theory Comput.

[19] Shi Li, et al. Constant approximation for k-median and k-means with outliers via iterative rounding, 2017, STOC.

[20] Maria-Florina Balcan, et al. Distributed k-means and k-median clustering on general communication topologies, 2013, NIPS.