Faster Balanced Clusterings in High Dimension

The problem of constrained clustering has attracted significant attention in the past decades. In this paper, we study the balanced $k$-center, $k$-median, and $k$-means clustering problems, in which the size of each cluster is constrained by given lower and upper bounds. These problems are motivated by applications in processing large-scale, high-dimensional data. Existing methods often need to compute complicated matchings (or min-cost flows) to satisfy the balance constraint, and thus suffer from high complexity, especially in high dimensions. To address this issue, we develop an effective framework for the three balanced clustering problems, based on a novel geometric spatial-partition idea. For balanced $k$-center clustering, we provide a $4$-approximation algorithm that improves on the existing approximation factors; for balanced $k$-median and $k$-means clustering, our algorithms yield constant and $(1+\epsilon)$-approximation factors for any $\epsilon>0$. More importantly, our algorithms run in linear or nearly linear time when $k$ is a constant, significantly improving on existing running times. Our results extend easily to metric balanced clustering, where the running times are sublinear in the complexity of an $n$-point metric.
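
To make the three objectives and the balance constraint concrete, the following is a minimal sketch that only evaluates a given clustering; it is not the paper's algorithm. The function name `balanced_costs`, its parameters, and the choice of one shared lower and upper bound for all clusters are assumptions made for illustration.

```python
import math
from collections import Counter


def balanced_costs(points, centers, assignment, lower, upper):
    """Evaluate a clustering under the three balanced objectives.

    points     : list of coordinate tuples
    centers    : list of k coordinate tuples
    assignment : assignment[i] is the index of the center serving points[i]
    lower/upper: size bounds applied to every cluster (illustrative assumption)

    Returns (k_center_cost, k_median_cost, k_means_cost).
    Raises ValueError if any cluster size falls outside [lower, upper].
    """
    sizes = Counter(assignment)
    for j in range(len(centers)):
        if not lower <= sizes[j] <= upper:
            raise ValueError(
                f"cluster {j} has size {sizes[j]}, outside [{lower}, {upper}]"
            )

    # Euclidean distance from each point to its assigned center.
    dists = [math.dist(p, centers[j]) for p, j in zip(points, assignment)]

    k_center = max(dists)                # largest point-to-center distance
    k_median = sum(dists)                # sum of distances
    k_means = sum(d * d for d in dists)  # sum of squared distances
    return k_center, k_median, k_means
```

For example, with points `(0,0), (1,0), (4,0), (5,0)`, centers `(0,0)` and `(5,0)`, assignment `[0, 0, 1, 1]`, and both bounds set to 2, the call returns `(1.0, 2.0, 2.0)`. Changing the assignment to `[0, 0, 0, 1]` violates the bounds and raises `ValueError`, which is the situation that forces the matching or flow computations mentioned above.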

[1]  Hu Ding,et al.  Balanced k-Center Clustering When k Is A Constant , 2017, CCCG.

[2]  Tomasz Kociumaka,et al.  Constant Factor Approximation for Capacitated k-Center with Outliers , 2014, STACS.

[3]  Ragesh Jaiswal,et al.  Improved analysis of D2-sampling based PTAS for k-means and other clustering problems , 2015, Inf. Process. Lett..

[4]  Samir Khuller,et al.  LP Rounding for k-Centers with Non-uniform Hard Capacities , 2012, 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science.

[5]  Sariel Har-Peled,et al.  Fast Clustering with Lower Bounds: No Customer too Far, No Shop too Small , 2013, ArXiv.

[6]  Aditya Bhaskara,et al.  Distributed Balanced Clustering via Mapping Coresets , 2014, NIPS.

[7]  Jian Li,et al.  Capacitated Center Problems with Two-Sided Bounds and Outliers , 2017, WADS.

[8]  Meena Mahajan,et al.  The Planar k-means Problem is NP-hard I , 2009 .

[9]  Teofilo F. GONZALEZ,et al.  Clustering to Minimize the Maximum Intercluster Distance , 1985, Theor. Comput. Sci..

[10]  Amit Kumar,et al.  A Simple D2-Sampling Based PTAS for k-Means and Other Clustering Problems , 2012, Algorithmica.

[11]  A. Manne Plant Location Under Economies-of-Scale---Decentralization and Computation , 1964 .

[12]  Kenneth L. Clarkson,et al.  Smaller core-sets for balls , 2003, SODA '03.

[13]  Samir Khuller,et al.  Achieving anonymity via clustering , 2006, PODS '06.

[14]  S. Dasgupta The hardness of k-means clustering , 2008 .

[15]  Herbert Edelsbrunner,et al.  Cutting dense point sets in half , 1994, SCG '94.

[16]  Pasi Fränti,et al.  Balanced K-Means for Clustering , 2014, S+SSPR.

[17]  Ke Chen,et al.  On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications , 2009, SIAM J. Comput..

[18]  Shi Li,et al.  On Uniform Capacitated k-Median Beyond the Natural LP Relaxation , 2014, SODA.

[19]  Philip N. Klein,et al.  Local Search Yields Approximation Schemes for k-Means and k-Median in Euclidean and Minor-Free Metrics , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[20]  Vahab S. Mirrokni,et al.  Distributed Balanced Partitioning via Linear Embedding , 2015, WSDM.

[21]  Amit Kumar,et al.  A Simple D 2-Sampling Based PTAS for k-Means and other Clustering Problems , 2012, COCOON.

[22]  Joydeep Ghosh,et al.  Scalable Clustering Algorithms with Balancing Constraints , 2006, Data Mining and Knowledge Discovery.

[23]  Aravind Srinivasan,et al.  An Improved Approximation for k-Median and Positive Correlation in Budgeted Optimization , 2014, SODA.

[24]  Anil K. Jain Data clustering: 50 years beyond K-means , 2008, Pattern Recognit. Lett..

[25]  Alexander J. Smola,et al.  Data Driven Resource Allocation for Distributed Learning , 2015, AISTATS.

[26]  Léon Bottou,et al.  Local Algorithms for Pattern Recognition and Dependencies Estimation , 1993, Neural Computation.

[27]  Chaitanya Swamy,et al.  Approximation Algorithms for Clustering Problems with Lower Bounds and Outliers , 2016, ICALP.

[28]  Melanie Schmidt,et al.  Privacy preserving clustering with constraints , 2018, ICALP.

[29]  Ravishankar Krishnaswamy,et al.  The Hardness of Approximation of Euclidean k-Means , 2015, SoCG.

[30]  Vijay V. Vazirani,et al.  Approximation algorithms for metric facility location and k-Median problems using the primal-dual schema and Lagrangian relaxation , 2001, JACM.

[31]  Samir Khuller,et al.  The Capacitated K-Center Problem , 2000, SIAM J. Discret. Math..

[32]  Jinhui Xu,et al.  A Unified Framework for Clustering Constrained Data without Locality Property , 2015, SODA.

[33]  Shi Li,et al.  Approximating k-Median via Pseudo-Approximation , 2016, SIAM J. Comput..

[34]  Chaitanya Swamy,et al.  Improved Approximation Guarantees for Lower-Bounded Facility Location , 2011, WAOA.

[35]  Amit Kumar,et al.  Faster Algorithms for the Constrained k-means Problem , 2015, Theory of Computing Systems.

[36]  Satish Rao,et al.  A Nearly Linear-Time Approximation Scheme for the Euclidean k-Median Problem , 2007, SIAM J. Comput..

[37]  Piotr Indyk,et al.  Sublinear time algorithms for metric space problems , 1999, STOC '99.

[38]  Venkatesan Guruswami,et al.  Embeddings and non-approximability of geometric problems , 2003, SODA '03.

[39]  Sudipto Guha,et al.  Improved Combinatorial Algorithms for Facility Location Problems , 2005, SIAM J. Comput..

[40]  Kamesh Munagala,et al.  Local Search Heuristics for k-Median and Facility Location Problems , 2004, SIAM J. Comput..

[41]  Shi Li,et al.  Approximating k-median via pseudo-approximation , 2012, STOC '13.

[42]  Nimrod Megiddo,et al.  On the Complexity of Some Common Geometric Location Problems , 1984, SIAM J. Comput..

[43]  Zhu He,et al.  Balanced Clustering: A Uniform Model and Fast Algorithm , 2019, IJCAI.

[44]  Judit Bar-Ilan,et al.  How to Allocate Network Centers , 1993, J. Algorithms.

[45]  Piotr Indyk,et al.  Approximate clustering via core-sets , 2002, STOC '02.

[46]  Amit Kumar,et al.  Linear-time approximation schemes for clustering problems in any dimensions , 2010, JACM.

[47]  Michael Langberg,et al.  A unified framework for approximating and clustering data , 2011, STOC.

[48]  Mohammad R. Salavatipour,et al.  Local Search Yields a PTAS for k-Means in Doubling Metrics , 2016, 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS).

[49]  David M. Mount,et al.  A local search approximation algorithm for k-means clustering , 2002, SCG '02.

[50]  Satish Rao,et al.  Approximation schemes for Euclidean k-medians and related problems , 1998, STOC '98.

[51]  Vahab Mirrokni,et al.  Streaming Balanced Clustering , 2019, ArXiv.

[52]  James B. Orlin,et al.  Max flows in O(nm) time, or better , 2013, STOC '13.

[53]  Alfred A. Kuehn,et al.  A Heuristic Program for Locating Warehouses , 1963 .

[54]  Piotr Indyk,et al.  Geometric matching under noise: combinatorial bounds and algorithms , 1999, SODA '99.

[55]  Peter Gritzmann,et al.  An LP-based k-means algorithm for balancing weighted point sets , 2017, Eur. J. Oper. Res..

[56]  David B. Shmoys,et al.  A Best Possible Heuristic for the k-Center Problem , 1985, Math. Oper. Res..