The classical center-based clustering problems such as k-means/median/center assume that the optimal clusters satisfy a locality property: points in the same cluster are close to each other. However, a number of clustering problems arising in machine learning have optimal clusters that do not satisfy this property. For instance, in the r-gather clustering problem each cluster must contain at least r points, and in the capacitated clustering problem there is an upper bound on the cluster sizes. We consider a variant of the k-means problem that may be regarded as a general version of such problems. Here, the optimal clusters $O_1, \ldots, O_k$ form an arbitrary partition of the dataset and the goal is to output k centers $c_1, \ldots, c_k$ such that the objective function $\sum_{i=1}^{k} \sum_{x \in O_i} ||x - c_i||^2$ is minimized. It is not difficult to argue that no algorithm (without knowing the optimal clusters) that outputs a single set of k centers can perform well with respect to this objective. However, this does not rule out algorithms that output a list of sets of k centers such that at least one set in the list performs well. Given an error parameter ε > 0, let ℓ denote the size of the smallest list of k-center sets such that at least one of them gives a (1 + ε)-approximation with respect to the objective above. In this paper, we show an upper bound on ℓ by giving a randomized algorithm that outputs a list of $2^{\tilde{O}(k/\varepsilon)}$ sets of k centers. We also give a closely matching lower bound of $2^{\tilde{\Omega}(k/\sqrt{\varepsilon})}$. Moreover, our algorithm runs in time $O(nd \cdot 2^{\tilde{O}(k/\varepsilon)})$.
This is a significant improvement over the previous result of Ding and Xu (2015), who gave an algorithm with running time $O(nd \cdot (\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$ that outputs a list of size $O((\log n)^k \cdot 2^{\mathrm{poly}(k/\varepsilon)})$. Our techniques generalize to the k-median problem and to many other settings involving non-Euclidean distance measures.
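The objective above can be made concrete with a short sketch (function names are hypothetical, for illustration only). The key point it demonstrates: the partition is given arbitrarily rather than induced by nearest centers, and once a cluster is fixed, its optimal center under the squared-Euclidean cost is simply its centroid.

```python
import numpy as np

def constrained_kmeans_cost(points, partition, centers):
    # Evaluates sum_i sum_{x in O_i} ||x - c_i||^2 for a GIVEN partition.
    # Unlike classical k-means, points are charged to their assigned
    # cluster's center, not to the nearest center.
    return sum(
        float(np.sum((points[idx] - c) ** 2))
        for idx, c in zip(partition, centers)
    )

def optimal_centers(points, partition):
    # For a fixed cluster, the mean minimizes the sum of squared
    # Euclidean distances, so the centroid is the optimal center.
    return [points[idx].mean(axis=0) for idx in partition]

# Toy example: two clusters of two points each on a line.
points = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
partition = [[0, 1], [2, 3]]
centers = optimal_centers(points, partition)   # centroids (1,0) and (11,0)
cost = constrained_kmeans_cost(points, partition, centers)  # 1+1+1+1 = 4
```

The algorithmic difficulty the paper addresses is that the partition is unknown; the sketch only shows that once some candidate k centers are fixed, evaluating (and optimizing centers for) a known partition is easy.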
[1] Ravishankar Krishnaswamy et al. The Hardness of Approximation of Euclidean k-Means. SoCG, 2015.
[2] Jiří Matoušek. On Approximate Geometric k-Clustering. Discret. Comput. Geom., 2000.
[3] Jinhui Xu et al. A Unified Framework for Clustering Constrained Data without Locality Property. SODA, 2015.
[4] Mary Inaba et al. Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering (extended abstract). SCG '94, 1994.
[5] Dan Feldman et al. A PTAS for k-means clustering based on weak coresets. SCG '07, 2007.
[6] Ke Chen et al. On k-Median clustering in high dimensions. SODA '06, 2006.
[7] David M. Mount et al. A local search approximation algorithm for k-means clustering. SCG '02, 2002.
[8] Marcel R. Ackermann et al. Clustering for metric and non-metric distance measures. SODA '08, 2008.
[9] Ragesh Jaiswal et al. Improved analysis of D2-sampling based PTAS for k-means and other clustering problems. Inf. Process. Lett., 2015.
[10] Jiří Matoušek. On Approximate Geometric k-Clustering. 1999.
[11] Amit Kumar et al. A Simple D2-Sampling Based PTAS for k-Means and Other Clustering Problems. Algorithmica, 2012.
[12] Marek Karpinski et al. Approximation schemes for clustering problems. STOC '03, 2003.
[13] Andrea Vattani. The hardness of k-means clustering in the plane. 2010.
[14] Amit Kumar et al. A Simple D2-Sampling Based PTAS for k-Means and Other Clustering Problems. COCOON, 2012.
[15] Meena Mahajan et al. The Planar k-means Problem is NP-hard. 2009.
[16] Sariel Har-Peled et al. On coresets for k-means and k-median clustering. STOC '04, 2004.
[17] Piotr Indyk et al. Approximate clustering via core-sets. STOC '02, 2002.
[18] Amit Kumar et al. Linear-time approximation schemes for clustering problems in any dimensions. JACM, 2010.
[19] S. Dasgupta. The hardness of k-means clustering. 2008.
[20] Sergei Vassilvitskii et al. k-means++: the advantages of careful seeding. SODA '07, 2007.