Clustering for metric and non-metric distance measures

We study a generalization of the <i>k</i>-median problem with respect to an arbitrary dissimilarity measure D. Given a finite set <i>P</i>, our goal is to find a set <i>C</i> of size <i>k</i> such that the sum of errors D(<i>P, C</i>) = Σ<i><sub>p∈P</sub></i> min<i><sub>c∈C</sub></i>{D(<i>p, c</i>)} is minimized. The main result in this paper can be stated as follows: There exists an <i>O</i>(<i>n</i>2<sup>(<i>k</i>/ε)<sup><i>O</i>(1)</sup></sup>) time (1 + ε)-approximation algorithm for the <i>k</i>-median problem with respect to D, if the 1-median problem can be approximated within a factor of (1 + ε) by taking a random sample of constant size and solving the 1-median problem on the sample exactly. Using this characterization, we obtain the first linear time (1 + ε)-approximation algorithms for the <i>k</i>-median problem in an arbitrary metric space with bounded doubling dimension, for the Kullback-Leibler divergence (relative entropy), for Mahalanobis distances, and for some special cases of Bregman divergences. Moreover, we obtain previously known results for the Euclidean <i>k</i>-median problem and the Euclidean <i>k</i>-means problem in a simplified manner. Our results are based on a new analysis of an algorithm from [20].
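
As an illustration of the objective and of the sampling condition stated above, the following is a minimal Python sketch; it is not the algorithm analyzed in the paper. It assumes that D is the squared Euclidean distance (the k-means case), for which the exact 1-median of a sample is its centroid. The function names, the toy point set, and the sample size are hypothetical choices made only for this example.

```python
import random


def sq_euclidean(p, c):
    """Squared Euclidean distance, used here as the dissimilarity measure D (assumption)."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def clustering_cost(points, centers, dist=sq_euclidean):
    """D(P, C): sum over all points of the dissimilarity to the nearest center."""
    return sum(min(dist(p, c) for c in centers) for p in points)


def centroid(points):
    """Exact 1-median under squared Euclidean distance: the coordinate-wise mean."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sampled_one_median(points, sample_size, rng):
    """Draw a uniform sample with replacement and solve the 1-median problem on it exactly."""
    sample = [rng.choice(points) for _ in range(sample_size)]
    return centroid(sample)


# Toy data (hypothetical): the four corners of a square.
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]

exact_center = centroid(points)
exact_cost = clustering_cost(points, [exact_center])

rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_center = sampled_one_median(points, sample_size=3, rng=rng)
sampled_cost = clustering_cost(points, [sampled_center])

print("exact 1-median cost:  ", exact_cost)
print("sampled 1-median cost:", sampled_cost)
```

The sampled cost can never fall below the exact cost, because the centroid of the full set minimizes the sum of squared distances. The condition in the abstract asks that a constant-size sample bring the ratio of the two costs within (1 + ε); this sketch only makes that ratio observable and proves nothing about it.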

[1]  Amit Kumar,et al.  A simple linear time (1 + ε)-approximation algorithm for geometric k-means clustering in any dimensions , 2004 .

[2]  Andrew McCallum,et al.  Distributional clustering of words for text classification , 1998, SIGIR '98.

[3]  S. P. Lloyd,et al.  Least squares quantization in PCM , 1982, IEEE Trans. Inf. Theory.

[4]  Evangelos Markakis,et al.  Greedy facility location algorithms analyzed using dual fitting with factor-revealing LP , 2002, JACM.

[5]  L. Bregman.  The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming , 1967 .

[6]  Chandrajit L. Bajaj,et al.  The algebraic degree of geometric optimization problems , 1988, Discret. Comput. Geom.

[7]  Mikkel Thorup.  Quick k-Median, k-Center, and Facility Location for Sparse Graphs , 2001, ICALP.

[8]  Jirí Matousek,et al.  On Approximate Geometric k-Clustering , 2000, Discret. Comput. Geom.

[9]  Amit Kumar,et al.  A simple linear time (1 + ε)-approximation algorithm for k-means clustering in any dimensions , 2004, 45th Annual IEEE Symposium on Foundations of Computer Science.

[10]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[11]  Arindam Banerjee,et al.  Approximation Algorithms for Bregman Clustering Co-clustering and Tensor Clustering , 2008 .

[12]  C. Greg Plaxton,et al.  Optimal Time Bounds for Approximate Clustering , 2002, Machine Learning.

[13]  Robert M. Gray,et al.  Speech coding based upon vector quantization , 1980, ICASSP.

[14]  Inderjit S. Dhillon,et al.  Clustering with Bregman Divergences , 2005, J. Mach. Learn. Res.

[15]  Inderjit S. Dhillon,et al.  A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification , 2003, J. Mach. Learn. Res.

[16]  Y. Censor,et al.  Parallel Optimization: Theory, Algorithms, and Applications , 1997 .

[17]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[18]  Ke Chen,et al.  On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications , 2009, SIAM J. Comput.

[19]  Thomas M. Cover,et al.  Elements of information theory (2. ed.) , 2006 .

[20]  Marek Karpinski,et al.  Approximation schemes for clustering problems , 2003, STOC '03.

[21]  Rafail Ostrovsky,et al.  The Effectiveness of Lloyd-Type Methods for the k-Means Problem , 2006, FOCS.

[22]  Mary Inaba,et al.  Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract) , 1994, SCG '94.

[23]  P. Mahalanobis.  On the generalized distance in statistics , 1936 .

[24]  Suvrit Sra,et al.  Approximation Algorithms for Tensor Clustering , 2009, ALT.

[25]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[26]  Rui Xu,et al.  Survey of clustering algorithms , 2005, IEEE Transactions on Neural Networks.

[27]  Dan Feldman,et al.  A PTAS for k-means clustering based on weak coresets , 2007, SCG '07.

[28]  Ke Chen,et al.  On k-Median clustering in high dimensions , 2006, SODA '06.

[29]  Robert Krauthgamer,et al.  Bounded geometries, fractals, and low-distortion embeddings , 2003, 44th Annual IEEE Symposium on Foundations of Computer Science.

[30]  Amit Kumar,et al.  Linear Time Algorithms for Clustering Problems in Any Dimensions , 2005, ICALP.

[31]  Amin Saberi,et al.  A new greedy approach for facility location problems , 2002, STOC '02.

[32]  Andrew McGregor,et al.  Finding Metric Structure in Information Theoretic Clustering , 2008, COLT.

[33]  Richard Nock,et al.  Mixed Bregman Clustering with Approximation Guarantees , 2008, ECML/PKDD.

[34]  Naftali Tishby,et al.  Distributional Clustering of English Words , 1993, ACL.

[35]  D. P. Mercer,et al.  Clustering large datasets , 2003 .

[36]  Naftali Tishby,et al.  Agglomerative Information Bottleneck , 1999, NIPS.

[37]  Satish Rao,et al.  A Nearly Linear-Time Approximation Scheme for the Euclidean k-median Problem , 1999, ESA.

[38]  Sariel Har-Peled,et al.  On coresets for k-means and k-median clustering , 2004, STOC '04.

[39]  Andrzej Stachurski,et al.  Parallel Optimization: Theory, Algorithms and Applications , 2000, Parallel Distributed Comput. Pract.

[40]  R. A. Leibler,et al.  On Information and Sufficiency , 1951 .

[41]  R. Gray,et al.  Speech coding based upon vector quantization , 1980, ICASSP.

[42]  Johannes Blömer,et al.  Coresets and approximate clustering for Bregman divergences , 2009, SODA.

[43]  Piotr Indyk,et al.  Approximate clustering via core-sets , 2002, STOC '02.