Polynomial time approximation schemes for geometric k-clustering

We deal with the problem of clustering data points. Given n points in a larger set (for example, R/sup d/) endowed with a distance function (for example, L/sup 2/ distance), we would like to partition the data set into k disjoint clusters, each with a "cluster center", so as to minimize the sum over all data points of the distance between the point and the center of the cluster containing the point. The problem is provably NP-hard in some high dimensional geometric settings, even for k=2. We give polynomial time approximation schemes for this problem in several settings, including the binary cube (0, 1)/sup d/ with Hamming distance, and R/sup d/ either with L/sup 1/ distance, or with L/sup 2/ distance, or with the square of L/sup 2/ distance. In all these settings, the best previous results were constant factor approximation guarantees. We note that our problem is similar in flavor to the k-median problem (and the related facility location problem), which has been considered in graph-theoretic and fixed dimensional geometric settings, where it becomes hard when k is part of the input. In contrast, we study the problem when k is fixed, but the dimension is part of the input. Our algorithms are based on a dimension reduction construction for the Hamming cube, which may be of independent interest.

[1]  Jon M. Kleinberg,et al.  Segmentation problems , 2004, JACM.

[2]  Susan T. Dumais,et al.  Using latent semantic analysis to improve information retrieval , 1988, CHI 1988.

[3]  Samir Khuller,et al.  Greedy strikes back: improved facility location algorithms , 1998, SODA '98.

[4]  David Eppstein,et al.  Fast hierarchical clustering and other applications of dynamic closest pairs , 1999, SODA '98.

[5]  Leonard J. Schulman,et al.  Clustering for Edge-Cost Minimization , 1999, Electron. Colloquium Comput. Complex..

[6]  Vincent Kanade,et al.  Clustering Algorithms , 2021, Wireless RF Energy Transfer in the Massive IoT Era.

[7]  Alan M. Frieze,et al.  The regularity lemma and approximation schemes for dense problems , 1996, Proceedings of 37th Conference on Foundations of Computer Science.

[8]  Claire Mathieu,et al.  A Randomized Approximation Scheme for Metric MAX-CUT , 1998, FOCS.

[9]  M. Sharir,et al.  E cient Algorithms for Geometric Optimization , 1998 .

[10]  Piotr Indyk A sublinear time approximation scheme for clustering in metric spaces , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[11]  Sudipto Guha,et al.  A constant-factor approximation algorithm for the k-median problem (extended abstract) , 1999, STOC '99.

[12]  Alan M. Frieze,et al.  Clustering in large graphs and matrices , 1999, SODA '99.

[13]  Susan T. Dumais,et al.  Using Linear Algebra for Intelligent Information Retrieval , 1995, SIAM Rev..

[14]  J. Kleinberg,et al.  Authoritative Soueces in a Hyper-linked Environment , 1998, SODA 1998.

[15]  Richard M. Karp The Genomics Revolution and its Challenges for Algorithmic Research , 2001, Current Trends in Theoretical Computer Science.

[16]  Noga Alon,et al.  On Two Segmentation Problems , 1999, J. Algorithms.

[17]  David R. Karger,et al.  Scatter/Gather: a cluster-based approach to browsing large document collections , 1992, SIGIR '92.

[18]  S. T. Dumais,et al.  Using latent semantic analysis to improve access to textual information , 1988, CHI '88.

[19]  B. Ripley,et al.  Pattern Recognition , 1968, Nature.

[20]  Piotr Indyk,et al.  Approximate nearest neighbors: towards removing the curse of dimensionality , 1998, STOC '98.

[21]  Noga Alon,et al.  The Probabilistic Method , 2015, Fundamentals of Ramsey Theory.

[22]  W. B. Johnson,et al.  Extensions of Lipschitz mappings into Hilbert space , 1984 .

[23]  Pankaj K. Agarwal,et al.  Exact and Approximation Algortihms for Clustering , 1997 .

[24]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.

[25]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[26]  Sanjoy Dasgupta,et al.  Learning mixtures of Gaussians , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[27]  Leonard J. Schulman,et al.  Clustering for edge-cost minimization (extended abstract) , 2000, STOC '00.

[28]  Rafail Ostrovsky,et al.  Efficient search for approximate nearest neighbor in high dimensional spaces , 1998, STOC '98.

[29]  Micha Sharir,et al.  Efficient algorithms for geometric optimization , 1998, CSUR.

[30]  Peter Frankl,et al.  The Johnson-Lindenstrauss lemma and the sphericity of some graphs , 1987, J. Comb. Theory, Ser. B.

[31]  Susan T. Dumais,et al.  Improving the retrieval of information from external sources , 1991 .

[32]  D. Eppstein,et al.  Approximation algorithms for geometric problems , 1996 .

[33]  Alan M. Frieze,et al.  Fast Monte-Carlo algorithms for finding low-rank approximations , 1998, Proceedings 39th Annual Symposium on Foundations of Computer Science (Cat. No.98CB36280).

[34]  Sudipto Guha,et al.  Improved combinatorial algorithms for the facility location and k-median problems , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[35]  Satish Rao,et al.  Approximation schemes for Euclidean k-medians and related problems , 1998, STOC '98.

[36]  Claire Mathieu,et al.  A Randomized Approximation Scheme for Metric MAX-CUT , 2001, J. Comput. Syst. Sci..

[37]  Vijay V. Vazirani,et al.  Primal-dual approximation algorithms for metric facility location and k-median problems , 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[38]  David R. Karger,et al.  Constant interaction-time scatter/gather browsing of very large document collections , 1993, SIGIR.

[39]  Jon M. Kleinberg,et al.  Two algorithms for nearest-neighbor search in high dimensions , 1997, STOC '97.

[40]  Marek Karpinski,et al.  Polynomial time approximation schemes for dense instances of NP-hard problems , 1995, STOC '95.