Polynomial-time approximation schemes for geometric min-sum median clustering

The Johnson-Lindenstrauss lemma states that <i>n</i> points in a high-dimensional Hilbert space can be embedded with small distortion of the distances into an <i>O</i>(log <i>n</i>)-dimensional space by applying a random linear transformation. We show that similar (though weaker) properties hold for certain random linear transformations over the Hamming cube. We use these transformations to solve NP-hard clustering problems in the cube as well as in geometric settings.

More specifically, we address the following clustering problem. Given <i>n</i> points in a larger set (e.g., ℝ<sup>d</sup>) endowed with a distance function (e.g., <i>L</i><sup>2</sup> distance), we would like to partition the data set into <i>k</i> disjoint clusters, each with a "cluster center," so as to minimize the sum over all data points of the distance between the point and the center of the cluster containing the point. The problem is provably NP-hard in some high-dimensional geometric settings, even for <i>k</i> = 2. We give polynomial-time approximation schemes for this problem in several settings, including the binary cube {0,1}<sup>d</sup> with Hamming distance, and ℝ<sup>d</sup> with <i>L</i><sup>1</sup> distance, with <i>L</i><sup>2</sup> distance, or with the square of <i>L</i><sup>2</sup> distance. In all these settings, the best previous results were constant-factor approximation guarantees.

We note that our problem is similar in flavor to the <i>k</i>-median problem (and the related facility location problem), which has been considered in graph-theoretic and fixed-dimensional geometric settings, where it becomes hard when <i>k</i> is part of the input. In contrast, we study the problem when <i>k</i> is fixed, but the dimension is part of the input.
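
To make the objective concrete, here is a minimal Python sketch of the min-sum median cost in the binary cube with Hamming distance. It is not the approximation scheme described above: it enumerates every assignment of points to clusters, which takes exponential time and is only feasible for tiny inputs. The function names (`hamming`, `majority_center`, `clustering_cost`, `brute_force_optimum`) and the sample points are assumptions chosen for illustration. The sketch relies on the standard fact that, under Hamming distance, a coordinate-wise majority vote gives a center minimizing the sum of distances to a cluster.

```python
from itertools import product


def hamming(x, y):
    """Number of coordinates in which two 0/1 tuples differ."""
    return sum(a != b for a, b in zip(x, y))


def majority_center(cluster):
    """Coordinate-wise majority vote (ties broken toward 0).

    Under Hamming distance this minimizes the sum of distances
    from the center to the points of the cluster.
    """
    n = len(cluster)
    return tuple(1 if 2 * sum(col) > n else 0 for col in zip(*cluster))


def clustering_cost(points, labels, k):
    """Sum over all points of the distance to their cluster's center."""
    cost = 0
    for c in range(k):
        cluster = [p for p, lab in zip(points, labels) if lab == c]
        if cluster:  # empty clusters contribute nothing
            center = majority_center(cluster)
            cost += sum(hamming(p, center) for p in cluster)
    return cost


def brute_force_optimum(points, k):
    """Try all k^n labelings; illustration only, exponential time."""
    best = None
    for labels in product(range(k), repeat=len(points)):
        cost = clustering_cost(points, labels, k)
        if best is None or cost < best[0]:
            best = (cost, labels)
    return best


# Hypothetical sample data: five points in {0,1}^4.
points = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 1, 0), (0, 1, 0, 0)]
cost, labels = brute_force_optimum(points, k=2)
print(cost, labels)  # 3 (0, 0, 1, 1, 0)
```

On this sample the optimum groups the three points near (0,0,0,0) into one cluster and the two points near (1,1,1,1) into the other, for a total cost of 3.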
