A Unified Framework for Clustering Constrained Data Without Locality Property

In this paper, we consider a class of constrained clustering problems of points in $$\mathbb{R}^{d}$$, where d could be rather high. A common feature of these problems is that their optimal clusterings no longer have the locality property (due to the additional constraints), which is a key property required by many algorithms for their unconstrained counterparts. To overcome the difficulty caused by the loss of locality, we present in this paper a unified framework, called Peeling-and-Enclosing, to iteratively solve two variants of the constrained clustering problems, constrained k-means clustering (k-CMeans) and constrained k-median clustering (k-CMedian). Our framework generalizes Kumar et al.’s (J ACM 57(2):5, 2010) elegant k-means clustering approach from unconstrained data to constrained data, and is based on two standalone geometric techniques, called Simplex Lemma and Weaker Simplex Lemma, for k-CMeans and k-CMedian, respectively. The simplex lemma (or weaker simplex lemma) enables us to efficiently approximate the mean (or median) point of an unknown set of points by searching a small-size grid, independent of the dimensionality of the space, in a simplex (or the surrounding region of a simplex), and thus can be used to handle high dimensional data. If k and $$\frac{1}{\epsilon}$$ are fixed numbers, our framework generates, in nearly linear time (i.e., $$O(n(\log n)^{k+1}d)$$), $$O((\log n)^{k})$$ k-tuple candidates for the k mean or median points, and one of them induces a $$(1+\epsilon)$$-approximation for k-CMeans or k-CMedian, where n is the number of points. Combining this unified framework with a problem-specific selection algorithm (which determines the best k-tuple candidate), we obtain a $$(1+\epsilon)$$-approximation for each of the constrained clustering problems. Our framework improves considerably the best known results for these problems.
We expect that our technique will be applicable to other variants of k -means and k -median clustering problems without locality.
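
The grid-search idea behind the simplex lemma can be illustrated with a minimal sketch. It is not the paper's construction: it assumes the exact means of the subsets are known (the lemma itself works with approximations) and uses a plain uniform grid of barycentric weights. The observation it relies on is that the mean of a point set split into j subsets is a convex combination of the subset means, so it lies in the simplex they span. The function names `barycentric_grid`, `combine`, and `closest_grid_point`, and the resolution parameter `m`, are hypothetical and chosen for illustration only.

```python
from itertools import combinations
import math


def barycentric_grid(j, m):
    """Yield every weight vector (w_1, ..., w_j) with w_l = a_l / m,
    where the a_l are non-negative integers summing to m."""
    # Stars and bars: choose j - 1 bar positions among m + j - 1 slots.
    for bars in combinations(range(m + j - 1), j - 1):
        counts = []
        prev = -1
        for b in bars:
            counts.append(b - prev - 1)  # stars between consecutive bars
            prev = b
        counts.append(m + j - 2 - prev)  # stars after the last bar
        yield [c / m for c in counts]


def combine(vertices, weights):
    """Return the convex combination of the vertices with the given weights."""
    d = len(vertices[0])
    return [sum(w * v[i] for w, v in zip(weights, vertices)) for i in range(d)]


def closest_grid_point(vertices, target, m):
    """Return (distance, point) for the grid point of the simplex spanned
    by `vertices` that is nearest to `target` (illustration only: in the
    clustering setting the target is unknown and every grid point is kept
    as a candidate)."""
    best = None
    for w in barycentric_grid(len(vertices), m):
        p = combine(vertices, w)
        dist = math.dist(p, target)
        if best is None or dist < best[0]:
            best = (dist, p)
    return best


if __name__ == "__main__":
    # Three subset means in R^5 (assumed values for the demo).
    means = [
        [0.0, 0.0, 0.0, 0.0, 0.0],
        [4.0, 0.0, 0.0, 0.0, 0.0],
        [0.0, 4.0, 0.0, 0.0, 0.0],
    ]
    # Subset sizes 1, 1, 2 give the overall mean as a weighted combination.
    overall_mean = combine(means, [0.25, 0.25, 0.5])
    print(closest_grid_point(means, overall_mean, m=4))
```

The number of grid points in this sketch is $$\binom{m+j-1}{j-1}$$, which depends on the number of subsets j and the resolution m but not on the dimension d. This mirrors the dimension-independence claimed for the grid in the abstract, although the grid used in the paper may be defined differently.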

[1]  Vikas Singh,et al.  Ensemble clustering using semidefinite programming with applications , 2010, Machine Learning.

[2]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[3]  Venkatesan Guruswami,et al.  Embeddings and non-approximability of geometric problems , 2003, SODA '03.

[4]  Claire Cardie,et al.  Constrained K-means Clustering with Background Knowledge , 2001, Proceedings of the Eighteenth International Conference on Machine Learning, p. 577–584.

[5]  Chaitanya Swamy,et al.  Fault-tolerant facility location , 2003, SODA '03.

[6]  Christian Sohler,et al.  Probabilistic k-Median Clustering in Data Streams , 2012, Theory of Computing Systems.

[7]  Tom Coleman,et al.  A polynomial time approximation scheme for k-consensus clustering , 2010, SODA '10.

[8]  Marcel R. Ackermann,et al.  Clustering for metric and non-metric distance measures , 2008, SODA '08.

[9]  S. Dasgupta The hardness of k-means clustering , 2008 .

[10]  Jian Li,et al.  Clustering with Diversity , 2010, ICALP.

[11]  Jinhui Xu,et al.  Solving the Chromatic Cone Clustering Problem via Minimum Spanning Sphere , 2011, ICALP.

[12]  Michael Langberg,et al.  A unified framework for approximating and clustering data , 2011, STOC.

[13]  Sergei Vassilvitskii,et al.  k-means++: the advantages of careful seeding , 2007, SODA '07.

[14]  Sariel Har-Peled,et al.  Fast Clustering with Lower Bounds: No Customer too Far, No Shop too Small , 2013, ArXiv.

[15]  Mark Braverman,et al.  Finding Low Error Clusterings , 2009, COLT.

[16]  Maria-Florina Balcan,et al.  Clustering under approximation stability , 2013, JACM.

[17]  Benjamin Raichel,et al.  Fault Tolerant Clustering Revisited , 2013, CCCG.

[18]  Joydeep Ghosh,et al.  Cluster Ensembles --- A Knowledge Reuse Framework for Combining Multiple Partitions , 2002, J. Mach. Learn. Res..

[19]  M. Inaba Application of weighted Voronoi diagrams and randomization to variance-based k-clustering , 1994, SoCG 1994.

[20]  Graham Cormode,et al.  Approximation algorithms for clustering uncertain data , 2008, PODS.

[21]  Claire Cardie,et al.  Clustering with Instance-Level Constraints , 2000, AAAI/IAAI.

[22]  James B. Orlin,et al.  A faster strongly polynomial minimum cost flow algorithm , 1993, STOC '88.

[23]  Jing Gao,et al.  Semi-Supervised Clustering with Partial Background Information , 2006, SDM.

[24]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[25]  Silvio Lattanzi,et al.  A Local Algorithm for Finding Well-Connected Clusters , 2013, ICML.

[26]  Pankaj K. Agarwal,et al.  A near-linear time ε-approximation algorithm for geometric bipartite matching , 2012, STOC '12.

[27]  Amit Kumar,et al.  Faster Algorithms for the Constrained k-means Problem , 2015, Theory of Computing Systems.

[28]  Amit Kumar,et al.  A Simple D2-Sampling Based PTAS for k-Means and Other Clustering Problems , 2012, Algorithmica.

[29]  Marek Karpinski,et al.  Approximation schemes for clustering problems , 2003, STOC '03.

[30]  Ashwin Machanavajjhala,et al.  L-diversity: privacy beyond k-anonymity , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[31]  Aditya Bhaskara,et al.  Centrality of trees for capacitated k-center , 2014, Mathematical Programming.

[32]  J. Matoušek On Approximate Geometric K-clustering , 1999 .

[33]  Jing Gao,et al.  Finding Global Optimum for Truth Discovery: Entropy Based Geometric Variance , 2016, Symposium on Computational Geometry.

[34]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[35]  Piotr Indyk,et al.  Approximate clustering via core-sets , 2002, STOC '02.

[36]  Amit Kumar,et al.  Linear-time approximation schemes for clustering problems in any dimensions , 2010, JACM.

[37]  Ke Chen,et al.  On Coresets for k-Median and k-Means Clustering in Metric and Euclidean Spaces and Their Applications , 2009, SIAM J. Comput..

[38]  Jinhui Xu,et al.  Sub-linear Time Hybrid Approximations for Least Trimmed Squares Estimator and Related Problems , 2014, Symposium on Computational Geometry.

[39]  Avrim Blum,et al.  Stability Yields a PTAS for k-Median and k-Means Clustering , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[40]  Mary Inaba,et al.  Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering: (extended abstract) , 1994, SCG '94.

[41]  Samir Khuller,et al.  Fault tolerant K-center problems , 1997, Theor. Comput. Sci..

[42]  Esther M. Arkin,et al.  Bichromatic 2-Center of Pairs of Points , 2012, LATIN.

[43]  Samir Khuller,et al.  The Capacitated K-Center Problem , 2000, SIAM J. Discret. Math..

[44]  Raymond J. Mooney,et al.  A probabilistic framework for semi-supervised clustering , 2004, KDD.

[45]  Dan Feldman,et al.  A PTAS for k-means clustering based on weak coresets , 2007, SCG '07.

[46]  Sudipto Guha,et al.  Exceeding expectations and clustering uncertain data , 2009, PODS.

[47]  Jinhui Xu,et al.  Efficient approximation algorithms for clustering point-sets , 2010, Comput. Geom..

[48]  Samir Khuller,et al.  Achieving anonymity via clustering , 2006, PODS '06.

[49]  Jinhui Xu,et al.  A Unified Framework for Clustering Constrained Data without Locality Property , 2015, SODA.

[50]  Satish Rao,et al.  A Nearly Linear-Time Approximation Scheme for the Euclidean k-Median Problem , 2007, SIAM J. Comput..

[51]  Avrim Blum,et al.  Correlation Clustering , 2004, Machine Learning.

[52]  Deepayan Chakrabarti,et al.  Evolutionary clustering , 2006, KDD '06.

[53]  Pankaj K. Agarwal,et al.  Algorithms for the transportation problem in geometric settings , 2012, SODA.

[54]  Hu Ding,et al.  Faster Balanced Clusterings in High Dimension , 2018, Theor. Comput. Sci..

[55]  Nir Ailon,et al.  Aggregating inconsistent information: Ranking and clustering , 2008 .

[56]  Sariel Har-Peled,et al.  Net and Prune , 2014, J. ACM.

[57]  Ravishankar Krishnaswamy,et al.  The Hardness of Approximation of Euclidean k-Means , 2015, SoCG.

[58]  Samir Khuller,et al.  LP Rounding for k-Centers with Non-uniform Hard Capacities , 2012, 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science.

[59]  Alexandr Andoni,et al.  Parallel algorithms for geometric graph problems , 2013, STOC.

[60]  Richard M. Karp,et al.  Theoretical Improvements in Algorithmic Efficiency for Network Flow Problems , 1972, Combinatorial Optimization.

[61]  R. Ravi,et al.  The p-Neighbor k-Center Problem , 1998, Inf. Process. Lett..