Clustering with Diversity

We consider the clustering with diversity problem: given a set of colored points in a metric space, partition them into clusters such that each cluster contains at least l points, all of which have distinct colors. We give a 2-approximation for this problem, for any l, when the objective is to minimize the maximum radius of any cluster. We show that this approximation ratio is optimal unless P = NP by providing a matching lower bound. We also develop several extensions of our algorithm that handle outliers. The problem is motivated mainly by applications in privacy-preserving data publication.
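To make the problem statement concrete, the following minimal sketch checks whether a given partition satisfies the diversity constraint and, if so, evaluates the maximum-radius objective. It is only an illustration of the definition, not the 2-approximation algorithm itself. The function names, the use of Euclidean distance, and the definition of a cluster's radius as the smallest distance that lets one of the cluster's own points cover all the others are assumptions made for this example.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def is_diverse(cluster, colors, l):
    """True if the cluster has at least l points and no repeated color."""
    cluster_colors = [colors[i] for i in cluster]
    return len(cluster_colors) >= l and len(set(cluster_colors)) == len(cluster_colors)


def cluster_radius(cluster, points):
    """Assumed definition: best center chosen among the cluster's own points."""
    return min(
        max(dist(points[center], points[p]) for p in cluster)
        for center in cluster
    )


def max_radius(clusters, points, colors, l):
    """Objective value of a partition, or None if it violates the constraint."""
    if not all(is_diverse(c, colors, l) for c in clusters):
        return None
    return max(cluster_radius(c, points) for c in clusters)


# Hypothetical toy instance: four points on a line, two colors, l = 2.
points = [(0, 0), (1, 0), (10, 0), (11, 0)]
colors = ["red", "blue", "red", "blue"]

good = max_radius([[0, 1], [2, 3]], points, colors, l=2)   # feasible, radius 1.0
bad = max_radius([[0, 2], [1, 3]], points, colors, l=2)    # repeated colors, infeasible
far = max_radius([[0, 3], [1, 2]], points, colors, l=2)    # feasible, radius 11.0
```

In this toy instance, grouping nearby points of different colors gives a feasible partition with small radius, while grouping the two red points together is infeasible regardless of how close they are.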
