Differentially Private Clustering in High-Dimensional Euclidean Spaces

We study the problem of clustering sensitive data while preserving the privacy of the individuals represented in the dataset, a problem with broad applications in practical machine learning and data analysis. Although the problem has been widely studied for low-dimensional, discrete spaces, much remains unknown about private clustering in high-dimensional Euclidean spaces R^d. In this work, we give efficient differentially private algorithms that achieve strong guarantees for k-means and k-median clustering when d = Ω(polylog(n)). Our algorithm achieves clustering loss at most log(n)·OPT + poly(log n, d, k), improving on the state-of-the-art result of √d·OPT + poly(log n, d, k). We also study the case where the data points are s-sparse and show that the clustering loss can scale logarithmically with d, i.e., log(n)·OPT + poly(log n, log d, k, s). Experiments on both synthetic and real datasets verify the effectiveness of the proposed method.
