Coresets for Differentially Private K-Means Clustering and Applications to Privacy in Mobile Sensor Networks

Mobile sensor networks are a rich source of data. By collecting data from mobile sensor nodes carried by individuals in a user community, e.g. their smartphones, we can learn global information such as traffic congestion patterns in a city, the locations of key community facilities, and the locations of gathering places. Can we publish and run queries on mobile sensor network databases without disclosing information about individual nodes?

Differential privacy is a strong notion of privacy which guarantees that very little will be learned about individual records in the database, no matter what attackers already know or wish to learn. Still, there is no practical system that applies differentially private algorithms to clustering points in real databases. This paper describes the construction of small coresets for computing k-means clustering of a set of points while preserving differential privacy. As a result, we give the first k-means clustering algorithm that is both differentially private and has an approximation error that depends sub-linearly on the data's dimension d. Previous results introduced errors that are exponential in d.

We implemented this algorithm and used it to create differentially private location data from GPS tracks. Specifically, our algorithm allows clustering GPS databases generated from mobile nodes, while letting the user control the amount of noise introduced for privacy. We provide experimental results for the system and algorithms, and compare them to existing techniques. To the best of our knowledge, this is the first practical system that enables differentially private clustering on real data.
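
To make the trade-off between privacy and noise concrete, the following is a minimal sketch of the standard Laplace mechanism applied to a counting query over location points. It is not the coreset construction described in this paper. The function name `private_count`, the box bounds `lo` and `hi`, and the synthetic points are all hypothetical, and the sketch assumes that neighboring databases differ by adding or removing one record, so a count has sensitivity 1.

```python
import numpy as np

def private_count(points, lo, hi, epsilon, rng):
    """Release a noisy count of the points inside the axis-aligned box [lo, hi].

    Hypothetical helper for illustration only: it uses the standard Laplace
    mechanism, not the coreset-based algorithm of the paper.
    """
    points = np.asarray(points, dtype=float)
    # A point is inside the box if every coordinate lies within the bounds.
    inside = np.all((points >= lo) & (points <= hi), axis=1)
    true_count = int(inside.sum())
    # Assumption: adding or removing one record changes the count by at most 1,
    # so Laplace noise with scale 1/epsilon yields epsilon-differential privacy.
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

rng = np.random.default_rng(0)
gps = rng.uniform(0.0, 1.0, size=(1000, 2))  # synthetic stand-in for GPS points
noisy = private_count(gps, lo=np.array([0.0, 0.0]), hi=np.array([0.5, 0.5]),
                      epsilon=0.5, rng=rng)
print(noisy)
```

A smaller `epsilon` gives a stronger privacy guarantee at the cost of a larger noise scale `1/epsilon`, which is the kind of user-controlled trade-off the abstract refers to.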
