Smooth sensitivity and sampling in private data analysis

We introduce a new, generic framework for private data analysis. The goal of private data analysis is to release aggregate information about a data set while protecting the privacy of the individuals whose information the data set contains. Our framework allows one to release functions f of the data with instance-based additive noise. That is, the noise magnitude is determined not only by the function we want to release, but also by the database itself. One of the challenges is to ensure that the noise magnitude does not leak information about the database. To address that, we calibrate the noise magnitude to the smooth sensitivity of f on the database x, a measure of the variability of f in the neighborhood of the instance x. The new framework greatly expands the applicability of output perturbation, a technique for protecting individuals' privacy by adding a small amount of random noise to the released statistics. To our knowledge, this is the first formal analysis of the effect of instance-based noise in the context of data privacy. Our framework raises many interesting algorithmic questions. Namely, to apply the framework one must compute or approximate the smooth sensitivity of f on x. We show how to do this efficiently for several different functions, including the median and the cost of the minimum spanning tree. We also give a generic procedure based on sampling that allows one to release f(x) accurately on many databases x. This procedure is applicable even when no efficient algorithm for approximating the smooth sensitivity of f is known or when f is given as a black box. We illustrate the procedure by applying it to k-SED (k-means) clustering and learning mixtures of Gaussians.
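The median is the abstract's running example of a function whose smooth sensitivity can be computed efficiently. The sketch below, in Python, follows the standard formulation of that computation for a database of values clipped to [0, lam]: the smooth sensitivity is the maximum over k of exp(-k*eps) times the largest change the median can undergo after modifying k entries. The noise calibration shown afterward (Cauchy noise with smoothing parameter eps/6 and magnitude 6*S*/eps) uses constants commonly stated for this framework; treat the exact constants, and the quadratic-time loop, as assumptions of this sketch rather than the paper's optimized algorithm.

```python
import math
import random

def smooth_sensitivity_median(x, eps, lam):
    """Smooth sensitivity of the median for values clipped to [0, lam].

    Computes max over k of exp(-k*eps) * max_{0<=t<=k+1} (x[m+t] - x[m+t-k-1]),
    with the convention x[i] = 0 for i < 1 and x[i] = lam for i > n (1-indexed),
    where m is the index of the median (n assumed odd for simplicity).
    Quadratic-time sketch; faster algorithms exist.
    """
    n = len(x)
    xs = sorted(min(max(v, 0.0), lam) for v in x)

    def val(i):                      # 1-indexed access with boundary padding
        if i < 1:
            return 0.0
        if i > n:
            return lam
        return xs[i - 1]

    m = (n + 1) // 2                 # 1-indexed position of the median
    best = 0.0
    for k in range(n + 1):
        widest = max(val(m + t) - val(m + t - k - 1) for t in range(k + 2))
        best = max(best, math.exp(-k * eps) * widest)
    return best

def noisy_median(x, eps, lam):
    """Release the median with instance-based noise scaled to smooth sensitivity.

    Assumption: Cauchy noise with smoothing parameter eps/6 and magnitude
    6 * S*(x) / eps is one admissible calibration; other distributions and
    constants also fit the framework.
    """
    xs = sorted(min(max(v, 0.0), lam) for v in x)
    med = xs[(len(xs) - 1) // 2]
    s_star = smooth_sensitivity_median(x, eps / 6.0, lam)
    noise = math.tan(math.pi * (random.random() - 0.5))   # standard Cauchy sample
    return med + (6.0 * s_star / eps) * noise
```

On a database where the values near the median are tightly clustered, the smooth sensitivity (and hence the noise) is far smaller than the worst-case global sensitivity lam, which is the point of instance-based calibration.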

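The sampling-based procedure mentioned in the abstract treats f as a black box: evaluate it on disjoint random blocks of the database and privately aggregate the block values. The outline below is a simplification under stated assumptions: f is assumed to return a single bounded number, the stride-based random partition and the `aggregate` callable are illustrative choices (the noisy_median sketch above is one natural aggregator), and the paper's actual aggregation function and accuracy analysis are more involved. The `kmeans_cost` name in the usage comment is hypothetical.

```python
import random

def sample_and_aggregate(x, f, aggregate, num_blocks):
    """Illustrative sketch: split the database into disjoint random blocks,
    evaluate f on each block as a black box, and release a privately
    aggregated value of the per-block results.

    `aggregate` is any differentially private aggregation of a short list
    of numbers (e.g., the noisy_median sketch above); this outline does not
    reproduce the paper's specific aggregation or its guarantees.
    """
    x = list(x)
    random.shuffle(x)                               # random disjoint blocks
    blocks = [x[i::num_blocks] for i in range(num_blocks)]
    block_values = [f(block) for block in blocks]   # f evaluated as a black box
    return aggregate(block_values)                  # private aggregation step

# Hypothetical usage: privately estimate a clustering cost, aggregating the
# per-block costs with the noisy_median sketch above.
# cost = sample_and_aggregate(points, kmeans_cost,
#                             lambda vals: noisy_median(vals, eps=0.5, lam=100.0),
#                             num_blocks=25)
```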