Sampling and partitioning for differential privacy

Differential privacy enjoys increasing popularity thanks to both a precise semantics of privacy and effective enforcement mechanisms. Many tools have been proposed to spread its use and to ease the task of the data scientist. The most promising among them relieve the user of privacy concerns entirely by transparently managing the privacy budget. However, their implementations prove delicate and can introduce flaws by violating some of the theoretical assumptions under which differential privacy is guaranteed. Moreover, such tools rely on assumptions that lead to over-approximations, which artificially reduce utility. In this paper we focus on a key mechanism that these tools support poorly: sampling. We demonstrate an attack on PINQ (McSherry, SIGMOD 2009), one such tool, that exploits the gap between its internal mechanics and the formal theory of the sampling operation. We then study a range of sampling methods and show how they can be implemented correctly in a system for differential privacy.
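To make the gap between theory and implementation concrete, recall the classical amplification-by-subsampling result: running an ε-differentially-private mechanism on a Bernoulli(q) subsample of the data (each record included independently with probability q) yields a ln(1 + q(e^ε − 1))-differentially-private mechanism, a strictly stronger guarantee. The theorem assumes per-record independent sampling and a fresh sample per run; an implementation that instead draws a fixed-size sample, or reuses one sample across queries, silently voids the guarantee. The Python sketch below is a minimal illustration of the correct pattern under those assumptions; the function names and the Laplace noisy count are illustrative, not PINQ's API or the paper's implementation.

```python
import math
import random

def bernoulli_sample(records, q):
    """Include each record independently with probability q.

    Per-record independence is exactly what the amplification
    theorem assumes; a fixed-size sample would break it.
    """
    return [r for r in records if random.random() < q]

def noisy_count(records, epsilon):
    """epsilon-DP count via the Laplace mechanism (sensitivity 1).

    Laplace(1/epsilon) noise, generated as the difference of two
    independent exponentials with rate epsilon.
    """
    return len(records) + random.expovariate(epsilon) - random.expovariate(epsilon)

def subsampled_noisy_count(records, q, epsilon):
    """Run the epsilon-DP count on a *fresh* Bernoulli(q) subsample.

    By amplification via subsampling, the composed mechanism is
    ln(1 + q * (exp(epsilon) - 1))-DP, which is at most q * epsilon.
    """
    return noisy_count(bernoulli_sample(records, q), epsilon)

if __name__ == "__main__":
    data = list(range(10_000))  # hypothetical dataset
    q, epsilon = 0.1, 1.0
    amplified = math.log(1 + q * (math.exp(epsilon) - 1))
    print(f"noisy answer: {subsampled_noisy_count(data, q, epsilon):.1f}")
    print(f"effective guarantee: {amplified:.3f}-DP instead of {epsilon}-DP")
```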
