Differentially Private Histogram Publishing through Lossy Compression

Differential privacy has emerged as one of the most promising privacy models for private data release. It can be used to release different types of data, and, in particular, histograms, which provide useful summaries of a dataset. Several differentially private histogram releasing schemes have been proposed recently. However, most of them directly add noise to the histogram counts, resulting in undesirable accuracy. In this paper, we propose two sanitization techniques that exploit the inherent redundancy of real-life datasets in order to boost the accuracy of histograms. They lossily compress the data and sanitize the compressed data. Our first scheme is an optimization of the Fourier Perturbation Algorithm (FPA) presented in [13]. It improves the accuracy of the initial FPA by a factor of 10. The other scheme relies on clustering and exploits the redundancy between bins. Our extensive experimental evaluation over various real-life and synthetic datasets demonstrates that our techniques preserve very accurate distributions and considerably improve the accuracy of range queries over attributed histograms.

[1]  Daniel A. Spielman,et al.  Spectral Graph Theory and its Applications , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[2]  Yin Yang,et al.  Differentially private histogram publication , 2012, The VLDB Journal.

[3]  Yin Yang,et al.  Compressive mechanism: utilizing sparse representation in differential privacy , 2011, WPES.

[4]  Cynthia Dwork,et al.  Privacy, accuracy, and consistency too: a holistic solution to contingency table release , 2007, PODS.

[5]  Johannes Gehrke,et al.  Differential privacy via wavelet transforms , 2009, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[6]  Torsten Suel,et al.  Optimal Histograms with Quality Guarantees , 1998, VLDB.

[7]  Stephen E. Fienberg,et al.  Differential Privacy and the Risk-Utility Tradeoff for Multi-dimensional Contingency Tables , 2010, Privacy in Statistical Databases.

[8]  Frank McSherry,et al.  Privacy integrated queries: an extensible platform for privacy-preserving data analysis , 2009, SIGMOD Conference.

[9]  Chun Yuan,et al.  Differentially Private Data Release through Multidimensional Partitioning , 2010, Secure Data Management.

[10]  Katrina Ligett,et al.  A Simple and Practical Algorithm for Differentially Private Data Release , 2010, NIPS.

[11]  Cynthia Dwork,et al.  Calibrating Noise to Sensitivity in Private Data Analysis , 2006, TCC.

[12]  I. Johnstone,et al.  Ideal denoising in an orthonormal basis chosen from a library of bases , 1994 .

[13]  Suman Nath,et al.  Differentially private aggregation of distributed time-series with transformation and encryption , 2010, SIGMOD Conference.

[14]  Dan Suciu,et al.  Boosting the accuracy of differentially private histograms through consistency , 2009, Proc. VLDB Endow..

[15]  Andrew McGregor,et al.  Optimizing linear counting queries under differential privacy , 2009, PODS.

[16]  Kunal Talwar,et al.  Mechanism Design via Differential Privacy , 2007, 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07).

[17]  David D. Jensen,et al.  Accurate Estimation of the Degree Distribution of Private Networks , 2009, 2009 Ninth IEEE International Conference on Data Mining.