Differentially private data cubes: optimizing noise sources and consistency

Data cubes play an essential role in data analysis and decision support. In a data cube, data from a fact table is aggregated on subsets of the table's dimensions, forming a collection of smaller tables called cuboids. When the fact table includes sensitive data such as salary or diagnosis, publishing even a subset of its cuboids may compromise individuals' privacy. In this paper, we address this problem using differential privacy (DP), which provides provable privacy guarantees for individuals by adding noise to query answers. We choose an initial subset of cuboids to compute directly from the fact table, injecting DP noise as usual; and then compute the remaining cuboids from the initial set. Given a fixed privacy guarantee, we show that it is NP-hard to choose the initial set of cuboids so that the maximal noise over all published cuboids is minimized, or so that the number of cuboids with noise below a given threshold (precise cuboids) is maximized. We provide an efficient procedure with running time polynomial in the number of cuboids to select the initial set of cuboids, such that the maximal noise in all published cuboids will be within a factor (ln|L| + 1)^2 of the optimal, where |L| is the number of cuboids to be published, or the number of precise cuboids will be within a factor (1 - 1/e) of the optimal. We also show how to enforce consistency in the published cuboids while simultaneously improving their utility (reducing error). In an empirical evaluation on real and synthetic data, we report the amounts of error of different publishing algorithms, and show that our approaches outperform baselines significantly.

[1]  Sofya Raskhodnikova,et al.  What Can We Learn Privately? , 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[2]  Ninghui Li,et al.  t-Closeness: Privacy Beyond k-Anonymity and l-Diversity , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[3]  Nina Mishra,et al.  Releasing search queries and clicks privately , 2009, WWW '09.

[4]  Ilya Mironov,et al.  Differentially private recommender systems: building privacy into the net , 2009, KDD.

[5]  Dan Suciu,et al.  Boosting the Accuracy of Differentially-Private Queries Through Consistency , 2009, ArXiv.

[6]  Ashwin Machanavajjhala,et al.  Publishing Search Logs—A Comparative Study of Privacy Guarantees , 2012, IEEE Transactions on Knowledge and Data Engineering.

[7]  Suman Nath,et al.  Differentially private aggregation of distributed time-series with transformation and encryption , 2010, SIGMOD Conference.

[8]  Assaf Schuster,et al.  Data mining with differential privacy , 2010, KDD.

[9]  Andrew McGregor,et al.  Optimizing linear counting queries under differential privacy , 2009, PODS.

[10]  Cynthia Dwork,et al.  Privacy, accuracy, and consistency too: a holistic solution to contingency table release , 2007, PODS.

[11]  Cynthia Dwork,et al.  Differential Privacy: A Survey of Results , 2008, TAMC.

[12]  Ashwin Machanavajjhala,et al.  Privacy: Theory meets Practice on the Map , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[13]  Tim Roughgarden,et al.  Universally utility-maximizing privacy mechanisms , 2008, STOC '09.

[14]  Sofya Raskhodnikova,et al.  Smooth sensitivity and sampling in private data analysis , 2007, STOC '07.

[15]  Philip S. Yu,et al.  Privacy-preserving data publishing: A survey of recent developments , 2010, CSUR.

[16]  Richard S. Varga,et al.  Proof of Theorem 6 , 1983 .

[17]  Jiawei Han,et al.  High-Dimensional OLAP: A Minimal Cubing Approach , 2004, VLDB.

[18]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[19]  Richard S. Varga,et al.  Proof of Theorem 5 , 1983 .

[20]  Adam D. Smith,et al.  Composition attacks and auxiliary information in data privacy , 2008, KDD.

[21]  ASHWIN MACHANAVAJJHALA,et al.  L-diversity: privacy beyond k-anonymity , 2006, 22nd International Conference on Data Engineering (ICDE'06).

[22]  Cynthia Dwork,et al.  Calibrating Noise to Sensitivity in Private Data Analysis , 2006, TCC.

[23]  Cynthia Dwork,et al.  The Differential Privacy Frontier (Extended Abstract) , 2009, TCC.

[24]  Ashwin Machanavajjhala,et al.  l-Diversity: Privacy Beyond k-Anonymity , 2006, ICDE.

[25]  Felix Schlenk,et al.  Proof of Theorem 3 , 2005 .

[26]  Johannes Gehrke,et al.  Differential privacy via wavelet transforms , 2009, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[27]  Raymond Chi-Wing Wong,et al.  Minimality Attack in Privacy Preserving Data Publishing , 2007, VLDB.

[28]  J. Rodriguez,et al.  Problem (1) , 1994 .

[29]  R. Varga,et al.  Proof of Theorem 4 , 1983 .

[30]  Kamalika Chaudhuri,et al.  Privacy-preserving logistic regression , 2008, NIPS.

[31]  Haim Kaplan,et al.  Private coresets , 2009, STOC '09.

[32]  Ramakrishnan Srikant,et al.  Privacy preserving OLAP , 2005, SIGMOD '05.

[33]  Andrew McGregor,et al.  Optimizing Histogram Queries under Differential Privacy , 2009, ArXiv.

[34]  Alessandro Panconesi,et al.  Concentration of Measure for the Analysis of Randomized Algorithms , 2009 .

[35]  Daniel Kifer,et al.  Attacks on privacy and deFinetti's theorem , 2009, SIGMOD Conference.

[36]  Dan Suciu,et al.  Boosting the accuracy of differentially private histograms through consistency , 2009, Proc. VLDB Endow..

[37]  Pierangela Samarati,et al.  Generalizing Data to Provide Anonymity when Disclosing Information , 1998, PODS 1998.

[38]  Nabil R. Adam,et al.  Security-control methods for statistical databases: a comparative study , 1989, ACM Comput. Surv..

[39]  Adam D. Smith,et al.  Discovering frequent patterns in sensitive data , 2010, KDD.

[40]  Sushil Jajodia,et al.  Preserving Privacy in On-line Analytical Processing Data Cubes , 2007, Secure Data Management in Decentralized Systems.

[41]  Frank McSherry,et al.  Privacy integrated queries: an extensible platform for privacy-preserving data analysis , 2009, SIGMOD Conference.

[42]  Aaron Roth,et al.  A learning theory approach to noninteractive database privacy , 2011, JACM.