Publishing set-valued data via differential privacy

Set-valued data provides enormous opportunities for various data mining tasks. In this paper, we study the problem of publishing set-valued data for data mining tasks under the rigorous differential privacy model. All existing publishing methods for set-valued data are based on partition-based privacy models, for example k-anonymity, which are vulnerable to privacy attacks based on background knowledge. In contrast, differential privacy provides strong privacy guarantees independent of an adversary's background knowledge and computational power. However, existing data publishing approaches for differential privacy are inadequate in both utility and scalability in the context of set-valued data due to its high dimensionality. We demonstrate that set-valued data can be released efficiently under differential privacy with guaranteed utility with the help of context-free taxonomy trees. We propose a probabilistic top-down partitioning algorithm to generate a differentially private release, which scales linearly with the input data size. We also discuss the applicability of our idea to relational data. We prove that our result is (ε, δ)-useful for the class of counting queries, the foundation of many data mining tasks. Through extensive experiments on real-life set-valued datasets, we show that our approach maintains high utility for counting queries and frequent itemset mining and scales to large datasets.
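
To make the top-down idea concrete, the following is a minimal, illustrative Python sketch, not the paper's actual algorithm: records are split along a small item taxonomy, each partition's size is perturbed with Laplace noise, and partitions whose noisy count falls below a threshold are pruned. The taxonomy encoding, the eps_per_level and threshold parameters, and the helper functions are assumptions introduced for this sketch; in particular, the paper's algorithm keeps sub-partitions disjoint and allocates the privacy budget adaptively across taxonomy levels, which this simplified version does not attempt.

    import math
    import random

    def laplace_noise(scale):
        """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
        u = random.random() - 0.5                   # uniform on (-0.5, 0.5)
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    def leaves(taxonomy, node):
        """All leaf items below `node` in the taxonomy (dict: node -> children)."""
        children = taxonomy.get(node, [])
        if not children:
            return {node}
        out = set()
        for child in children:
            out |= leaves(taxonomy, child)
        return out

    def private_topdown_partition(records, taxonomy, node, eps_per_level, threshold):
        """
        Walk the taxonomy top-down: perturb the current partition's size with
        Laplace noise, prune it if the noisy count falls below `threshold`,
        otherwise split the records among the node's children and recurse.
        Taxonomy leaves emit (item, noisy count) pairs that form the release.
        NOTE: illustrative only -- disjointness of sub-partitions and adaptive
        budget allocation from the paper are deliberately glossed over here.
        """
        noisy_count = len(records) + laplace_noise(1.0 / eps_per_level)
        if noisy_count < threshold:
            return []                               # prune small partitions
        children = taxonomy.get(node, [])
        if not children:                            # taxonomy leaf: release it
            return [(node, max(0, int(round(noisy_count))))]
        released = []
        for child in children:
            under_child = leaves(taxonomy, child)
            sub = [r for r in records if r & under_child]
            released.extend(private_topdown_partition(sub, taxonomy, child,
                                                      eps_per_level, threshold))
        return released

    # Toy usage with a two-level taxonomy over four items.
    taxonomy = {'ALL': ['Food', 'Electronics'],
                'Food': ['milk', 'bread'],
                'Electronics': ['phone', 'laptop']}
    records = [{'milk'}, {'milk', 'bread'}, {'phone'}, {'milk', 'phone'}, {'bread'}]
    print(private_topdown_partition(records, taxonomy, 'ALL',
                                    eps_per_level=0.5, threshold=2.0))

The pruning step is what keeps both the output size and the running time manageable for high-dimensional set-valued data: only partitions that (noisily) contain enough records are expanded further, so sparse regions of the item space are never enumerated.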
