Data mining with differential privacy

We consider the problem of data mining with formal privacy guarantees, given a data access interface based on the differential privacy framework. Differential privacy requires that computations be insensitive to changes in any particular individual's record, thereby limiting what the results can leak about any individual. The privacy-preserving interface guarantees unconditionally safe access to the data and requires no privacy expertise on the part of the data miner. However, as we show in the paper, naively using the interface to construct privacy-preserving data mining algorithms can lead to inferior results. We address this problem by considering the privacy and algorithmic requirements simultaneously, focusing on decision tree induction as a sample application. The privacy mechanism has a profound effect on the performance of the methods the data miner chooses; we demonstrate that this choice can make the difference between an accurate classifier and a completely useless one. Moreover, an improved algorithm can achieve the same level of accuracy and privacy as the naive implementation, but with an order of magnitude fewer learning samples.
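For intuition, a randomized computation M is epsilon-differentially private if for any two datasets D and D' that differ in a single record, and any set of outcomes S, Pr[M(D) in S] <= e^epsilon * Pr[M(D') in S]. The Python sketch below is a minimal illustration of the contrast the abstract alludes to, not the algorithm evaluated in the paper: a noisy counting query in the style of the SuLQ framework, perturbed with Laplace noise calibrated to the query's sensitivity, versus the exponential mechanism of McSherry and Talwar, which selects, say, a decision-tree split attribute with one private choice instead of one noisy count per candidate. The function names, the quality score, and all parameters are illustrative assumptions.

import math
import random

def laplace_noise(scale):
    # Draw from Laplace(0, scale) by inverting the CDF.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(records, predicate, epsilon):
    # A counting query has sensitivity 1: adding or removing one record
    # changes the count by at most 1, so Laplace(1/epsilon) noise suffices
    # for epsilon-differential privacy. Each call spends epsilon of the
    # budget, so evaluating many candidate splits this way forces small
    # per-query budgets and very noisy answers -- the "naive" approach.
    return sum(1 for r in records if predicate(r)) + laplace_noise(1.0 / epsilon)

def exponential_mechanism(records, candidates, quality, sensitivity, epsilon):
    # Select one candidate with probability proportional to
    # exp(epsilon * quality / (2 * sensitivity)). The entire budget backs
    # a single choice, e.g. picking the attribute with (approximately)
    # the best information gain when growing a decision tree.
    weights = [math.exp(epsilon * quality(records, c) / (2.0 * sensitivity))
               for c in candidates]
    threshold = random.random() * sum(weights)
    for candidate, weight in zip(candidates, weights):
        threshold -= weight
        if threshold <= 0.0:
            return candidate
    return candidates[-1]

Under this reading, splitting a fixed budget epsilon across one noisy count per candidate makes each answer noisier as the number of candidates grows, whereas the exponential mechanism pays for a single draw; this difference is one source of the gap in sample requirements described above.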
