Differentially private classification with decision tree ensemble

Abstract In decision tree classification with differential privacy, computing impurity metrics such as information gain and the Gini index is query intensive, and more queries imply more noise addition. A straightforward implementation of differential privacy therefore often yields poor accuracy and stability. This motivates us to adopt a better metric for evaluating attributes when building the tree structure recursively. In this paper, we first give a detailed analysis of the statistical queries involved in decision tree induction. Second, we propose a private decision tree algorithm based on the noisy maximal vote, together with an effective privacy budget allocation strategy. Third, to boost accuracy and improve stability, we construct an ensemble model in which multiple private decision trees are built on bootstrapped samples. Extensive experiments on real datasets demonstrate that the proposed ensemble model provides accurate and reliable classification results.
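The abstract does not spell out how the noisy maximal vote is computed, so the following is only an illustrative sketch of the standard report-noisy-max building block it presumably relies on: add Laplace noise (scale 1/ε, since a counting query has sensitivity 1) to each class count at a node and return the class with the largest noisy count. The function name and interface are hypothetical, not taken from the paper.

```python
import numpy as np

def noisy_max_label(class_counts, epsilon, rng=None):
    """Select a node's class label via report-noisy-max:
    perturb each class count with Laplace(1/epsilon) noise
    (counting queries have sensitivity 1) and take the argmax."""
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(class_counts, dtype=float)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=counts.shape)
    return int(np.argmax(noisy))

# With a generous budget the noise is negligible, so the true
# majority class (index 1 here) is returned with near certainty.
label = noisy_max_label([5, 120, 7], epsilon=1000.0)
```

In an ensemble as described above, each bootstrapped tree would spend a share of the overall privacy budget ε on such noisy votes, with the per-query share fixed by the budget allocation strategy.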
