Differentially Private Random Decision Forests Using Smooth Sensitivity

IEEE Transactions on Knowledge and Data Engineering

We propose a new differentially private decision forest algorithm that minimizes both the number of queries required and the sensitivity of those queries. To do so, we build an ensemble of random decision trees that avoids querying the private data except to find the majority class label in the leaf nodes. Rather than using a count query to return the class counts, as the current state of the art does, we use the Exponential Mechanism to output only the class label itself. This drastically reduces the sensitivity of the query, often by several orders of magnitude, which in turn reduces the amount of noise that must be added to preserve privacy. Our improved sensitivity is achieved by using "smooth sensitivity", which takes into account the specific data used in the query rather than assuming the worst-case scenario. We also extend earlier work on the optimal depth of random decision trees to handle continuous features, not just discrete features. This, along with several other improvements, allows us to create a differentially private decision forest with substantially higher predictive power than the current state of the art.
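The leaf-labelling step described above can be illustrated with a minimal sketch of the standard Exponential Mechanism, which samples a label with probability proportional to exp(epsilon * utility / (2 * sensitivity)). This is not the authors' implementation: the function name, the choice of the raw class count as the utility, and the default sensitivity of 1 are assumptions made for illustration. In particular, the sketch does not compute smooth sensitivity; the sensitivity is simply passed in as a parameter.

```python
import math
import random


def exponential_mechanism_label(class_counts, epsilon, sensitivity=1.0, rng=None):
    """Sample one class label from a leaf using the Exponential Mechanism.

    class_counts: dict mapping label -> count of training records in the leaf.
    epsilon:      privacy budget spent on this single query.
    sensitivity:  sensitivity of the utility function (assumed here to be 1,
                  the worst-case change in a count when one record changes).
    rng:          optional random.Random instance, for reproducibility.
    """
    if not class_counts:
        raise ValueError("class_counts must contain at least one label")
    rng = rng or random.Random()
    labels = list(class_counts)
    # Utility of a label is its count (an illustrative assumption).
    # Subtracting the maximum keeps exp() from overflowing and does not
    # change the sampling distribution.
    best = max(class_counts.values())
    weights = [
        math.exp(epsilon * (class_counts[label] - best) / (2.0 * sensitivity))
        for label in labels
    ]
    # Only the sampled label is released, never the counts themselves.
    return rng.choices(labels, weights=weights, k=1)[0]


if __name__ == "__main__":
    leaf = {"yes": 50, "no": 3}  # hypothetical leaf contents
    print(exponential_mechanism_label(leaf, epsilon=1.0))
```

Because the weights depend on epsilon divided by the sensitivity, a smaller sensitivity concentrates the probability mass on the true majority label for the same privacy budget, which is the effect the abstract attributes to using smooth sensitivity.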
