Differentially Private Algorithms for Empirical Machine Learning

An important use of private data is to build machine learning classifiers. While there is a burgeoning literature on differentially private classification algorithms, we find that they are impractical in real applications for two reasons. First, existing differentially private classifiers provide poor accuracy on real-world datasets. Second, there is no known differentially private algorithm for empirically evaluating a private classifier on a private test dataset. In this paper, we develop differentially private algorithms that mirror real-world empirical machine learning workflows. We treat the private classifier training algorithm as a black box and present private algorithms for selecting the features that are input to it. Although adding a preprocessing step diverts some of the privacy budget from the actual classification process (potentially making it noisier and less accurate), we show that our novel preprocessing techniques significantly increase classifier accuracy on three real-world datasets. We also present the first private algorithms for empirically constructing receiver operating characteristic (ROC) curves on a private test set.
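To make the privacy-budget trade-off concrete, here is a hedged sketch (not the paper's actual method) of differentially private feature selection using the exponential mechanism, a standard building block in this literature. The feature scores, the even split of the budget epsilon across the k picks, and the sensitivity bound are all illustrative assumptions:

```python
import numpy as np

def private_top_k_features(scores, k, epsilon, sensitivity=1.0):
    """Select k feature indices via k applications of the exponential
    mechanism, splitting the privacy budget epsilon evenly across picks.
    `scores` is an assumed per-feature utility (e.g., a relevance score
    with the given sensitivity); this is an illustrative sketch only."""
    eps_per_pick = epsilon / k
    remaining = list(range(len(scores)))
    chosen = []
    rng = np.random.default_rng()
    for _ in range(k):
        s = np.array([scores[i] for i in remaining], dtype=float)
        # Exponential mechanism: sample index i with probability
        # proportional to exp(eps_per_pick * score_i / (2 * sensitivity)).
        logits = eps_per_pick * s / (2.0 * sensitivity)
        logits -= logits.max()  # shift for numerical stability
        probs = np.exp(logits)
        probs /= probs.sum()
        idx = rng.choice(len(remaining), p=probs)
        chosen.append(remaining.pop(idx))
    return chosen
```

Each pick consumes epsilon/k of the budget, so by sequential composition the whole selection is epsilon-differentially private; whatever budget is spent here is no longer available to the downstream classifier, which is the trade-off the abstract highlights.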
