Pythia: Data Dependent Differentially Private Algorithm Selection

Differential privacy has emerged as a preferred standard for ensuring privacy in analysis tasks on sensitive datasets. Recent algorithms achieve significantly lower error by adapting to properties of the input data; these so-called data-dependent algorithms therefore have different error rates on different inputs. The result is a complex and growing landscape of algorithms with no clear winner that offers low error across all datasets. Consequently, the best possible error rates are not attainable in practice, because the data curator cannot know which algorithm to select before actually running it. We address this challenge with a novel meta-algorithm that relieves the data curator of the burden of algorithm selection. It learns, from non-sensitive data, the association between dataset properties and the best-performing algorithm; at deployment time it privately measures low-sensitivity properties of the input and uses the results to select a good algorithm. The result is Pythia, an end-to-end differentially private system that we show offers improvements over using any single algorithm alone. We empirically demonstrate the benefits of Pythia for releasing histograms, answering 1- and 2-dimensional range queries, and constructing private Naive Bayes classifiers.
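The two-stage idea can be illustrated with a minimal, hypothetical Python sketch (not the authors' implementation): part of the privacy budget is spent measuring low-sensitivity features of the input with the Laplace mechanism, a selector learned offline from non-sensitive data maps the noisy features to a candidate algorithm, and the remaining budget is spent running that algorithm. The budget split, the two features, the candidate algorithms, and the selection rule below are all illustrative assumptions.

```python
import numpy as np


def laplace_mech(value, sensitivity, eps, rng):
    """Release a scalar with the Laplace mechanism."""
    return value + rng.laplace(scale=sensitivity / eps)


def identity_laplace(x, eps):
    """Candidate 1: add independent Laplace noise to every histogram cell."""
    rng = np.random.default_rng()
    return x + rng.laplace(scale=1.0 / eps, size=len(x))


def uniform_estimate(x, eps):
    """Candidate 2: privately measure the total and spread it uniformly
    (often preferable on very sparse inputs)."""
    rng = np.random.default_rng()
    noisy_total = x.sum() + rng.laplace(scale=1.0 / eps)
    return np.full(len(x), noisy_total / len(x))


def select_algorithm(features):
    """Stand-in for the selector learned offline on non-sensitive data;
    this particular threshold rule is made up for illustration."""
    return uniform_estimate if features["sparsity"] > 0.8 else identity_laplace


def pythia_sketch(x, eps, rng=None):
    """Meta-algorithm sketch: spend eps1 measuring low-sensitivity features,
    pick an algorithm from the noisy features, then run it with eps2."""
    rng = rng or np.random.default_rng()
    eps1, eps2 = 0.1 * eps, 0.9 * eps  # assumed budget split

    # Both features have sensitivity 1 for neighboring histograms.
    scale = laplace_mech(x.sum(), 1.0, eps1 / 2, rng)
    nonzero = laplace_mech(np.count_nonzero(x), 1.0, eps1 / 2, rng)
    features = {"scale": scale, "sparsity": 1.0 - nonzero / len(x)}

    algorithm = select_algorithm(features)
    return algorithm(x, eps2)


# Example: release a noisy 100-bin histogram with total budget eps = 1.
x = np.random.poisson(5, size=100).astype(float)
x_hat = pythia_sketch(x, eps=1.0)
```

Because the feature measurements and the chosen algorithm consume disjoint portions of the budget, sequential composition yields the end-to-end privacy guarantee.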
