A new analysis of differential privacy’s generalization guarantees (invited paper)

We give a new proof of the "transfer theorem" underlying adaptive data analysis: that any mechanism for answering adaptively chosen statistical queries that is differentially private and sample-accurate is also accurate out-of-sample. Our new proof is elementary and gives structural insights that we expect will be useful elsewhere. We show: 1) that differential privacy ensures that the expectation of any query on the posterior distribution on datasets induced by the transcript of the interaction is close to its true value on the data distribution, and 2) that sample accuracy on its own ensures that any query answer produced by the mechanism is close to its posterior expectation with high probability. This second claim follows from a thought experiment in which we imagine that the dataset is resampled from the posterior distribution after the mechanism has committed to its answers. The transfer theorem then follows by summing these two bounds; in particular, the proof avoids the "monitor argument" used to derive high-probability bounds in prior work. An upshot of our new proof technique is that the concrete bounds we obtain are substantially better than the best previously known bounds, even though the improvements are in the constants rather than the asymptotics (which are known to be tight). As we show, our new bounds outperform the naive "sample-splitting" baseline at dramatically smaller dataset sizes than the previous state of the art, bringing techniques from this literature closer to practicality.
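
As a rough sketch of how the two claims combine, the step of "summing these two bounds" can be read as a triangle inequality. The notation here is an assumption made for illustration and is not fixed by the abstract: $a_j$ is the mechanism's answer to the $j$-th query $q_j$, $\mathcal{P}$ is the data distribution, $\pi$ is the transcript of the interaction, and $Q_\pi$ is the posterior distribution on datasets induced by $\pi$.

```latex
% Assumed notation (not from the abstract): a_j = answer to query q_j,
% \mathcal{P} = data distribution, Q_\pi = posterior on datasets given transcript \pi.
\[
\bigl| a_j - q_j(\mathcal{P}) \bigr|
\;\le\;
\underbrace{\bigl| a_j - \mathbb{E}_{S' \sim Q_\pi}\!\left[ q_j(S') \right] \bigr|}_{\text{claim 2: controlled by sample accuracy}}
\;+\;
\underbrace{\bigl| \mathbb{E}_{S' \sim Q_\pi}\!\left[ q_j(S') \right] - q_j(\mathcal{P}) \bigr|}_{\text{claim 1: controlled by differential privacy}}
\]
```

Under this reading, the posterior expectation serves as the intermediate quantity: the first term is bounded with high probability, the second is bounded through privacy, and the out-of-sample error is at most their sum. The exact constants and probability statements are those of the paper and are not reproduced here.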
