The Everlasting Database: Statistical Validity at a Fair Price

The problem of handling adaptivity in data analysis, intentional or not, permeates a variety of fields, including test-set overfitting in ML challenges and the accumulation of invalid scientific discoveries. We propose a mechanism for answering an arbitrarily long sequence of potentially adaptive statistical queries, by charging a price for each query and using the proceeds to collect additional samples. Crucially, we guarantee statistical validity without any assumptions on how the queries are generated. We also ensure with high probability that the cost for $M$ non-adaptive queries is $O(\log M)$, while the cost to a potentially adaptive user who makes $M$ queries that do not depend on any others is $O(\sqrt{M})$.

[1]  Jake K. Byrnes,et al.  Genome-wide association study of copy number variation in 16,000 cases of eight common diseases and 3,000 shared controls , 2010, Nature.

[2]  Jonathan Ullman,et al.  Preventing False Discovery in Interactive Data Analysis Is Hard , 2014, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[3]  J. Ioannidis Why Most Published Research Findings Are False , 2005, PLoS medicine.

[4]  M. Baker 1,500 scientists lift the lid on reproducibility , 2016, Nature.

[5]  Moritz Hardt Climbing a shaky ladder: Better adaptive risk estimation , 2017, ArXiv.

[6]  Cynthia Dwork,et al.  Calibrating Noise to Sensitivity in Private Data Analysis , 2006, TCC.

[7]  A. Gelman,et al.  The statistical crisis in science , 2014 .

[8]  Toniann Pitassi,et al.  The reusable holdout: Preserving validity in adaptive data analysis , 2015, Science.

[9]  Thomas Steinke,et al.  Generalization for Adaptively-chosen Estimators via Stable Median , 2017, COLT.

[10]  Kobbi Nissim,et al.  On the Generalization Properties of Differential Privacy , 2015, ArXiv.

[11]  Toniann Pitassi,et al.  Generalization in Adaptive Data Analysis and Holdout Reuse , 2015, NIPS.

[12]  Jake K. Byrnes,et al.  Genome-wide association study of copy number variation in 16,000 cases of eight common diseases and 3,000 shared controls , 2010 .

[13]  Raef Bassily,et al.  Algorithmic stability for adaptive data analysis , 2015, STOC.

[14]  S. Rosset,et al.  Generalized α‐investing: definitions, optimality results and application to public databases , 2014 .

[15]  S. Rosset,et al.  Generalized Alpha Investing: Definitions, Optimality Results, and Application to Public Databases , 2013, 1307.0522.

[16]  J. T. Childers,et al.  Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC , 2012 .

[17]  Thomas Steinke,et al.  Concentrated Differential Privacy: Simplifications, Extensions, and Lower Bounds , 2016, TCC.

[18]  Thomas Steinke,et al.  Interactive fingerprinting codes and the hardness of preventing false discovery , 2014, 2016 Information Theory and Applications Workshop (ITA).

[19]  Jianxin Shi,et al.  Developing and evaluating polygenic risk prediction models for stratified disease prevention , 2016, Nature Reviews Genetics.

[20]  Toniann Pitassi,et al.  Preserving Statistical Validity in Adaptive Data Analysis , 2014, STOC.

[21]  Avrim Blum,et al.  The Ladder: A Reliable Leaderboard for Machine Learning Competitions , 2015, ICML.