Controlling Bias in Adaptive Data Analysis Using Information Theory

Modern data is messy and high-dimensional, and it is often unclear a priori which questions are the right ones to ask. Instead, the analyst typically needs to use the data to search for interesting analyses to perform and hypotheses to test. This is an adaptive process, in which the choice of the next analysis depends on the results of previous analyses of the same data. It is widely recognized that this process, even if well-intentioned, can lead to biases and false discoveries, contributing to the crisis of reproducibility in science. But while adaptivity renders standard statistical theory invalid, folklore and experience suggest that not all types of adaptive analysis are equally at risk of false discoveries. In this paper, we propose a general information-theoretic framework to quantify and provably bound the bias and other statistics of an arbitrary adaptive analysis process. We prove that our mutual-information-based bound is tight in natural models, and then use it to give rigorous insights into when commonly used procedures do or do not lead to substantially biased estimation. We first consider several popular feature selection protocols, such as rank selection and variance-based selection. We then consider the practice of adding random noise to the observations or to the reported statistics, which is advocated by related ideas from differential privacy and blinded data analysis. We discuss the connections between these techniques and our framework, and supplement our results with illustrative simulations.
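To convey the flavor of the framework (the exact statement, constants, and conditions are as given in the paper): if each statistic $\phi_i(X)$ has mean $\mu_i$ and is $\sigma$-sub-Gaussian, and $T$ is the (possibly randomized) index the analyst selects after seeing $\phi(X) = (\phi_1(X), \ldots, \phi_m(X))$, then the bias of the reported value is controlled by the mutual information between the selection and the statistics:

$$\bigl|\mathbb{E}\left[\phi_T(X) - \mu_T\right]\bigr| \;\le\; \sigma \sqrt{2\, I\bigl(T;\, \phi(X)\bigr)}.$$

Since $I(T; \phi(X)) \le H(T) \le \log m$, selecting among $m$ candidate statistics can bias the reported value by at most $\sigma\sqrt{2\log m}$, and procedures that leak less information about the data (for instance, by randomizing the selection) incur correspondingly less bias.

The minimal simulation below illustrates both the bound and the noise-addition mitigation for rank selection. It is an illustrative sketch, not code from the paper; the parameter choices (m, n, trials) and the noise scale of 2*sigma are arbitrary values picked for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, trials = 1000, 100, 2000   # candidate statistics, samples per statistic, Monte Carlo reps
sigma = 1.0 / np.sqrt(n)         # each sample mean of n standard normals is sigma-sub-Gaussian

naive, noisy = [], []
for _ in range(trials):
    phi = rng.normal(0.0, sigma, size=m)   # m sample means; every true mean mu_i is 0
    # Rank selection: report the largest statistic -- the classic winner's curse.
    naive.append(phi.max())
    # Mitigation: select using a noise-perturbed copy, then report the *clean*
    # value at the selected index (differential-privacy-style noise addition).
    T = np.argmax(phi + rng.normal(0.0, 2 * sigma, size=m))
    noisy.append(phi[T])

# With I(T; phi) <= log m, the bound predicts bias at most sigma * sqrt(2 log m).
bound = sigma * np.sqrt(2 * np.log(m))
print(f"bias of reported max      : {np.mean(naive):.4f}")   # close to the bound
print(f"bias after noisy selection: {np.mean(noisy):.4f}")   # noticeably smaller
print(f"mutual-information bound  : {bound:.4f}")
```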
