Machine Learning's Dropout Training is Distributionally Robust Optimal

This paper shows that dropout training in Generalized Linear Models is the minimax solution of a two-player, zero-sum game where an adversarial nature corrupts a statistician's covariates using a multiplicative nonparametric errors-in-variables model. In this game---known as a Distributionally Robust Optimization problem---nature's least favorable distribution is dropout noise, where nature independently deletes entries of the covariate vector with some fixed probability $\delta$. Our decision-theoretic analysis shows that dropout training---the statistician's minimax strategy in the game---indeed provides out-of-sample expected loss guarantees for distributions that arise from multiplicative perturbations of in-sample data. This paper also provides a novel, parallelizable, Unbiased Multi-Level Monte Carlo algorithm to speed up the implementation of dropout training. Our algorithm has a much smaller computational cost than the naive implementation of dropout, provided the number of data points is much smaller than the dimension of the covariate vector.
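To make the objective concrete, the following Python sketch illustrates dropout training for one particular GLM (logistic regression) via the naive Monte Carlo implementation: at every gradient step the covariate matrix is corrupted by independent Bernoulli deletion masks with deletion probability $\delta$, and the loss averaged over masks is minimized by gradient descent. This is only an illustrative sketch under stated assumptions, not the paper's algorithm; the inverted $1/(1-\delta)$ rescaling, mask counts, step size, and toy data are all assumptions added for the demo.

```python
# Minimal, illustrative sketch (not the paper's code) of dropout training in a
# Generalized Linear Model -- here logistic regression -- viewed as minimizing
# the expected loss under multiplicative dropout noise on the covariates.
# The 1/(1 - delta) rescaling, mask counts, and step size are assumptions.
import numpy as np

rng = np.random.default_rng(0)


def dropout_logistic_loss(beta, X, y, delta, n_masks=200):
    """Monte Carlo estimate of E_xi[ log-loss(beta; xi * x, y) ], where each
    mask entry deletes the covariate with probability delta and otherwise
    rescales it by 1/(1 - delta)."""
    total = 0.0
    for _ in range(n_masks):
        keep = rng.random(X.shape) > delta            # delete entries w.p. delta
        X_tilde = X * keep / (1.0 - delta)            # multiplicative corruption
        z = X_tilde @ beta
        # numerically stable logistic log-loss: log(1 + e^z) - y*z
        total += np.mean(np.maximum(z, 0) + np.log1p(np.exp(-np.abs(z))) - y * z)
    return total / n_masks


def dropout_gradient(beta, X, y, delta, n_masks=50):
    """Monte Carlo gradient of the dropout objective above."""
    g = np.zeros(X.shape[1])
    for _ in range(n_masks):
        keep = rng.random(X.shape) > delta
        X_tilde = X * keep / (1.0 - delta)
        p = 1.0 / (1.0 + np.exp(-(X_tilde @ beta)))   # GLM mean function
        g += X_tilde.T @ (p - y) / X.shape[0]
    return g / n_masks


# Toy data: the naive implementation averages over freshly drawn masks at every
# gradient step; this repeated averaging is the cost that a faster (e.g.,
# multilevel Monte Carlo) implementation aims to reduce.
n, d = 100, 5
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 0]))).astype(float)

beta = np.zeros(d)
for _ in range(300):
    beta -= 0.1 * dropout_gradient(beta, X, y, delta=0.5)

print("dropout-trained coefficients:", np.round(beta, 3))
print("dropout objective at solution:", dropout_logistic_loss(beta, X, y, delta=0.5))
```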
