Stability and Generalization of Bilevel Programming in Hyperparameter Optimization

The (gradient-based) bilevel programming framework is widely used in hyperparameter optimization and has achieved excellent empirical performance. Previous theoretical work mainly focuses on its optimization properties, while leaving the generalization analysis largely open. This paper attempts to address the issue by presenting an expectation bound with respect to the validation set based on uniform stability. Our results can explain some mysterious behaviours of bilevel programming in practice, for instance, overfitting to the validation set. We also present an expectation bound for the classical cross-validation algorithm. Our results suggest that, from a theoretical perspective, gradient-based algorithms can be better than cross-validation under certain conditions. Furthermore, we prove that regularization terms at both the outer and inner levels can relieve the overfitting problem in gradient-based algorithms. Experiments on feature learning and data reweighting for noisy labels corroborate our theoretical findings.
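For context, the following is a minimal sketch of the bilevel formulation discussed above; the notation (λ for hyperparameters, w for model parameters, D_tr / D_val for the training and validation sets, and μ, ν for the outer- and inner-level regularization weights) is our own illustration and is not taken verbatim from the paper.

```latex
% Bilevel hyperparameter optimization with regularization at both levels.
% Outer level: choose hyperparameters \lambda to minimize the validation loss
% of the inner solution w^*(\lambda); inner level: fit model parameters w on
% the training set for the given \lambda. The regularizers \mathcal{R}_{out}
% and \mathcal{R}_{in} stand for generic outer- and inner-level penalties.
\begin{align*}
\min_{\lambda}\;\;
  & \frac{1}{|\mathcal{D}_{\mathrm{val}}|}
    \sum_{z \in \mathcal{D}_{\mathrm{val}}} \ell\bigl(w^{*}(\lambda), z\bigr)
    \;+\; \mu\, \mathcal{R}_{\mathrm{out}}(\lambda)
    && \text{(outer level: validation loss)} \\
\text{s.t.}\;\;
  & w^{*}(\lambda) \in \operatorname*{arg\,min}_{w}\;
    \frac{1}{|\mathcal{D}_{\mathrm{tr}}|}
    \sum_{z \in \mathcal{D}_{\mathrm{tr}}} \ell\bigl(w, z; \lambda\bigr)
    \;+\; \nu\, \mathcal{R}_{\mathrm{in}}(w)
    && \text{(inner level: training loss)}
\end{align*}
```

Gradient-based methods approximate the hypergradient of the outer objective with respect to λ (for example, by differentiating through a few unrolled inner gradient steps). Because the outer objective is evaluated on a finite validation set, it can itself be overfit; the generic regularizers above indicate where the outer- and inner-level regularization terms studied in the paper would enter the formulation.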
