A Fully Single Loop Algorithm for Bilevel Optimization without Hessian Inverse

In this paper, we propose a new Hessian-inverse-free Fully Single Loop Algorithm (FSLA) for bilevel optimization problems. Classic algorithms for bilevel optimization admit a double-loop structure, which is computationally expensive. Recently, several single-loop algorithms have been proposed that optimize the inner and outer variables alternately. However, these algorithms do not yet achieve a fully single loop, as they overlook the loop needed to evaluate the hyper-gradient for a given inner and outer state. To develop a fully single-loop algorithm, we first study the structure of the hyper-gradient and identify a general approximation formulation of hyper-gradient computation that encompasses several previous common approaches, e.g., back-propagation through time and conjugate gradient. Based on this formulation, we introduce a new state variable to maintain the historical hyper-gradient information. Combining our new formulation with the alternating update of the inner and outer variables, we propose an efficient fully single-loop algorithm. We theoretically show that the error introduced by the new state can be bounded and that our algorithm converges at a rate of O(ε^{-2}). Finally, we verify the efficacy of our algorithm empirically on multiple bilevel-optimization-based machine learning tasks.
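To make the update structure concrete, below is a minimal PyTorch sketch of a fully single-loop bilevel update of this flavor on a toy quadratic problem. It is an illustration under our own assumptions (the toy losses, names such as `inner_loss`, `outer_loss`, and the step sizes are ours), not the authors' exact FSLA implementation: a state vector `v` approximating the Hessian-inverse-vector product is refreshed by one fixed-point step per iteration, so no inner loop or explicit Hessian inverse is needed.

```python
import torch

# Hedged sketch of a fully single-loop bilevel update on a toy quadratic problem.
# All names and hyper-parameters below are illustrative assumptions.

torch.manual_seed(0)
d = 5
A = torch.randn(d, d)
Q = A @ A.t() + d * torch.eye(d)        # make the inner problem strongly convex in y
b = torch.randn(d)

def inner_loss(x, y):                   # g(x, y): strongly convex in y
    return 0.5 * y @ Q @ y - (x + b) @ y

def outer_loss(x, y):                   # f(x, y)
    return 0.5 * ((y - 1.0) ** 2).sum() + 0.01 * (x ** 2).sum()

x = torch.zeros(d, requires_grad=True)
y = torch.zeros(d, requires_grad=True)
v = torch.zeros(d)                      # state approximating [∇²_yy g]⁻¹ ∇_y f

alpha, beta, eta = 0.02, 0.05, 0.02     # inner, outer, and state step sizes
for t in range(500):
    g = inner_loss(x, y)
    gy = torch.autograd.grad(g, y, create_graph=True)[0]   # ∇_y g, kept differentiable

    f = outer_loss(x, y)
    fy = torch.autograd.grad(f, y, retain_graph=True)[0]   # ∇_y f
    fx = torch.autograd.grad(f, x)[0]                      # ∇_x f

    # One fixed-point step on v: v ← v − η(∇²_yy g · v − ∇_y f), no Hessian inverse.
    hvp_y = torch.autograd.grad(gy, y, grad_outputs=v, retain_graph=True)[0]
    v = (v - eta * (hvp_y - fy)).detach()

    # Hyper-gradient estimate: ∇_x f − ∇²_xy g · v, via a second Hessian-vector product.
    hvp_x = torch.autograd.grad(gy, x, grad_outputs=v)[0]
    hyper_grad = fx - hvp_x

    with torch.no_grad():               # single alternating update of y and x
        y -= alpha * gy
        x -= beta * hyper_grad

print(float(outer_loss(x, y)))          # outer objective after the single-loop run
```

Each iteration costs only two Hessian-vector products (obtained via automatic differentiation), and the state `v` carries the historical hyper-gradient information across iterations instead of being recomputed from scratch by an inner solver.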
