Improved Bilevel Model: Fast and Optimal Algorithm with Theoretical Guarantee

Due to the hierarchical structure of many machine learning problems, bilevel programming has become increasingly important; however, the complicated coupling between the inner and outer problems makes it extremely challenging to solve. Although several intuitive algorithms based on automatic differentiation have been proposed and have succeeded in some applications, little attention has been paid to finding the optimal formulation of the bilevel model, and whether a better formulation exists remains an open problem. In this paper, we propose an improved bilevel model that converges faster and to better solutions than the current formulation. We provide a theoretical guarantee and evaluation results on two tasks: Data Hyper-Cleaning and Hyper-Representation Learning. The empirical results show that our model outperforms the current bilevel model by a large margin. \emph{This is concurrent work with \citet{liu2020generic}; we submitted it to ICML 2020 and now post it on arXiv for the record.}
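For context, a minimal sketch of the standard bilevel formulation the abstract refers to (generic notation, not necessarily the paper's): the outer variable $\lambda$ is chosen to minimize an outer objective $F$ evaluated at a solution of the inner problem,
\[
\min_{\lambda}\; F\bigl(\lambda,\, w^{*}(\lambda)\bigr)
\quad \text{s.t.} \quad
w^{*}(\lambda) \in \arg\min_{w}\, f(\lambda, w),
\]
where $F$ is, e.g., a validation loss and $f$ a training loss parameterized by $\lambda$. The automatic-differentiation-based algorithms mentioned above approximate the hypergradient $\nabla_{\lambda} F$ by differentiating through a (possibly truncated) run of the inner solver.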

[1] Ryan P. Adams, et al. Gradient-based Hyperparameter Optimization through Reversible Learning, 2015, ICML.

[2] Jin Zhang, et al. A Generic First-Order Algorithmic Framework for Bi-Level Programming Beyond Lower-Level Singleton, 2020, ICML.

[3] T. Zolezzi, et al. Well-Posed Optimization Problems, 1993.

[4] François Laviolette, et al. Sequential Model-Based Ensemble Optimization, 2014, UAI.

[5] Shimrit Shtern, et al. A First Order Method for Solving Convex Bilevel Optimization Problems, 2017, SIAM J. Optim.

[6] Yoshua Bengio, et al. Algorithms for Hyper-Parameter Optimization, 2011, NIPS.

[7] Roland Vollgraf, et al. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017, ArXiv.

[8] Byron Boots, et al. Truncated Back-propagation for Bilevel Optimization, 2018, AISTATS.

[9] Michael C. Ferris, et al. Finite perturbation of convex programs, 1991.

[10] Paolo Frasconi, et al. Forward and Reverse Gradient-Based Hyperparameter Optimization, 2017, ICML.

[11] Tapani Raiko, et al. Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters, 2015, ICML.

[12] Sergey Levine, et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, 2017, ICML.

[13] Gregory R. Koch, et al. Siamese Neural Networks for One-Shot Image Recognition, 2015.

[14] Pieter Abbeel, et al. Meta-Learning with Temporal Convolutions, 2017, ArXiv.

[15] Hong-Kun Xu. Viscosity Approximation Methods for Nonexpansive Mappings, 2004.

[16] Joshua B. Tenenbaum, et al. Human-level concept learning through probabilistic program induction, 2015, Science.

[17] Saeed Ghadimi, et al. Approximation Methods for Bilevel Programming, 2018, ArXiv (1802.02246).

[18] Soumith Chintala, et al. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2015, ICLR.

[19] Jeff Donahue, et al. Large Scale GAN Training for High Fidelity Natural Image Synthesis, 2018, ICLR.

[20] Léon Bottou, et al. Cold Case: The Lost MNIST Digits, 2019, NeurIPS.

[21] Natalia Gimelshein, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library, 2019, NeurIPS.

[22] Andrew Y. Ng, et al. Reading Digits in Natural Images with Unsupervised Feature Learning, 2011.

[23] Léon Bottou, et al. Wasserstein GAN, 2017, ArXiv.

[24] Nicolas P. Couellan. On the convergence of stochastic bi-level gradient methods, 2016.

[25] Isao Yamada, et al. Minimizing the Moreau Envelope of Nonsmooth Convex Functions over the Fixed Point Set of Certain Quasi-Nonexpansive Mappings, 2011, Fixed-Point Algorithms for Inverse Problems in Science and Engineering.

[26] Paolo Frasconi, et al. Bilevel Programming for Hyperparameter Optimization and Meta-Learning, 2018, ICML.

[27] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[28] Daan Wierstra, et al. Meta-Learning with Memory-Augmented Neural Networks, 2016, ICML.

[29] Fabian Pedregosa, et al. Hyperparameter optimization with approximate gradient, 2016, ICML.

[30] Mark W. Schmidt, et al. Online Learning Rate Adaptation with Hypergradient Descent, 2017, ICLR.

[31] Pieter Abbeel, et al. A Simple Neural Attentive Meta-Learner, 2017, ICLR.

[32] Aaron C. Courville, et al. Improved Training of Wasserstein GANs, 2017, NIPS.

[33] Alex Krizhevsky, et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[34] David Duvenaud, et al. Stochastic Hyperparameter Optimization through Hypernetworks, 2018, ArXiv.