Bilevel Programming for Hyperparameter Optimization and Meta-Learning

We introduce a framework based on bilevel programming that unifies gradient-based hyperparameter optimization and meta-learning. We show that an approximate version of the bilevel problem can be solved by explicitly taking into account the optimization dynamics for the inner objective. Depending on the specific setting, the outer variables play the role of either hyperparameters in a supervised learning problem or parameters of a meta-learner. We provide sufficient conditions under which solutions of the approximate problem converge to those of the exact problem. We instantiate our approach for meta-learning in the case of deep learning, where representation layers are treated as hyperparameters shared across a set of training episodes. In experiments, we confirm our theoretical findings, present encouraging results for few-shot learning, and contrast the bilevel approach with classical strategies for learning-to-learn.
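
To make the setting concrete, here is a sketch of the bilevel formulation described above, in illustrative notation (the symbols E, L_lambda, Phi_t and the iterate w_t are our labels for the outer objective, inner objective, inner dynamics and inner iterate; they are not fixed by the abstract). The exact problem minimizes the outer objective over the outer variables subject to the inner variables being a minimizer of the inner objective; the approximate problem replaces the exact inner minimizer with the result of T steps of an optimization dynamics:

\[
\min_{\lambda \in \Lambda} \; f(\lambda) = E(w_\lambda, \lambda)
\quad \text{s.t.} \quad
w_\lambda \in \operatorname*{arg\,min}_{w} L_\lambda(w),
\]
\[
\min_{\lambda \in \Lambda} \; f_T(\lambda) = E(w_T, \lambda)
\quad \text{s.t.} \quad
w_t = \Phi_t(w_{t-1}, \lambda), \qquad t = 1, \dots, T,
\]

where each \(\Phi_t\) is one step of the inner optimization dynamics (for instance a gradient-descent update on \(L_\lambda\)), so that \(f_T\) is differentiable in \(\lambda\) whenever \(E\) and the maps \(\Phi_t\) are.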

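A minimal, runnable sketch of this idea in JAX (not the authors' code: the ridge-regression inner objective, validation mean-squared-error outer objective, step size, horizon T and all variable names are illustrative assumptions). The hypergradient of the approximate outer objective f_T with respect to the outer variable is obtained by differentiating through T unrolled inner gradient steps:

    import jax
    import jax.numpy as jnp

    def inner_loss(w, lam, x, y):
        # Inner objective L_lambda: ridge regression whose regularization
        # strength exp(lam) is controlled by the outer variable lam.
        residual = x @ w - y
        return jnp.mean(residual ** 2) + jnp.exp(lam) * jnp.sum(w ** 2)

    def outer_loss(w, x, y):
        # Outer objective E: validation mean squared error at the inner iterate.
        return jnp.mean((x @ w - y) ** 2)

    def f_T(lam, w0, train, val, T=100, lr=0.1):
        # Approximate outer objective: run T inner gradient steps (the dynamics
        # Phi_t), then evaluate the outer loss at the final iterate w_T.
        x_tr, y_tr = train
        x_val, y_val = val
        w = w0
        for _ in range(T):
            w = w - lr * jax.grad(inner_loss)(w, lam, x_tr, y_tr)
        return outer_loss(w, x_val, y_val)

    # Hypergradient d f_T / d lam: reverse-mode differentiation through the unroll.
    hypergrad = jax.grad(f_T)

    # Toy usage with random data; the shapes are illustrative.
    k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
    train = (jax.random.normal(k1, (32, 5)), jax.random.normal(k2, (32,)))
    val = (jax.random.normal(k3, (16, 5)), jax.random.normal(k4, (16,)))
    print(hypergrad(jnp.array(0.0), jnp.zeros(5), train, val))

The outer variable could then be updated by gradient descent on this hypergradient; treating shared representation layers as part of the outer variables recovers the meta-learning instantiation mentioned in the abstract.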