Paper information - Learning to Learn without Gradient Descent by Gradient Descent

We learn recurrent neural network optimizers trained on simple synthetic functions by gradient descent. We show that these learned optimizers exhibit a remarkable degree of transfer in that they can be used to efficiently optimize a broad range of derivative-free black-box functions, including Gaussian process bandits, simple control objectives, global optimization benchmarks and hyper-parameter tuning tasks. Up to the training horizon, the learned optimizers learn to trade off exploration and exploitation, and compare favourably with heavily engineered Bayesian optimization packages for hyper-parameter tuning.
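The query loop implied by the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: a plain tanh RNN stands in for whatever recurrent architecture the authors use, the weights are random and untrained (the meta-training by gradient descent on synthetic functions is omitted entirely), and the names `init_params`, `rollout`, and `quadratic` are hypothetical. The only point shown is that the optimizer proposes each new query from past queries and observed values, without ever seeing a gradient of the black-box function.

```python
import math
import random


def init_params(dim, hidden, seed=0):
    """Random, untrained weights for a tiny tanh RNN (illustrative only)."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_in": mat(hidden, dim + 1),   # input is the last query plus its value
        "W_h": mat(hidden, hidden),     # recurrent weights
        "W_out": mat(dim, hidden),      # maps hidden state to the next query
    }


def matvec(M, v):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def rollout(f, dim, horizon, params):
    """Query the black-box f exactly `horizon` times; only values of f are observed."""
    h = [0.0] * len(params["W_h"])
    x = [0.0] * dim                     # arbitrary first query (an assumption)
    y = f(x)
    history = [(x, y)]
    for _ in range(horizon - 1):
        # The RNN sees the previous query and its observed value, never a gradient of f.
        a = matvec(params["W_in"], x + [y])
        b = matvec(params["W_h"], h)
        h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        x = matvec(params["W_out"], h)  # next query point
        y = f(x)
        history.append((x, y))
    best = min(value for _, value in history)
    return best, history


def quadratic(x):
    """Toy black-box objective with its minimum at (1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)


params = init_params(dim=2, hidden=8)
best, history = rollout(quadratic, dim=2, horizon=10, params=params)
```

With untrained weights the queries are essentially arbitrary; in the setting the abstract describes, the weights would instead be fitted by gradient descent so that the values found within the training horizon are as good as possible.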

[1] W. R. Thompson. On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, 1933.

[2] Lewis B. Ward. Reminiscence and rote learning, 1937.

[3] H. Harlow, et al. The formation of learning sets, 1949, Psychological Review.

[4] Harold J. Kushner, et al. A New Method of Locating the Maximum Point of an Arbitrary Multipeak Curve in the Presence of Noise, 1964.

[5] E. Kehoe. A layered network model of associative learning: learning to learn and configuration, 1988, Psychological Review.

[6] J. Mockus, et al. The Bayesian approach to global optimization, 1989.

[7] Richard J. Mammone, et al. Meta-neural networks that learn by learning, 1992, [Proceedings 1992] IJCNN International Joint Conference on Neural Networks.

[8] R. J. Williams, et al. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning, 1992, Machine Learning.

[9] Richard S. Sutton, et al. Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta, 1992, AAAI.

[10] J. Schmidhuber, et al. A neural network that embeds its own meta-levels, 1993, IEEE International Conference on Neural Networks.

[11] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[12] Donald R. Jones, et al. Efficient Global Optimization of Expensive Black-Box Functions, 1998, J. Glob. Optim.

[13] Sebastian Thrun, et al. Learning to Learn, 1998, Springer US.

[14] Nicol N. Schraudolph, et al. Local Gain Adaptation in Stochastic Gradient Descent, 1999.

[15] Magnus Thor Jonsson, et al. Evolution and design of distributed learning rules, 2000, 2000 IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks. Proceedings of the First IEEE Symposium on Combinations of Evolutionary Computation and Neural Networks (Cat. No.00.

[16] Sepp Hochreiter, et al. Learning to Learn Using Gradient Descent, 2001, ICANN.

[17] Donald R. Jones, et al. A Taxonomy of Global Optimization Methods Based on Response Surfaces, 2001, J. Glob. Optim.

[18] Samy Bengio, et al. On the search for new learning rules for ANNs, 1995, Neural Processing Letters.

[19] Yoshua Bengio, et al. On the Optimization of a Synaptic Learning Rule, 2007.

[20] Ron Kohavi, et al. Controlled experiments on the web: survey and practical guide, 2009, Data Mining and Knowledge Discovery.

[21] Eric Walter, et al. An informational approach to the global optimization of expensive-to-evaluate functions, 2006, J. Glob. Optim.

[22] Carl E. Rasmussen, et al. Gaussian processes for machine learning, 2005, Adaptive computation and machine learning.

[23] Rémi Munos, et al. Pure Exploration in Multi-armed Bandits Problems, 2009, ALT.

[24] Nando de Freitas, et al. New inference strategies for solving Markov Decision Processes using reversible jump MCMC, 2009, UAI.

[25] Andreas Krause, et al. Information-Theoretic Regret Bounds for Gaussian Process Optimization in the Bandit Setting, 2009, IEEE Transactions on Information Theory.

[26] Nando de Freitas, et al. A Tutorial on Bayesian Optimization of Expensive Cost Functions, with Application to Active User Modeling and Hierarchical Reinforcement Learning, 2010, ArXiv.

[27] Steven L. Scott, et al. A modern Bayesian look at the multi-armed bandit, 2010.