Online meta-learning by parallel algorithm competition

The efficiency of reinforcement learning algorithms depends critically on a few meta-parameters that modulate the learning updates and the trade-off between exploration and exploitation. The adaptation of these meta-parameters is an open question, one that has arguably become more pressing with the success of deep reinforcement learning. The long learning times in domains such as Atari 2600 video games make it infeasible to perform comprehensive searches over appropriate meta-parameter values. In this study, we propose the Online Meta-learning by Parallel Algorithm Competition (OMPAC) method, a novel Lamarckian evolutionary approach to online meta-parameter adaptation. The population consists of several instances of a reinforcement learning algorithm that are run in parallel with small differences in initial meta-parameter values. After a fixed number of learning episodes, the instances are selected based on their performance on the task at hand, i.e., their fitness. Before learning continues, Gaussian noise is added to the meta-parameters with a predefined probability. We validate the OMPAC method by improving the state-of-the-art results in stochastic SZ-Tetris and in 10x10 Tetris by 31% and 84%, respectively, and by improving the learning speed and performance of deep Sarsa(λ) agents in the Atari 2600 domain.
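
To make the procedure concrete, the sketch below outlines one generation of OMPAC-style meta-parameter adaptation. It is a minimal illustration, not the authors' implementation: the agent interface (a `train()` method returning episode scores and a `meta_params` dictionary), fitness-proportional resampling, and the mutation scale are all assumptions made for the example.

```python
import copy
import random


class OMPACPopulation:
    """Minimal sketch of Online Meta-learning by Parallel Algorithm Competition.

    Each member is an RL agent instance with its own meta-parameters (e.g.
    learning rate, lambda, exploration temperature). Members learn for a fixed
    number of episodes, are reselected according to their fitness, and have
    their meta-parameters perturbed by Gaussian noise with a fixed probability.
    The copy is Lamarckian: learned weights are inherited along with the
    meta-parameters.
    """

    def __init__(self, make_agent, population_size=10,
                 episodes_per_generation=50, mutation_prob=0.2,
                 mutation_scale=0.1, seed=0):
        # make_agent is a hypothetical factory returning an agent object with
        # train(num_episodes) -> fitness and a meta_params dict.
        self.rng = random.Random(seed)
        self.population = [make_agent() for _ in range(population_size)]
        self.episodes_per_generation = episodes_per_generation
        self.mutation_prob = mutation_prob
        self.mutation_scale = mutation_scale

    def step_generation(self):
        # 1. Learning phase: each instance trains independently
        #    (run sequentially here; in practice the instances run in parallel).
        fitness = [agent.train(self.episodes_per_generation)
                   for agent in self.population]

        # 2. Selection: resample members in proportion to fitness
        #    (an assumed scheme; other selection operators could be used).
        weights = fitness if sum(fitness) > 0 else [1.0] * len(fitness)
        selected = self.rng.choices(self.population, weights=weights,
                                    k=len(self.population))

        # 3. Lamarckian copy plus Gaussian mutation of meta-parameters.
        new_population = []
        for parent in selected:
            child = copy.deepcopy(parent)  # learned weights are inherited too
            for name, value in child.meta_params.items():
                if self.rng.random() < self.mutation_prob:
                    child.meta_params[name] = value + self.rng.gauss(
                        0.0, self.mutation_scale * abs(value))
            new_population.append(child)
        self.population = new_population
        return max(fitness)
```

Calling `step_generation()` repeatedly interleaves learning and selection, so the meta-parameters adapt online during training rather than being tuned in a separate outer search.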
