Learning exploration strategies in model-based reinforcement learning

Reinforcement learning (RL) is a paradigm for learning sequential decision-making tasks. However, the user typically must hand-tune exploration parameters for each domain and/or algorithm they use. In this work, we present an algorithm called LEO for learning these exploration strategies on-line. The algorithm uses bandit-type algorithms to adaptively select exploration strategies based on the rewards received when following them. We show empirically that this method performs well across a set of five domains, whereas no single set of parameters for a given algorithm is best across all of them. Our results demonstrate that LEO successfully learns the best exploration strategies on-line, increasing the reward received over static parameterizations of exploration and reducing the need for hand-tuning exploration parameters.
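
To make the bandit-over-strategies idea concrete, below is a minimal illustrative sketch in Python. It assumes a UCB1-style bandit that picks one exploration strategy per episode and updates its estimate with the episode return; the class name ExplorationBandit, the candidate strategy labels, and the episode-level update are hypothetical choices for illustration, since the abstract only states that LEO uses "bandit-type algorithms" and does not specify the exact rule.

import math
import random

class ExplorationBandit:
    """UCB1-style selection among candidate exploration strategies.

    Illustrative sketch only: the actual LEO algorithm may use a different
    bandit rule or update schedule than the one assumed here.
    """

    def __init__(self, strategy_names):
        self.names = list(strategy_names)
        self.counts = [0] * len(self.names)          # pulls per strategy
        self.mean_reward = [0.0] * len(self.names)   # running mean of returns
        self.total_pulls = 0

    def select(self):
        # Try each strategy once before applying the UCB1 rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        ucb = [
            self.mean_reward[i]
            + math.sqrt(2.0 * math.log(self.total_pulls) / self.counts[i])
            for i in range(len(self.names))
        ]
        return max(range(len(self.names)), key=lambda i: ucb[i])

    def update(self, i, episode_return):
        # Incremental update of the mean return observed under strategy i.
        self.counts[i] += 1
        self.total_pulls += 1
        self.mean_reward[i] += (episode_return - self.mean_reward[i]) / self.counts[i]


if __name__ == "__main__":
    # Toy usage: three hypothetical strategies with different (unknown) payoffs.
    bandit = ExplorationBandit(["epsilon=0.1", "epsilon=0.3", "intrinsic-bonus"])
    true_means = [0.4, 0.6, 0.8]
    for _ in range(500):
        arm = bandit.select()
        reward = random.gauss(true_means[arm], 0.1)  # stand-in for an episode return
        bandit.update(arm, reward)
    best = max(range(len(true_means)), key=lambda i: bandit.mean_reward[i])
    print("estimated best strategy:", bandit.names[best])

In practice the returns produced by a learning agent are non-stationary (a strategy's usefulness changes as the model improves), so an adversarial or discounted bandit rule could be substituted for the stationary UCB1 rule assumed above.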
