Learning exploration strategies in model-based reinforcement learning

Reinforcement learning (RL) is a paradigm for learning sequential decision-making tasks. However, the user typically must hand-tune exploration parameters for each domain and/or algorithm they use. In this work, we present an algorithm called LEO for learning these exploration strategies on-line. The algorithm uses bandit-type algorithms to adaptively select exploration strategies based on the rewards received when following them. We show empirically that this method performs well across a set of five domains, whereas no single set of parameters for a given algorithm is best across all of them. Our results demonstrate that LEO successfully learns the best exploration strategies on-line, increasing the reward received over static parameterizations of exploration and reducing the need for hand-tuning exploration parameters.
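
To make the bandit-over-strategies idea concrete, below is a minimal illustrative sketch in Python. It assumes a UCB1-style bandit that picks one exploration strategy per episode and updates its estimate with the episode return; the class name ExplorationBandit, the candidate strategy labels, and the episode-level update are hypothetical choices for illustration, since the abstract only states that LEO uses "bandit-type algorithms" and does not specify the exact rule.

import math
import random

class ExplorationBandit:
    """UCB1-style selection among candidate exploration strategies.

    Illustrative sketch only: the actual LEO algorithm may use a different
    bandit rule or update schedule than the one assumed here.
    """

    def __init__(self, strategy_names):
        self.names = list(strategy_names)
        self.counts = [0] * len(self.names)          # pulls per strategy
        self.mean_reward = [0.0] * len(self.names)   # running mean of returns
        self.total_pulls = 0

    def select(self):
        # Try each strategy once before applying the UCB1 rule.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        ucb = [
            self.mean_reward[i]
            + math.sqrt(2.0 * math.log(self.total_pulls) / self.counts[i])
            for i in range(len(self.names))
        ]
        return max(range(len(self.names)), key=lambda i: ucb[i])

    def update(self, i, episode_return):
        # Incremental update of the mean return observed under strategy i.
        self.counts[i] += 1
        self.total_pulls += 1
        self.mean_reward[i] += (episode_return - self.mean_reward[i]) / self.counts[i]


if __name__ == "__main__":
    # Toy usage: three hypothetical strategies with different (unknown) payoffs.
    bandit = ExplorationBandit(["epsilon=0.1", "epsilon=0.3", "intrinsic-bonus"])
    true_means = [0.4, 0.6, 0.8]
    for _ in range(500):
        arm = bandit.select()
        reward = random.gauss(true_means[arm], 0.1)  # stand-in for an episode return
        bandit.update(arm, reward)
    best = max(range(len(true_means)), key=lambda i: bandit.mean_reward[i])
    print("estimated best strategy:", bandit.names[best])

In practice the returns produced by a learning agent are non-stationary (a strategy's usefulness changes as the model improves), so an adversarial or discounted bandit rule could be substituted for the stationary UCB1 rule assumed above.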
