Mitigating Planner Overfitting in Model-Based Reinforcement Learning

An agent with an inaccurate model of its environment faces a difficult choice: it can ignore the errors in its model and act in the real world in whatever way it determines is optimal with respect to that model, or it can take the more conservative stance of eschewing its model and optimizing its behavior solely via real-world interaction. The latter approach can be exceedingly slow to learn from experience, while the former can lead to "planner overfitting": aspects of the agent's behavior are optimized to exploit errors in its model. This paper explores an intermediate position in which the planner seeks to avoid overfitting through a kind of regularization of the plans it considers. We present three different approaches that demonstrably mitigate planner overfitting in reinforcement-learning environments.
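
The abstract does not spell out the three approaches, so the following is only a minimal sketch of one plausible planner regularizer of the kind alluded to: planning in an inaccurate learned model with a reduced discount factor, in the spirit of prior work on lower discount factors and effective planning horizons (Petrik & Scherrer; Jiang et al.). The MDP, sample budget, and discount values are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch: shorter planning horizons as regularization against planner overfitting.
# Assumptions: rewards are known, only the transition model is estimated, and the
# "true" MDP is a small random tabular MDP used purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 10, 3
gamma_eval = 0.99  # discount used to evaluate policies in the true MDP

# A random "true" MDP.
P_true = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # (S, A, S)
R_true = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # (S, A)

def estimate_model(n_samples_per_sa):
    """Maximum-likelihood transition model from a few samples per (s, a)."""
    P_hat = np.zeros_like(P_true)
    for s in range(n_states):
        for a in range(n_actions):
            nxt = rng.choice(n_states, size=n_samples_per_sa, p=P_true[s, a])
            counts = np.bincount(nxt, minlength=n_states)
            P_hat[s, a] = counts / counts.sum()
    return P_hat

def value_iteration(P, R, gamma, n_iters=1000):
    """Plan greedily in the (possibly inaccurate) model (P, R) at discount gamma."""
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = R + gamma * P @ V      # (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1)        # greedy policy

def evaluate(policy, gamma=gamma_eval):
    """Exact evaluation of the policy in the true MDP at the evaluation discount."""
    P_pi = P_true[np.arange(n_states), policy]   # (S, S)
    R_pi = R_true[np.arange(n_states), policy]   # (S,)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    return V.mean()

P_hat = estimate_model(n_samples_per_sa=5)       # deliberately crude model
for gamma_plan in (0.5, 0.9, 0.99):
    policy = value_iteration(P_hat, R_true, gamma_plan)
    print(f"gamma_plan={gamma_plan:.2f}  true value={evaluate(policy):.3f}")
```

With a crude model, planning at the full evaluation discount often exploits estimation errors in P_hat, whereas a smaller planning discount limits how far model errors can compound; sweeping gamma_plan as above shows the trade-off on any given model quality.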
