Using inaccurate models in reinforcement learning

In the model-based policy search approach to reinforcement learning (RL), policies are found using a model (or "simulator") of the Markov decision process. However, for high-dimensional continuous-state tasks it can be extremely difficult to build an accurate model, so the algorithm often returns a policy that works in simulation but not in real life. The other extreme, model-free RL, tends to require an infeasibly large number of real-life trials. In this paper, we present a hybrid algorithm that requires only an approximate model and only a small number of real-life trials. The key idea is to successively "ground" the policy evaluations using real-life trials, while relying on the approximate model to suggest local changes. Our theoretical results show that this algorithm achieves near-optimal performance in the real system even when the model is only approximate. Empirical results also demonstrate that, given only a crude model and a small number of real-life trials, our algorithm can obtain near-optimal performance in the real system.
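
To make the key idea concrete, below is a minimal Python sketch of the kind of hybrid loop the abstract describes. It is an illustration under stated assumptions, not the authors' algorithm: the interfaces real_rollout, model_rollout, and local_candidates, the acceptance rule, and the toy linear system are all hypothetical choices introduced here. Real-life trials ground the evaluation of the current policy and of any proposed change, while the approximate model is trusted only to rank nearby candidate policies.

def hybrid_policy_search(policy, real_rollout, model_rollout,
                         local_candidates, n_iters=10):
    """Sketch: ground evaluations with real trials; let the model propose local changes.

    real_rollout(policy)        -> (start_state, return) from one real-life trial
    model_rollout(policy, x0)   -> return predicted by the approximate model
    local_candidates(policy)    -> iterable of nearby candidate policies
    (All three interfaces are illustrative assumptions.)
    """
    for _ in range(n_iters):
        # One real-life trial of the current policy gives a grounded baseline.
        x0, real_return = real_rollout(policy)

        # The (possibly inaccurate) model is used only to rank *local* changes
        # to the current policy, all simulated from the real start state.
        candidate = max(local_candidates(policy),
                        key=lambda p: model_rollout(p, x0))

        # Ground the suggestion with one more real trial: accept the change
        # only if it actually improves real-life performance over the baseline.
        _, candidate_return = real_rollout(candidate)
        if candidate_return > real_return:
            policy = candidate
    return policy


# Toy usage: a 1-D linear system whose true gain (1.2) differs from the model's
# gain (1.0); the policy is a scalar feedback gain k with control u = -k * x.
def make_rollout(gain, horizon=20):
    # Simulate x_{t+1} = gain * x_t + u_t under u_t = -k * x_t and return the
    # negative quadratic cost (higher is better).
    def run(k, x0=1.0):
        x, ret = x0, 0.0
        for _ in range(horizon):
            u = -k * x
            ret -= x * x + 0.1 * u * u
            x = gain * x + u
        return ret
    return run

real_sim = make_rollout(gain=1.2)    # stands in for the real system
model_sim = make_rollout(gain=1.0)   # crude model with the wrong gain

real_rollout = lambda k: (1.0, real_sim(k))
model_rollout = lambda k, x0: model_sim(k, x0)
local_candidates = lambda k: [k + d for d in (-0.1, -0.05, 0.05, 0.1)]

k_learned = hybrid_policy_search(0.0, real_rollout, model_rollout,
                                 local_candidates, n_iters=25)
print("learned feedback gain:", round(k_learned, 2))

Because every accepted change must improve the return measured on the real system, the loop never trusts the model for absolute performance; the model is asked only which local change looks most promising, which is where even a crude model tends to be informative. In the toy above, the initial gain leaves the real system unstable, and the learned gain stabilizes it despite the model's incorrect dynamics.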
