Using inaccurate models in reinforcement learning

In the model-based policy search approach to reinforcement learning (RL), policies are found using a model (or "simulator") of the Markov decision process. However, for high-dimensional continuous-state tasks it can be extremely difficult to build an accurate model, so the algorithm often returns a policy that works in simulation but not in real life. The other extreme, model-free RL, tends to require an infeasibly large number of real-life trials. In this paper, we present a hybrid algorithm that requires only an approximate model and only a small number of real-life trials. The key idea is to successively "ground" the policy evaluations using real-life trials, while relying on the approximate model to suggest local changes. Our theoretical results show that this algorithm achieves near-optimal performance in the real system even when the model is only approximate. Empirical results also demonstrate that, given only a crude model and a small number of real-life trials, our algorithm can obtain near-optimal performance in the real system.
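
To make the key idea concrete, below is a minimal Python sketch of the kind of hybrid loop the abstract describes. It is an illustration under stated assumptions, not the authors' algorithm: the interfaces real_rollout, model_rollout, and local_candidates, the acceptance rule, and the toy linear system are all hypothetical choices introduced here. Real-life trials ground the evaluation of the current policy and of any proposed change, while the approximate model is trusted only to rank nearby candidate policies.

def hybrid_policy_search(policy, real_rollout, model_rollout,
                         local_candidates, n_iters=10):
    """Sketch: ground evaluations with real trials; let the model propose local changes.

    real_rollout(policy)        -> (start_state, return) from one real-life trial
    model_rollout(policy, x0)   -> return predicted by the approximate model
    local_candidates(policy)    -> iterable of nearby candidate policies
    (All three interfaces are illustrative assumptions.)
    """
    for _ in range(n_iters):
        # One real-life trial of the current policy gives a grounded baseline.
        x0, real_return = real_rollout(policy)

        # The (possibly inaccurate) model is used only to rank *local* changes
        # to the current policy, all simulated from the real start state.
        candidate = max(local_candidates(policy),
                        key=lambda p: model_rollout(p, x0))

        # Ground the suggestion with one more real trial: accept the change
        # only if it actually improves real-life performance over the baseline.
        _, candidate_return = real_rollout(candidate)
        if candidate_return > real_return:
            policy = candidate
    return policy


# Toy usage: a 1-D linear system whose true gain (1.2) differs from the model's
# gain (1.0); the policy is a scalar feedback gain k with control u = -k * x.
def make_rollout(gain, horizon=20):
    # Simulate x_{t+1} = gain * x_t + u_t under u_t = -k * x_t and return the
    # negative quadratic cost (higher is better).
    def run(k, x0=1.0):
        x, ret = x0, 0.0
        for _ in range(horizon):
            u = -k * x
            ret -= x * x + 0.1 * u * u
            x = gain * x + u
        return ret
    return run

real_sim = make_rollout(gain=1.2)    # stands in for the real system
model_sim = make_rollout(gain=1.0)   # crude model with the wrong gain

real_rollout = lambda k: (1.0, real_sim(k))
model_rollout = lambda k, x0: model_sim(k, x0)
local_candidates = lambda k: [k + d for d in (-0.1, -0.05, 0.05, 0.1)]

k_learned = hybrid_policy_search(0.0, real_rollout, model_rollout,
                                 local_candidates, n_iters=25)
print("learned feedback gain:", round(k_learned, 2))

Because every accepted change must improve the return measured on the real system, the loop never trusts the model for absolute performance; the model is asked only which local change looks most promising, which is where even a crude model tends to be informative. In the toy above, the initial gain leaves the real system unstable, and the learned gain stabilizes it despite the model's incorrect dynamics.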
