Gradient-Aware Model-based Policy Search

Traditional model-based reinforcement learning approaches learn a model of the environment dynamics without explicitly considering how it will be used by the agent. In the presence of misspecified model classes, this can lead to poor estimates, as relevant available information is ignored. In this paper, we introduce a novel model-based policy search approach that exploits knowledge of the current agent policy to learn an approximate transition model, focusing on the portions of the environment that are most relevant for policy improvement. We leverage a weighting scheme, derived from the minimization of the error on the model-based policy gradient estimator, to define a suitable objective function that is optimized for learning the approximate transition model. Then, we integrate this procedure into a batch policy improvement algorithm, named Gradient-Aware Model-based Policy Search (GAMPS), which iteratively learns a transition model and uses it, together with the collected trajectories, to compute the new policy parameters. Finally, we empirically validate GAMPS on benchmark domains, analyzing and discussing its properties.
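
Below is a minimal, self-contained sketch of the batch iteration described above, on a toy one-dimensional linear-quadratic task. It is not the paper's exact derivation: the gradient-aware weights are replaced by a simplified proxy (each transition weighted by gamma^t times the magnitude of the cumulative policy score), and the model-based gradient estimator uses short rollouts of the learned model to approximate action values. All function and variable names (step_env, collect_batch, fit_weighted_model, gamps_iteration, and so on) are illustrative assumptions, not taken from the paper.

```python
# A hedged sketch of a GAMPS-style loop on a toy 1-D linear-quadratic task.
# The weighting scheme and gradient estimator are simplified stand-ins for
# the paper's exact formulation.
import numpy as np

rng = np.random.default_rng(0)
GAMMA, SIGMA, HORIZON, N_TRAJ = 0.95, 0.5, 20, 50


def step_env(s, u):
    """True (unknown) dynamics and reward of the toy task."""
    s_next = 0.9 * s + 0.5 * u + 0.05 * rng.standard_normal()
    reward = -(s ** 2 + 0.1 * u ** 2)
    return s_next, reward


def collect_batch(theta):
    """Roll out the current Gaussian policy u ~ N(theta * s, SIGMA^2)."""
    batch = []
    for _ in range(N_TRAJ):
        s, traj = rng.standard_normal(), []
        for _ in range(HORIZON):
            u = theta * s + SIGMA * rng.standard_normal()
            s_next, r = step_env(s, u)
            traj.append((s, u, r, s_next))
            s = s_next
        batch.append(traj)
    return batch


def fit_weighted_model(batch, theta):
    """Weighted least-squares fit of a linear model s' ~ w1*s + w2*u.

    Each transition is weighted by gamma^t times the magnitude of the
    cumulative policy score: a crude proxy for the gradient-aware weights.
    """
    X, y, w = [], [], []
    for traj in batch:
        score_sum = 0.0
        for t, (s, u, _, s_next) in enumerate(traj):
            score_sum += (u - theta * s) * s / SIGMA ** 2  # policy score
            X.append([s, u])
            y.append(s_next)
            w.append((GAMMA ** t) * (abs(score_sum) + 1e-3))
    X, y, w = np.array(X), np.array(y), np.array(w)
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # (w1, w2)


def model_return(model, s, theta, steps=10):
    """Approximate return from s by rolling the learned model forward."""
    w1, w2 = model
    ret = 0.0
    for k in range(steps):
        u = theta * s  # mean action of the current policy
        ret += (GAMMA ** k) * -(s ** 2 + 0.1 * u ** 2)
        s = w1 * s + w2 * u
    return ret


def gamps_iteration(theta, lr=1e-2):
    """One batch iteration: collect data, fit the model, update the policy."""
    batch = collect_batch(theta)
    model = fit_weighted_model(batch, theta)
    grad = 0.0
    for traj in batch:
        for t, (s, u, _, s_next) in enumerate(traj):
            score = (u - theta * s) * s / SIGMA ** 2
            q_hat = -(s ** 2 + 0.1 * u ** 2) + GAMMA * model_return(model, s_next, theta)
            grad += (GAMMA ** t) * score * q_hat
    grad /= len(batch)
    return theta + lr * grad


theta = 0.0
for it in range(20):
    theta = gamps_iteration(theta)
print("final policy gain:", theta)
```

The weighted fit is where the gradient-aware idea enters in this sketch: transitions that matter more for the policy-gradient estimate receive higher weight when learning the approximate model, instead of treating all transitions equally as plain maximum likelihood would.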
