Variational Model-based Policy Optimization

Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with data collected from interaction with the real system in order to alleviate the data-efficiency problem in RL. However, designing such algorithms is often challenging because the bias in simulated data may overshadow the ease of data generation. A potential solution to this challenge is to jointly learn and improve model and policy using a universal objective function. In this paper, we leverage the connection between RL and probabilistic inference, and formulate such an objective function as a variational lower bound of a log-likelihood. This allows us to use expectation maximization (EM) and iteratively fix a baseline policy and learn a variational distribution, consisting of a model and a policy (E-step), followed by improving the baseline policy given the learned variational distribution (M-step). We propose model-based and model-free policy iteration (actor-critic) style algorithms for the E-step and show how the variational distribution learned by them can be used to optimize the M-step in a fully model-based fashion. Our experiments on a number of continuous control tasks show that, despite being more complex, our model-based (E-step) algorithm, called {\em variational model-based policy optimization} (VMBPO), is more sample-efficient and robust to hyper-parameter tuning than its model-free (E-step) counterpart. On the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms and demonstrate its sample efficiency and strong performance.
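To make the EM structure described above concrete, the following is a minimal sketch of a variational lower bound of this kind, written under the standard RL-as-inference setup with a binary optimality variable $O$ and temperature $\eta$; the notation ($q$, $\widehat{P}$, $\widehat{\pi}$, $\eta$) is illustrative and is not claimed to match the paper's exact formulation.
\[
\log p_{\pi_\theta}(O = 1)
= \log \mathbb{E}_{\tau \sim p_{\pi_\theta}}\!\Big[\exp\Big(\tfrac{1}{\eta}\textstyle\sum_{t} r(s_t, a_t)\Big)\Big]
\;\ge\;
\mathbb{E}_{\tau \sim q}\Big[\tfrac{1}{\eta}\textstyle\sum_{t} r(s_t, a_t)\Big]
- \mathrm{KL}\big(q(\tau)\,\|\,p_{\pi_\theta}(\tau)\big)
\;=:\; \mathcal{L}(q, \pi_\theta),
\]
where the variational trajectory distribution factorizes into a learned model and policy,
$q(\tau) = \mu(s_0)\prod_{t} \widehat{P}(s_{t+1}\mid s_t, a_t)\,\widehat{\pi}(a_t\mid s_t)$,
while $p_{\pi_\theta}(\tau)$ uses the true dynamics and the baseline policy $\pi_\theta$. EM then alternates
\[
\text{E-step:}\quad q^{(k+1)} \in \arg\max_{q}\; \mathcal{L}\big(q, \pi_{\theta^{(k)}}\big),
\qquad
\text{M-step:}\quad \theta^{(k+1)} \in \arg\max_{\theta}\; \mathcal{L}\big(q^{(k+1)}, \pi_\theta\big).
\]
In this sketch, the E-step fits both $\widehat{P}$ and $\widehat{\pi}$ to tighten the bound, and the M-step improves the baseline policy against the fixed variational distribution, matching the alternation outlined in the abstract.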
