Maximum a Posteriori Policy Optimisation

We introduce a new algorithm for reinforcement learning called Maximum a Posteriori Policy Optimisation (MPO), based on coordinate ascent on a relative entropy objective. We show that several existing methods can be directly related to our derivation. We develop two off-policy algorithms and demonstrate that they are competitive with the state of the art in deep reinforcement learning. In particular, for continuous control, our method outperforms existing methods in terms of sample efficiency, resistance to premature convergence, and robustness to hyperparameter settings, while achieving similar or better final performance.
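
As a rough sketch of the kind of coordinate-ascent scheme referred to above (the specific form below is an illustration under standard assumptions about EM-style, relative-entropy policy search, not a restatement of the paper's exact updates), one alternates an E-step that reweights the current policy by exponentiated action values under a KL bound with an M-step that fits the parametric policy to the reweighted distribution:

\[
\text{E-step:}\quad q(a \mid s) \;\propto\; \pi(a \mid s;\theta_i)\,\exp\!\big(Q(s,a)/\eta\big),
\qquad \eta \text{ chosen so that } \mathbb{E}_s\!\left[\mathrm{KL}\!\big(q(\cdot \mid s)\,\big\|\,\pi(\cdot \mid s;\theta_i)\big)\right] \le \epsilon,
\]
\[
\text{M-step:}\quad \theta_{i+1} \;=\; \arg\max_{\theta}\; \mathbb{E}_{s}\,\mathbb{E}_{a \sim q(\cdot \mid s)}\!\left[\log \pi(a \mid s;\theta)\right]
\quad\text{s.t.}\quad \mathbb{E}_s\!\left[\mathrm{KL}\!\big(\pi(\cdot \mid s;\theta_i)\,\big\|\,\pi(\cdot \mid s;\theta)\big)\right] \le \epsilon_{\pi}.
\]

Here \(Q(s,a)\) denotes an action-value estimate and \(\epsilon\), \(\epsilon_{\pi}\) are KL bounds; these symbols are introduced here only to make the coordinate-ascent structure concrete.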
