Multi-Preference Actor Critic

Policy gradient algorithms typically combine discounted future rewards with an estimated value function to compute the direction and magnitude of parameter updates. However, for most reinforcement learning tasks, humans can provide additional insight to constrain the policy during learning. We introduce a general method for incorporating multiple different feedback channels into a single policy gradient loss. In our formulation, the Multi-Preference Actor Critic (M-PAC), each type of feedback is implemented as a constraint on the policy. We use a Lagrangian relaxation to satisfy these constraints via gradient descent while learning a policy that maximizes rewards. Experiments in Atari and Pendulum verify that the constraints are respected and can accelerate learning.
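
As a minimal sketch of the idea (notation ours, not taken verbatim from the paper): let J(theta) denote the expected discounted return of the policy pi_theta, and let each feedback channel i be summarized by a constraint function C_i(theta) <= 0. A Lagrangian relaxation of the constrained problem can then be optimized in primal-dual fashion, with gradient steps on the policy parameters theta and projected gradient steps on the multipliers lambda_i:

    \min_{\lambda \ge 0} \; \max_{\theta} \; \mathcal{L}(\theta, \lambda)
        \;=\; J(\theta) \;-\; \sum_i \lambda_i \, C_i(\theta),
    \qquad
    \lambda_i \;\leftarrow\; \max\bigl(0,\; \lambda_i + \eta \, C_i(\theta)\bigr).

Here eta is an assumed multiplier step size: each lambda_i grows while its constraint is violated (C_i(theta) > 0) and decays toward zero once it is satisfied, so a single loss trades off reward maximization against all feedback channels.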
