Bridging the Gap Between Value and Policy Based Reinforcement Learning

We establish a new connection between value- and policy-based reinforcement learning (RL), based on a relationship between softmax temporal value consistency and policy optimality under entropy regularization. Specifically, we show that softmax-consistent action values correspond to optimal entropy-regularized policy probabilities along any action sequence, regardless of provenance. From this observation, we develop a new RL algorithm, Path Consistency Learning (PCL), which minimizes a notion of soft consistency error along multi-step action sequences extracted from both on- and off-policy traces. We examine the behavior of PCL in different scenarios and show that it can be interpreted as generalizing both actor-critic and Q-learning algorithms. We then deepen the relationship by showing how a single model can represent both a policy and the corresponding softmax state values, eliminating the need for a separate critic. Experimental evaluation demonstrates that PCL significantly outperforms strong actor-critic and Q-learning baselines across several benchmarks.
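To make the central identity concrete, here is a sketch of the multi-step softmax consistency relation the abstract refers to; the notation (discount $\gamma$, entropy weight $\tau$, optimal soft value $V^*$ and policy $\pi^*$) is ours and follows standard entropy-regularized RL conventions. Along any $d$-step sub-trajectory $(s_t, a_t, \ldots, s_{t+d})$,

$$
V^*(s_t) \;=\; \gamma^{d}\, V^*(s_{t+d}) \;+\; \sum_{i=0}^{d-1} \gamma^{i} \big( r(s_{t+i}, a_{t+i}) - \tau \log \pi^*(a_{t+i} \mid s_{t+i}) \big),
$$

and PCL trains parameterized $V_\phi$ and $\pi_\theta$ by minimizing the squared residual of this identity over sub-trajectories drawn from both on-policy rollouts and a replay buffer. Because the identity holds along any action sequence, the residual is a valid training signal regardless of which policy generated the data.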

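The squared consistency error itself is simple to compute. Below is a minimal NumPy sketch under the notation above; the function name `pcl_loss`, the argument layout, and the default hyperparameters are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def pcl_loss(values, log_pis, rewards, gamma=0.99, tau=0.1):
    """Squared soft-consistency error for one d-step sub-trajectory.

    values  : array of shape [d+1], V_phi(s_t), ..., V_phi(s_{t+d})
    log_pis : array of shape [d],   log pi_theta(a_{t+i} | s_{t+i})
    rewards : array of shape [d],   r(s_{t+i}, a_{t+i})
    All names and defaults here are illustrative, not from the paper.
    """
    d = len(rewards)
    discounts = gamma ** np.arange(d)
    # Discounted sum of entropy-regularized rewards along the path.
    path_return = np.sum(discounts * (rewards - tau * log_pis))
    # Soft consistency residual: zero under the optimal V* and pi*.
    residual = -values[0] + gamma**d * values[-1] + path_return
    return 0.5 * residual**2
```

In the unified variant mentioned in the abstract, `values` would be derived from the same network that produces `log_pis`, so minimizing this residual trains a single model rather than a separate actor and critic.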