Direct and indirect reinforcement learning

Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision-making and control tasks. In this paper, we classify RL methods as direct or indirect according to how they seek the optimal policy of a Markov decision process (MDP). Direct methods find the optimal policy by directly maximizing an objective function, usually the expectation of accumulated future rewards, with gradient ascent. Indirect methods find the optimal policy by solving the Bellman equation, the necessary and sufficient condition given by Bellman's principle of optimality. We study the internal relationship between the two classes using vanilla policy gradient and approximate policy iteration as representatives, and show that both direct and indirect methods can be unified in the actor-critic architecture and are equivalent if the stationary state distribution of the current policy is always chosen as the initial state distribution of the MDP. Finally, we classify current mainstream RL algorithms under this taxonomy and compare it with other classification criteria, including value-based versus policy-based and model-based versus model-free.
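
To make the direct class concrete, below is a minimal sketch of a REINFORCE-style vanilla policy gradient method: gradient ascent on the expected discounted return of a tabular softmax policy. The toy two-state MDP (`P`, `R`), the hyperparameters, and the episode horizon are all illustrative assumptions, not details from the paper.

```python
import numpy as np

# Toy 2-state, 2-action MDP used only for illustration (assumed, not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']: transition probabilities
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                  # R[s, a]: expected immediate reward
              [0.0, 2.0]])
gamma, alpha, horizon = 0.95, 0.05, 50     # assumed discount, step size, horizon
theta = np.zeros((2, 2))                   # logits of a tabular softmax policy

def policy(s):
    z = np.exp(theta[s] - theta[s].max())  # numerically stable softmax
    return z / z.sum()

rng = np.random.default_rng(0)
for episode in range(3000):
    s, traj = 0, []
    for t in range(horizon):               # roll out one truncated episode
        pi = policy(s)
        a = rng.choice(2, p=pi)
        traj.append((s, a, R[s, a]))
        s = rng.choice(2, p=P[s, a])
    grad, G = np.zeros_like(theta), 0.0
    for s_t, a_t, r_t in reversed(traj):   # discounted return-to-go G_t
        G = r_t + gamma * G
        score = -policy(s_t)               # grad of log softmax: e_{a_t} - pi
        score[a_t] += 1.0
        grad[s_t] += G * score             # Monte Carlo policy gradient sample
    theta += alpha * grad                  # gradient ascent on expected return
```

The update direction is the score function weighted by the sampled return-to-go, i.e. a Monte Carlo estimate of the policy gradient: the objective itself is optimized, and no Bellman equation is solved.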

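For the indirect class, a corresponding minimal sketch is value iteration on the same toy MDP: rather than ascending a return objective, it iterates the Bellman optimality backup to a fixed point and then reads off the greedy policy. The convergence threshold is an assumption chosen for illustration.

```python
import numpy as np

# Same toy MDP as in the policy-gradient sketch above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.8, 0.2], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.95

V = np.zeros(2)
while True:
    # Bellman optimality backup: V(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    Q = R + gamma * P @ V                  # Q[s, a], contraction over s'
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:   # stop near the fixed point
        break
    V = V_new
pi_star = Q.argmax(axis=1)                 # greedy policy from converged values
print("V* =", V, "greedy policy:", pi_star)
```

Here the optimal policy emerges indirectly, as a by-product of satisfying the Bellman optimality condition, which is the structural distinction the paper's taxonomy rests on.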