Real-Time Reinforcement Learning

Markov Decision Processes (MDPs), the mathematical framework underlying most reinforcement learning (RL) algorithms, are often used in a way that wrongly assumes the state of an agent's environment does not change during action selection. As RL systems based on MDPs begin to find application in real-world, safety-critical settings, this mismatch between the assumptions of the classical MDP formulation and the reality of real-time computation may lead to undesirable outcomes. In this paper, we introduce a new framework in which states and actions evolve simultaneously, and we show how it relates to the classical MDP formulation. We analyze existing algorithms under the new real-time formulation and show why they are suboptimal when used in real time. We then use those insights to develop a new algorithm, Real-Time Actor-Critic (RTAC), which outperforms the existing state-of-the-art continuous-control algorithm Soft Actor-Critic (SAC) in both real-time and non-real-time settings.
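
To make the real-time interaction concrete, the sketch below is a minimal illustration, not the paper's exact construction: it treats acting in real time as a one-step action pipeline, augments the observation with the action currently being executed, and applies each newly selected action only on the following transition, so the state keeps evolving while the agent computes its next action. The wrapper class, the fixed initial action, and the `reset`/`step` environment interface are assumptions made for the example.

```python
import numpy as np

class RealTimeWrapper:
    """Illustrative sketch: turns a turn-based environment exposing
    reset() and step(action) into a "real-time" one by applying the
    *previously* selected action at each step, so the action chosen at
    time t only influences the transition at t+1. Observations are
    augmented with the action currently in flight."""

    def __init__(self, env, initial_action):
        self.env = env
        # Action assumed to be executing before the agent's first decision.
        self.initial_action = np.asarray(initial_action, dtype=np.float64)
        self.pending_action = self.initial_action

    def reset(self):
        obs = np.asarray(self.env.reset(), dtype=np.float64)
        self.pending_action = self.initial_action
        return np.concatenate([obs, self.pending_action])

    def step(self, action):
        # The environment advances under the action selected one step ago,
        # while the freshly computed `action` is queued for the next
        # transition -- states and actions evolve concurrently.
        obs, reward, done, info = self.env.step(self.pending_action)
        self.pending_action = np.asarray(action, dtype=np.float64)
        obs = np.concatenate([np.asarray(obs, dtype=np.float64),
                              self.pending_action])
        return obs, reward, done, info
```

An agent trained in such a wrapped environment must condition on the previous action to act well, which is the dependence the real-time formulation makes explicit and which RTAC is designed to exploit.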
