Addressing Action Oscillations through Learning Policy Inertia

Deep reinforcement learning (DRL) algorithms have proven effective in a wide range of challenging decision-making and control tasks. However, these methods typically suffer from severe action oscillations, particularly in discrete action settings: agents select different actions in consecutive steps even though the corresponding states differ only slightly. This issue is often overlooked because policies are usually evaluated solely by their cumulative rewards. Action oscillation degrades the user experience and can even pose serious safety hazards, especially in real-world domains where safety is a primary concern, such as autonomous driving. To this end, we introduce the Policy Inertia Controller (PIC), a generic plug-in framework for off-the-shelf DRL algorithms that enables an adaptive trade-off between the optimality and the smoothness of the learned policy in a formal way. We propose Nested Policy Iteration as a general training algorithm for the PIC-augmented policy, which guarantees monotonically non-decreasing updates under mild conditions. We further derive a practical DRL algorithm, Nested Soft Actor-Critic. Experiments on a collection of autonomous driving tasks and several Atari games show that our approach substantially reduces action oscillation compared with a range of commonly adopted baselines, with almost no performance degradation.
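The notions of action oscillation and policy inertia can be made concrete with a small sketch. The code below is purely illustrative and is not the paper's Policy Inertia Controller or Nested Policy Iteration: the names `InertiaWrapper`, `q_values_fn`, `switch_margin`, and `oscillation_rate` are hypothetical, and the wrapper simply keeps the previous action unless switching looks clearly better, which captures the general smoothness-versus-optimality trade-off the abstract describes.

```python
import numpy as np


def oscillation_rate(actions):
    """Fraction of consecutive time steps on which the selected action changes;
    one simple way to quantify the oscillation issue described above."""
    changes = sum(a != b for a, b in zip(actions[:-1], actions[1:]))
    return changes / max(len(actions) - 1, 1)


class InertiaWrapper:
    """Hypothetical illustration of policy inertia for a discrete-action agent.

    Assumes access to a state -> Q-value-estimate function (q_values_fn) and a
    hand-tuned switching margin; it sticks with the previous action unless the
    currently greedy action is estimated to be better by at least that margin.
    """

    def __init__(self, q_values_fn, switch_margin=0.05):
        self.q_values_fn = q_values_fn
        self.switch_margin = switch_margin
        self.prev_action = None

    def act(self, state):
        q = self.q_values_fn(state)
        greedy = int(np.argmax(q))
        if self.prev_action is None:
            self.prev_action = greedy
        elif q[greedy] - q[self.prev_action] > self.switch_margin:
            # Switch only when the estimated gain clearly outweighs the
            # cost of oscillating; otherwise keep the previous action.
            self.prev_action = greedy
        return self.prev_action
```

In this toy setting, a larger `switch_margin` yields a smoother (lower `oscillation_rate`) but potentially less reward-optimal behavior, the trade-off that PIC is designed to handle adaptively and formally rather than via a fixed heuristic threshold.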
