Deep Reinforcement Learning with Robust and Smooth Policy

Deep reinforcement learning (RL) has achieved great empirical success in various domains. However, the large search space of neural network policies requires a large amount of data, which makes current RL algorithms sample-inefficient. Motivated by the fact that many environments with continuous state spaces have smooth transitions, we propose to learn a policy that behaves smoothly with respect to states. We develop a new framework, \textbf{S}mooth \textbf{R}egularized \textbf{R}einforcement \textbf{L}earning ($\textbf{SR}^2\textbf{L}$), in which the policy is trained with smoothness-inducing regularization. Such regularization effectively constrains the search space and enforces smoothness in the learned policy. Moreover, the proposed framework can also improve the robustness of the policy against measurement error in the state space, and can be naturally extended to the distributionally robust setting. We apply the proposed framework to both an on-policy algorithm (TRPO) and an off-policy algorithm (DDPG). Through extensive experiments, we demonstrate that our method achieves improved sample efficiency and robustness.
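
To make the idea of smoothness-inducing regularization concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy deterministic policy, a squared Euclidean distance between actions, and an $\ell_\infty$ ball of radius `epsilon` around each state; the worst-case perturbation is approximated by random sampling purely for brevity. All function and parameter names (`policy`, `smoothness_penalty`, `regularized_objective`, `lam`, `num_samples`) are hypothetical.

```python
import numpy as np

def policy(params, states):
    """Toy deterministic policy (an assumption): tanh(states @ W + b)."""
    W, b = params
    return np.tanh(states @ W + b)

def smoothness_penalty(params, states, epsilon, num_samples=16, rng=None):
    """Average, over states, of the largest sampled change in the action
    when the state is perturbed inside an l_inf ball of radius epsilon.

    Random sampling is a crude stand-in for the inner maximization;
    it only lower-bounds the true worst case.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    base = policy(params, states)                 # shape: (batch, action_dim)
    worst = np.zeros(states.shape[0])
    for _ in range(num_samples):
        delta = rng.uniform(-epsilon, epsilon, size=states.shape)
        perturbed = policy(params, states + delta)
        dist = np.sum((base - perturbed) ** 2, axis=1)  # squared distance per state
        worst = np.maximum(worst, dist)
    return float(worst.mean())

def regularized_objective(task_loss, params, states, epsilon, lam):
    """Task loss plus lam times the smoothness penalty."""
    return task_loss + lam * smoothness_penalty(params, states, epsilon)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    params = (rng.normal(size=(3, 2)), np.zeros(2))
    states = rng.normal(size=(5, 3))
    print(regularized_objective(1.0, params, states, epsilon=0.1, lam=0.5))
```

In this sketch the penalty is zero for a policy that ignores the state and grows as nearby states map to more distant actions, which is the sense in which such a term constrains the search space toward smooth policies.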
