Analysis of Reward Functions in Deep Reinforcement Learning for Continuous State Space Control

Deep Reinforcement Learning (DRL), which uses deep neural networks to approximate the value function and the policy, has recently shown promising results on continuous state-space control tasks. However, deep neural network function approximators are non-convex, which makes the analysis of DRL algorithms largely intractable and leaves them without theoretical guarantees such as asymptotic global convergence of the learning algorithm. Since the reward function is one of the key entities that determines the overall behavior of a learning agent, we focus on a smaller but important aspect of the analysis: the structure of reward functions widely used in DRL tasks and their possible effects on the learning algorithm. The proposed analysis may facilitate the identification of appropriate reward functions for DRL tasks, a choice that has often been made by trial and error.
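
As an illustrative sketch (not taken from the paper itself), many continuous-control benchmarks, such as the MuJoCo locomotion tasks in OpenAI Gym, use rewards structured as "task progress minus a quadratic control cost". The function below is a hypothetical example of that common structure; the names and weight values are assumptions for illustration, not those of any specific environment.

```python
import numpy as np

def locomotion_reward(x_velocity: float, action: np.ndarray,
                      forward_weight: float = 1.0,
                      ctrl_weight: float = 0.1) -> float:
    """Illustrative reward in the style of Gym/MuJoCo locomotion tasks.

    Structure: a term rewarding task progress (forward velocity)
    minus a quadratic penalty on control effort. The weights here
    are hypothetical, not drawn from any particular benchmark.
    """
    forward_reward = forward_weight * x_velocity                 # progress term
    ctrl_cost = ctrl_weight * float(np.sum(np.square(action)))   # effort penalty
    return forward_reward - ctrl_cost
```

The shape of such a reward (e.g., a linear progress term versus a quadratic effort penalty) is exactly the kind of structural property whose effect on the learning algorithm the analysis investigates.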
