RUDDER: Return Decomposition for Delayed Rewards

We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis, which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. On Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, most prominently for games with delayed rewards. Source code is available at this https URL and demonstration videos at this https URL.
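
To make the regression view concrete, below is a minimal sketch of how return decomposition and reward redistribution can be realized: an LSTM regresses the episode return from state-action prefixes, and redistributed rewards are obtained as differences of consecutive return predictions, one way to perform contribution analysis. This assumes PyTorch; the `ReturnPredictor` class, the `redistribute_rewards` helper, the toy episodes, and all hyperparameters are illustrative assumptions rather than the authors' reference implementation.

```python
# Illustrative sketch of RUDDER-style return decomposition (assumptions, not
# the paper's code): a return predictor is trained as a regression model on
# complete episodes, and rewards are redistributed via prediction differences.

import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    """LSTM that regresses the episode return from state-action prefixes."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, states, actions):
        # states: (B, T, obs_dim), actions: (B, T, act_dim)
        x = torch.cat([states, actions], dim=-1)
        h, _ = self.lstm(x)                  # (B, T, hidden)
        return self.head(h).squeeze(-1)      # (B, T): return prediction per prefix

def redistribute_rewards(predictor, states, actions):
    """Contribution analysis via prediction differences:
    r_hat_t = g(s_{0:t}, a_{0:t}) - g(s_{0:t-1}, a_{0:t-1})."""
    with torch.no_grad():
        g = predictor(states, actions)                            # (B, T)
        prev = torch.cat([torch.zeros_like(g[:, :1]), g[:, :-1]], dim=1)
        return g - prev                                           # redistributed rewards

# --- toy training loop on random episodes with a delayed reward ---
B, T, obs_dim, act_dim = 32, 50, 8, 2
predictor = ReturnPredictor(obs_dim, act_dim)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

for step in range(200):
    states = torch.randn(B, T, obs_dim)
    actions = torch.randn(B, T, act_dim)
    # Toy delayed reward: the return depends on early actions but is paid at the end.
    returns = actions[:, :5].sum(dim=(1, 2))                      # (B,)
    pred = predictor(states, actions)                             # (B, T)
    # Regression task: the final prefix prediction is pushed toward the realized
    # return; earlier prefix predictions serve as auxiliary targets.
    loss = ((pred[:, -1] - returns) ** 2).mean() \
           + 0.1 * ((pred - returns[:, None]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

r_hat = redistribute_rewards(predictor, states, actions)          # (B, T)
# r_hat telescopes: it sums exactly to the final return prediction, so credit
# for the delayed reward is moved onto the earlier, decisive actions.
```

Used this way, the redistributed reward sequence shifts credit from the delayed terminal reward onto the decisive earlier steps, which is the effect the abstract describes: pushing the expected future rewards toward zero so that Q-value estimation reduces to estimating immediate rewards.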
