RUDDER: Return Decomposition for Delayed Rewards

We propose RUDDER, a novel reinforcement learning approach for delayed rewards in finite Markov decision processes (MDPs). In MDPs the Q-values are equal to the expected immediate reward plus the expected future rewards. The latter are related to bias problems in temporal difference (TD) learning and to high variance problems in Monte Carlo (MC) learning. Both problems are even more severe when rewards are delayed. RUDDER aims at making the expected future rewards zero, which simplifies Q-value estimation to computing the mean of the immediate reward. We propose the following two new concepts to push the expected future rewards toward zero. (i) Reward redistribution that leads to return-equivalent decision processes with the same optimal policies and, when optimal, zero expected future rewards. (ii) Return decomposition via contribution analysis, which transforms the reinforcement learning task into a regression task at which deep learning excels. On artificial tasks with delayed rewards, RUDDER is significantly faster than MC and exponentially faster than Monte Carlo Tree Search (MCTS), TD(λ), and reward shaping approaches. On Atari games, RUDDER on top of a Proximal Policy Optimization (PPO) baseline improves the scores, most prominently for games with delayed rewards. Source code is available at this https URL and demonstration videos at this https URL.
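
To make the regression view concrete, below is a minimal sketch of how return decomposition and reward redistribution can be realized: an LSTM regresses the episode return from state-action prefixes, and redistributed rewards are obtained as differences of consecutive return predictions, one way to perform contribution analysis. This assumes PyTorch; the `ReturnPredictor` class, the `redistribute_rewards` helper, the toy episodes, and all hyperparameters are illustrative assumptions rather than the authors' reference implementation.

```python
# Illustrative sketch of RUDDER-style return decomposition (assumptions, not
# the paper's code): a return predictor is trained as a regression model on
# complete episodes, and rewards are redistributed via prediction differences.

import torch
import torch.nn as nn

class ReturnPredictor(nn.Module):
    """LSTM that regresses the episode return from state-action prefixes."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, states, actions):
        # states: (B, T, obs_dim), actions: (B, T, act_dim)
        x = torch.cat([states, actions], dim=-1)
        h, _ = self.lstm(x)                  # (B, T, hidden)
        return self.head(h).squeeze(-1)      # (B, T): return prediction per prefix

def redistribute_rewards(predictor, states, actions):
    """Contribution analysis via prediction differences:
    r_hat_t = g(s_{0:t}, a_{0:t}) - g(s_{0:t-1}, a_{0:t-1})."""
    with torch.no_grad():
        g = predictor(states, actions)                            # (B, T)
        prev = torch.cat([torch.zeros_like(g[:, :1]), g[:, :-1]], dim=1)
        return g - prev                                           # redistributed rewards

# --- toy training loop on random episodes with a delayed reward ---
B, T, obs_dim, act_dim = 32, 50, 8, 2
predictor = ReturnPredictor(obs_dim, act_dim)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

for step in range(200):
    states = torch.randn(B, T, obs_dim)
    actions = torch.randn(B, T, act_dim)
    # Toy delayed reward: the return depends on early actions but is paid at the end.
    returns = actions[:, :5].sum(dim=(1, 2))                      # (B,)
    pred = predictor(states, actions)                             # (B, T)
    # Regression task: the final prefix prediction is pushed toward the realized
    # return; earlier prefix predictions serve as auxiliary targets.
    loss = ((pred[:, -1] - returns) ** 2).mean() \
           + 0.1 * ((pred - returns[:, None]) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

r_hat = redistribute_rewards(predictor, states, actions)          # (B, T)
# r_hat telescopes: it sums exactly to the final return prediction, so credit
# for the delayed reward is moved onto the earlier, decisive actions.
```

Used this way, the redistributed reward sequence shifts credit from the delayed terminal reward onto the decisive earlier steps, which is the effect the abstract describes: pushing the expected future rewards toward zero so that Q-value estimation reduces to estimating immediate rewards.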
