Learning Long-Term Reward Redistribution via Randomized Return Decomposition

Many practical applications of reinforcement learning require agents to learn from sparse and delayed rewards, which challenges agents' ability to attribute their actions to future outcomes. In this paper, we consider the problem formulation of episodic reinforcement learning with trajectory feedback, in which the delay of reward signals is extreme: the agent obtains only a single reward signal at the end of each trajectory. A popular paradigm for this setting is to learn with a designed auxiliary dense reward function, called a proxy reward, instead of the sparse environmental signals. Based on this framework, this paper proposes a novel reward redistribution algorithm, randomized return decomposition (RRD), to learn a proxy reward function for episodic reinforcement learning. We establish a surrogate problem through Monte-Carlo sampling that scales least-squares-based reward redistribution to long-horizon problems. We analyze the surrogate loss function by connecting it with existing methods in the literature, which clarifies the algorithmic properties of our approach. In experiments, we extensively evaluate the proposed method on a variety of benchmark tasks with episodic rewards and demonstrate substantial improvements over baseline algorithms.
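To make the least-squares construction concrete, below is a minimal sketch (not the authors' released implementation) of a randomized return-decomposition loss for a single trajectory: a random subsequence of time steps is Monte-Carlo sampled, the per-step proxy rewards predicted on that subsequence are rescaled by the horizon to estimate the full predicted return, and the squared error against the observed episodic return is minimized. The names `ProxyReward`, `rrd_loss`, and `subseq_len` are illustrative, and continuous vector-valued actions are assumed.

```python
import torch
import torch.nn as nn


class ProxyReward(nn.Module):
    """Per-step proxy reward model r_theta(s, a) -> scalar (illustrative)."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def rrd_loss(model, obs, act, episodic_return, subseq_len=64):
    """Randomized return-decomposition surrogate loss for one trajectory.

    obs: (T, obs_dim), act: (T, act_dim), episodic_return: scalar tensor.
    Sampling a short subsequence keeps the per-update cost independent of
    the trajectory length T, which is what scales the least-squares
    redistribution to long horizons.
    """
    T = obs.shape[0]
    k = min(subseq_len, T)
    idx = torch.randperm(T)[:k]           # Monte-Carlo sample of k time steps
    pred = model(obs[idx], act[idx])      # (k,) predicted per-step proxy rewards
    est_return = T * pred.mean()          # unbiased estimate of sum_t r_theta(s_t, a_t)
    return (episodic_return - est_return) ** 2
```

In practice, trajectories would be drawn from a replay buffer, the loss averaged over a minibatch, and the learned proxy reward would then supply dense per-step signals to a standard off-policy learner in place of the sparse environmental reward.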
