Learning to Shape Rewards using a Game of Switching Controls

Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge, which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimal Shaping Algorithm (ROSA), an automated RS framework in which the shaping-reward function is constructed in a novel Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine which states to add shaping rewards to, and their optimal values, while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which easily adopts existing RL algorithms, learns to construct a shaping-reward function that is tailored to the task, thus ensuring efficient convergence to high-performance policies. We demonstrate ROSA's congenial properties in three carefully designed experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse-reward environments.
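The Shaper/Controller interaction described above can be made concrete with a small sketch. The following is a minimal illustrative sketch, not the paper's implementation: it assumes a toy sparse-reward chain environment (ToyChainEnv), tabular Q-learning for both agents (QAgent), and a hand-fixed menu of shaping bonuses (SHAPING), whereas in ROSA the Shaper's switching decisions and shaping values are learned jointly in the Markov game. The key structural point it illustrates is that the Controller updates on the shaped reward while the Shaper is evaluated on the underlying task reward.

```python
# Minimal sketch of a Shaper/Controller loop (illustrative only; names are assumptions).
import random
from collections import defaultdict

class ToyChainEnv:
    """Sparse-reward chain: reward 1 only at the final state."""
    def __init__(self, length=10):
        self.length, self.state = length, 0
    def reset(self):
        self.state = 0
        return self.state
    def step(self, action):  # action in {0: left, 1: right}
        self.state = max(0, min(self.length - 1, self.state + (1 if action == 1 else -1)))
        done = self.state == self.length - 1
        return self.state, (1.0 if done else 0.0), done

class QAgent:
    """Epsilon-greedy tabular Q-learning agent."""
    def __init__(self, n_actions, lr=0.1, gamma=0.95, eps=0.1):
        self.q = defaultdict(lambda: [0.0] * n_actions)
        self.n_actions, self.lr, self.gamma, self.eps = n_actions, lr, gamma, eps
    def act(self, s):
        if random.random() < self.eps:
            return random.randrange(self.n_actions)
        return max(range(self.n_actions), key=lambda a: self.q[s][a])
    def update(self, s, a, r, s2, done):
        target = r + (0.0 if done else self.gamma * max(self.q[s2]))
        self.q[s][a] += self.lr * (target - self.q[s][a])

env = ToyChainEnv()
controller = QAgent(n_actions=2)   # learns the task policy from shaped rewards
shaper = QAgent(n_actions=3)       # switching control: 0 = no shaping, 1/2 = add a bonus/penalty
SHAPING = {0: 0.0, 1: 0.1, 2: -0.1}

for episode in range(500):
    s, done = env.reset(), False
    while not done:
        a = controller.act(s)
        k = shaper.act(s)                          # Shaper decides whether (and how) to shape this state
        s2, r_env, done = env.step(a)
        r_shaped = r_env + SHAPING[k]
        controller.update(s, a, r_shaped, s2, done)  # Controller trains on the shaped reward
        shaper.update(s, k, r_env, s2, done)         # Shaper is rewarded by task performance only
        s = s2
```

A practical design note reflected in the sketch: because the Shaper only receives the environment reward, it has no incentive to add shaping that distracts the Controller from the task, which mirrors the paper's claim that the learned shaping-reward function is tailored to the task.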
