Analysing the Effects of Reward Shaping in Multi-Objective Stochastic Games

The majority of Multi-Agent Reinforcement Learning (MARL) implementations aim to optimise systems with respect to a single objective, despite the fact that many real-world problems are inherently multi-objective in nature. Research into multi-objective MARL is still in its infancy, and few studies to date have dealt with the issue of credit assignment. Reward shaping has been proposed as a means to address the credit assignment problem in single-objective MARL; however, it has been shown to alter the intended goals of the domain if misused, leading to unintended behaviour. Two popular shaping methods are Potential-Based Reward Shaping (PBRS) and difference rewards, and both have repeatedly been shown to improve learning speed and the quality of the joint policies learned by agents in single-objective problems. In this work we discuss the theoretical implications of applying these approaches to multi-objective problems, and evaluate their efficacy using a new multi-objective benchmark domain where the true Pareto optimal system utilities are known. Our work provides the first empirical evidence that agents using these shaping methodologies can sample true Pareto optimal solutions in multi-objective Stochastic Games.
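To make the two shaping methods concrete, the following Python sketch applies each of them to vector-valued (multi-objective) rewards. It assumes that shaping is applied independently to each objective component. Potential-based reward shaping adds F(s, s') = γΦ(s') − Φ(s) to the reward for every objective, and the difference reward for agent i is D_i = G(z) − G(z_{−i}), where z_{−i} is the joint action with agent i's contribution removed. The functions `pbrs`, `difference_rewards` and `toy_utility` are hypothetical, as is the toy domain itself (agents choosing among three resources of capacity 2, scored on throughput and overcrowding). They are illustrations only, not the benchmark domain or the implementation used in this work.

```python
from typing import Callable, List, Sequence, Tuple

# A reward (or utility) vector with one entry per objective.
Vector = Tuple[float, ...]


def pbrs(reward: Vector, phi_s: Vector, phi_next: Vector, gamma: float) -> Vector:
    """Potential-based shaping applied to each objective independently:
    r_k + gamma * Phi_k(s') - Phi_k(s).

    For a terminal s', pass a zero vector as phi_next (a common convention
    in episodic tasks)."""
    if not (len(reward) == len(phi_s) == len(phi_next)):
        raise ValueError("reward and potential vectors must have the same length")
    return tuple(r + gamma * pn - ps for r, ps, pn in zip(reward, phi_s, phi_next))


def difference_rewards(
    joint_action: Sequence[int],
    system_utility: Callable[[Sequence[int]], Vector],
) -> List[Vector]:
    """Per-objective difference rewards D_i = G(z) - G(z_-i), where the
    counterfactual z_-i simply removes agent i from the joint action."""
    g = system_utility(joint_action)
    rewards = []
    for i in range(len(joint_action)):
        without_i = list(joint_action[:i]) + list(joint_action[i + 1:])
        g_without_i = system_utility(without_i)
        rewards.append(tuple(a - b for a, b in zip(g, g_without_i)))
    return rewards


# Hypothetical toy domain: each agent picks one of N_RESOURCES resources.
N_RESOURCES = 3
CAPACITY = 2


def toy_utility(joint_action: Sequence[int]) -> Vector:
    """Two objectives: throughput (capped at CAPACITY per resource) and a
    negative overcrowding penalty for agents beyond capacity."""
    counts = [0] * N_RESOURCES
    for a in joint_action:
        counts[a] += 1
    throughput = sum(min(c, CAPACITY) for c in counts)
    overcrowding = -sum(max(0, c - CAPACITY) for c in counts)
    return (float(throughput), float(overcrowding))


if __name__ == "__main__":
    # Three agents crowd resource 0; one agent uses resource 1.
    print(difference_rewards([0, 0, 0, 1], toy_utility))
    # -> [(0.0, -1.0), (0.0, -1.0), (0.0, -1.0), (1.0, 0.0)]
    print(pbrs((1.0, -0.5), (0.0, 0.0), (2.0, 1.0), gamma=0.9))
```

In this example, the three agents sharing the over-capacity resource each receive no throughput credit and a penalty of −1 on the overcrowding objective. The lone agent on resource 1 receives a full unit of throughput credit. This shows how, under the stated assumptions, difference rewards isolate each agent's contribution to every objective separately.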
