Policy invariance under reward transformations for multi-objective reinforcement learning

Reinforcement learning (RL) is a powerful and well-studied machine learning paradigm in which an agent learns to improve its performance in an environment by maximising a reward signal. In multi-objective reinforcement learning (MORL), the reward signal is a vector, where each component represents the performance on a different objective. Reward shaping is a well-established family of techniques that has been successfully used to improve the performance and learning speed of RL agents in single-objective problems. The basic premise of reward shaping is to add a shaping reward to the reward naturally received from the environment, in order to incorporate domain knowledge and guide an agent's exploration. Potential-based reward shaping (PBRS) is a specific form of reward shaping that offers additional guarantees. In this paper, we extend the theoretical guarantees of PBRS to MORL problems. Specifically, we prove that PBRS does not alter the true Pareto front in either single- or multi-agent MORL. We also contribute the first published empirical studies of the effect of PBRS in single- and multi-agent MORL problems.
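To make the premise concrete, the following is a minimal sketch of PBRS in tabular Q-learning, in the single-objective form due to Ng et al. [4]: the shaping term F(s, s') = γΦ(s') − Φ(s) is added to the environment reward, which leaves the optimal policy unchanged. The chain environment, potential function Φ, and learning parameters below are illustrative assumptions, not taken from the paper.

```python
import random

GAMMA = 0.9   # discount factor
ALPHA = 0.1   # learning rate
GOAL = 4      # states 0..4 on a 1-D chain; reaching state 4 ends the episode

def potential(state):
    # Domain knowledge encoded as a potential: states nearer the goal
    # get higher potential.
    return float(state)

def shaping_reward(s, s_next):
    # F(s, s') = gamma * Phi(s') - Phi(s): the potential-based form that
    # guarantees policy invariance.
    return GAMMA * potential(s_next) - potential(s)

def step(s, a):
    # Toy deterministic chain: action 1 moves right, action 0 moves left.
    s_next = min(GOAL, s + 1) if a == 1 else max(0, s - 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

q = {(s, a): 0.0 for s in range(GOAL + 1) for a in (0, 1)}
random.seed(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        # Epsilon-greedy action selection.
        if random.random() < 0.1:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda x: q[(s, x)])
        s_next, r, done = step(s, a)
        # The agent learns from the environment reward plus the shaping term.
        bootstrap = 0.0 if done else GAMMA * max(q[(s_next, 0)], q[(s_next, 1)])
        target = r + shaping_reward(s, s_next) + bootstrap
        q[(s, a)] += ALPHA * (target - q[(s, a)])
        s = s_next

# With this potential, the greedy policy moves right in every non-goal state.
print(all(q[(s, 1)] > q[(s, 0)] for s in range(GOAL)))
```

In the multi-objective setting studied in the paper, the reward and potential become vectors and the same shaping term is applied componentwise; the contribution of the paper is the proof that this leaves the true Pareto front unchanged.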
