Multi-Agent Advisor Q-Learning

In the last decade, there have been significant advances in multi-agent reinforcement learning (MARL), but challenges such as high sample complexity and slow convergence to stable policies must still be overcome before widespread deployment is possible. However, many real-world environments already deploy suboptimal or heuristic approaches for generating policies. A natural question is how best to use such approaches as advisors to improve reinforcement learning in multi-agent domains. In this paper, we provide a principled framework for incorporating action recommendations from online suboptimal advisors in multi-agent settings. We describe the problem of ADvising Multiple Intelligent Reinforcement Agents (ADMIRAL) in nonrestrictive general-sum stochastic game environments and present two novel Q-learning-based algorithms: ADMIRAL Decision Making (ADMIRAL-DM), which improves learning by appropriately incorporating advice from an advisor, and ADMIRAL Advisor Evaluation (ADMIRAL-AE), which evaluates the effectiveness of a given advisor. We analyze the algorithms theoretically and provide fixed-point guarantees regarding their learning in general-sum stochastic games. Furthermore, extensive experiments illustrate that these algorithms can be used in a variety of environments, compare favourably to related baselines, scale to large state-action spaces, and are robust to poor advice from advisors.
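
To make the idea of advisor-guided Q-learning concrete, the sketch below shows one simple way an external advisor's action recommendations could be folded into a tabular Q-learning loop. This is a minimal illustration under stated assumptions, not the paper's ADMIRAL-DM or ADMIRAL-AE algorithms: the `advisor_guided_q_learning` function, the decaying `advisor_prob` schedule, and the Gym-style single-agent environment API are all illustrative choices made here for clarity.

```python
import random
from collections import defaultdict

def advisor_guided_q_learning(env, advisor, episodes=500, alpha=0.1,
                              gamma=0.99, epsilon=0.1, advisor_prob=0.5,
                              decay=0.995):
    """Tabular Q-learning that sometimes follows an external advisor.

    Assumptions (not from the paper): `env` exposes a classic Gym-style
    reset()/step() API with hashable discrete states and a discrete
    `action_space`; `advisor(state)` returns a recommended action.
    `advisor_prob` decays each episode so the learner relies on the
    advisor less as its own value estimates improve.
    """
    q = defaultdict(float)              # maps (state, action) -> value estimate
    actions = list(range(env.action_space.n))

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            r = random.random()
            if r < advisor_prob:
                action = advisor(state)                 # follow the advisor
            elif r < advisor_prob + epsilon:
                action = random.choice(actions)         # explore independently
            else:
                action = max(actions, key=lambda a: q[(state, a)])  # exploit

            next_state, reward, done, _ = env.step(action)

            # Standard one-step Q-learning backup toward the greedy target.
            best_next = max(q[(next_state, a)] for a in actions)
            q[(state, action)] += alpha * (
                reward + gamma * (0.0 if done else best_next) - q[(state, action)]
            )
            state = next_state

        advisor_prob *= decay           # trust the advisor less over time
    return q
```

Under these assumptions, the advisor accelerates early exploration while the decaying schedule lets the learner eventually outgrow poor advice, which is in the spirit of the robustness property claimed above.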
