Distributed Artificial Intelligence: Second International Conference, DAI 2020, Nanjing, China, October 24–27, 2020, Proceedings

Many real-world domains involve multiple agents behaving strategically under probabilistic transitions and uncertain (potentially infinite) duration. Such settings can be modeled as stochastic games. While algorithms exist for solving two-player zero-sum stochastic games (i.e., computing a game-theoretic solution concept such as Nash equilibrium), research on algorithms for non-zero-sum and multiplayer stochastic games is limited. We present a new algorithm for these settings, which constitutes the first parallel algorithm for multiplayer stochastic games. We present experimental results on a four-player stochastic game motivated by a naval strategic planning scenario, showing that our algorithm quickly computes strategies that constitute a Nash equilibrium up to a very small approximation error.
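
To make the notion of "Nash equilibrium up to a small approximation error" concrete, the following minimal Python sketch shows how such an error (epsilon) can be measured for a given strategy profile in a discounted multiplayer stochastic game. This is our own illustration, not the paper's algorithm; the game tensors, the tensor layout, and helper names such as `induced_mdp` and `nash_epsilon` are all hypothetical. For each player, we hold the other players' strategies fixed, solve the resulting single-agent MDP by value iteration to get the best-response value, and compare it with the value of following the profile.

```python
import numpy as np
from itertools import product

def induced_mdp(P, R, sigma, i):
    """Marginalize transitions/rewards over the other players' strategies.

    P: transitions, shape (S, A_0, ..., A_{n-1}, S)
    R: rewards, shape (n, S, A_0, ..., A_{n-1})
    sigma: list of per-player stationary policies, sigma[j] has shape (S, A_j)
    Returns P_i of shape (S, A_i, S) and R_i of shape (S, A_i) for player i.
    """
    n = len(sigma)
    S = P.shape[0]
    A = [sigma[j].shape[1] for j in range(n)]
    P_i = np.zeros((S, A[i], S))
    R_i = np.zeros((S, A[i]))
    for s in range(S):
        for joint in product(*[range(a) for a in A]):
            # Probability that the other players jointly play their part of `joint`.
            w = np.prod([sigma[j][s, joint[j]] for j in range(n) if j != i])
            P_i[s, joint[i]] += w * P[(s, *joint)]
            R_i[s, joint[i]] += w * R[(i, s, *joint)]
    return P_i, R_i

def best_response_value(P_i, R_i, gamma, iters=1000):
    """Value iteration on player i's induced MDP (others' strategies fixed)."""
    V = np.zeros(P_i.shape[0])
    for _ in range(iters):
        Q = R_i + gamma * (P_i @ V)   # shape (S, A_i)
        V = Q.max(axis=1)
    return V

def profile_value(P_i, R_i, sigma_i, gamma, iters=1000):
    """Value of following sigma_i in the induced MDP."""
    V = np.zeros(P_i.shape[0])
    for _ in range(iters):
        Q = R_i + gamma * (P_i @ V)
        V = (sigma_i * Q).sum(axis=1)
    return V

def nash_epsilon(P, R, sigma, gamma):
    """Largest gain any single player achieves by deviating unilaterally."""
    eps = 0.0
    for i in range(len(sigma)):
        P_i, R_i = induced_mdp(P, R, sigma, i)
        V_br = best_response_value(P_i, R_i, gamma)
        V_sig = profile_value(P_i, R_i, sigma[i], gamma)
        eps = max(eps, float((V_br - V_sig).max()))
    return eps
```

Given game tensors `P` and `R` and a candidate profile `sigma` for a small game, `nash_epsilon(P, R, sigma, gamma=0.95)` returns the largest unilateral deviation gain over all players and states; a profile is an epsilon-Nash equilibrium precisely when this value is at most epsilon.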
