Decentralized Reinforcement Learning: Global Decision-Making via Local Economic Transactions

This paper seeks to establish a framework for directing a society of simple, specialized, self-interested agents to solve what are traditionally posed as monolithic single-agent sequential decision problems. What makes it challenging to use a decentralized approach to collectively optimize a central objective is the difficulty of characterizing the equilibrium strategy profile of non-cooperative games. To overcome this challenge, we design a mechanism for defining the learning environment of each agent in which the optimal solution to the global objective is known to coincide with a Nash equilibrium strategy profile of the agents optimizing their own local objectives. The society functions as an economy of agents that learn the credit assignment process itself by buying and selling to each other the right to operate on the environment state. We derive a class of decentralized reinforcement learning algorithms that are broadly applicable not only to standard reinforcement learning but also to selecting options in semi-MDPs and to dynamically composing computation graphs. Finally, we demonstrate the potential advantages of a society's inherent modular structure for more efficient transfer learning.
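
The abstract describes agents that trade the right to operate on the environment state. The sketch below illustrates one way a single such market step could look, assuming a second-price (Vickrey-style) auction in which the winning bidder transforms the state and its payment goes to the winner of the previous round. The Agent class, the market_step function, and the auction rule are illustrative assumptions for this sketch, not the paper's exact mechanism.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's exact mechanism) of one
# "market" step in a society of bidding agents. Each agent bids for the right
# to transform the current environment state; the highest bidder wins, acts,
# and -- under the second-price (Vickrey-style) rule assumed here -- pays the
# second-highest bid to the seller of the state, i.e., the previous winner.

class Agent:
    def __init__(self, name, bid_fn, transform_fn):
        self.name = name
        self.bid_fn = bid_fn              # state -> scalar bid (learned valuation)
        self.transform_fn = transform_fn  # state -> next state (the agent's primitive)
        self.wealth = 0.0                 # running account of payments made and received

    def bid(self, state):
        return float(self.bid_fn(state))

    def act(self, state):
        return self.transform_fn(state)


def market_step(agents, state, previous_winner=None):
    """Auction the right to act on `state`; return the next state and the winner."""
    bids = np.array([agent.bid(state) for agent in agents])
    order = np.argsort(bids)[::-1]        # indices sorted by bid, highest first
    winner = agents[order[0]]
    price = bids[order[1]] if len(agents) > 1 else 0.0  # second-price payment (assumed)

    winner.wealth -= price                # the buyer pays for the right to act
    if previous_winner is not None:
        previous_winner.wealth += price   # the seller of the current state is paid

    return winner.act(state), winner
```

Under a scheme like this, an agent's revenue depends on how much the next round of bidders values the state it produces, which is one way the credit assignment described above could propagate backward through the chain of local transactions; each agent could then update its bidding policy with a standard policy-gradient method, treating its net payments as reward.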
