Asynchronous Coagent Networks: Stochastic Networks for Reinforcement Learning without Backpropagation or a Clock

Coagent policy gradient algorithms (CPGAs) are reinforcement learning algorithms for training a class of stochastic neural networks called coagent networks. In this work, we prove that CPGAs converge to locally optimal policies. Additionally, we extend prior theory to encompass asynchronous and recurrent coagent networks. These extensions facilitate the straightforward design and analysis of hierarchical reinforcement learning algorithms like the option-critic, and eliminate the need for complex derivations of customized learning rules for these algorithms.
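To make the "without backpropagation" idea concrete, below is a minimal, illustrative sketch (not the paper's algorithm or code) of a coagent-style update on a toy contextual-bandit task: each stochastic unit treats the rest of the network as part of its environment and applies a local REINFORCE-style update from the shared reward, so no gradients flow between units. All class and variable names here are hypothetical.

```python
# Illustrative sketch of local coagent policy-gradient updates (assumed setup,
# not the paper's implementation): each coagent updates with its own score
# function and the shared reward; no backpropagation between coagents.
import numpy as np

rng = np.random.default_rng(0)

class BernoulliCoagent:
    """A single stochastic unit with a logistic policy over {0, 1}."""
    def __init__(self, n_inputs, lr=0.1):
        self.w = np.zeros(n_inputs)
        self.lr = lr

    def act(self, x):
        p = 1.0 / (1.0 + np.exp(-self.w @ x))
        a = rng.random() < p
        # Local score function: d/dw log pi(a | x)
        self.score = (float(a) - p) * x
        return float(a)

    def update(self, reward):
        # REINFORCE-style update using only the local score and shared reward.
        self.w += self.lr * reward * self.score

# Two-layer coagent network on a toy task (hypothetical): reward is 1 when the
# output unit's action matches the first feature of the state.
hidden = [BernoulliCoagent(n_inputs=3) for _ in range(2)]
output = BernoulliCoagent(n_inputs=2)

for episode in range(2000):
    state = rng.integers(0, 2, size=3).astype(float)
    h = np.array([c.act(state) for c in hidden])
    a = output.act(h)
    r = 1.0 if a == state[0] else 0.0
    for c in hidden + [output]:
        c.update(r)  # every coagent uses the same global reward signal
```

The asynchronous extension studied in the paper relaxes the assumption, implicit in this sketch, that all coagents act once per time step in a fixed order.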
