Looking Back on the Actor–Critic Architecture

This retrospective describes the overall research project that gave rise to the authors’ paper “Neuronlike adaptive elements that can solve difficult learning control problems,” published in the 1983 Neural and Sensory Information Processing special issue of the IEEE Transactions on Systems, Man, and Cybernetics. It explains how the project came about, presents the ideas and previous publications that influenced it, and describes our most closely related subsequent research. It concludes by pointing out some noteworthy aspects of the 1983 article that have been eclipsed by its main contributions, and then comments on some of the directions and cautions that should inform future research.
