Actor-Critic Reinforcement Learning with Neural Networks in Continuous Games

Reinforcement learning agents with artificial neural networks have previously been shown to acquire human-level dexterity in discrete video game environments, where only the current state of the game and a reward are given at each time step. This paper focuses on the harder problem of continuous environments, in which the states, observations, and actions are all continuous. The Continuous Actor-Critic Learning Automaton (CACLA) is applied to a 2D aerial combat simulation environment with continuous state and action spaces, where both the Actor and the Critic are implemented as multilayer perceptrons. For this game environment it is shown that: 1) exploration of CACLA's action space improves strongly when Gaussian noise is replaced by an Ornstein-Uhlenbeck process; 2) a novel Monte Carlo variant of CACLA, introduced here, turns out to be inferior to the original CACLA; 3) the insights gained from this variant lead to a modified version of CACLA that relies on a third multilayer perceptron to estimate the absolute error of the Critic, which is used to correct the Actor's learning rule. This Corrected CACLA is able to outperform the original CACLA algorithm.
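To make the ingredients named in the abstract concrete, the following is a minimal sketch (not the authors' code) of a CACLA-style update loop with Ornstein-Uhlenbeck exploration noise and an additional model of the Critic's absolute error. Linear models stand in for the paper's multilayer perceptrons, the environment is a toy placeholder, and the step sizes as well as the exact gating rule used by the "Corrected CACLA" Actor update are assumptions made for illustration only.

```python
# Sketch of CACLA with OU exploration and an absolute-error model (assumptions noted below).
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim = 4, 2
gamma = 0.99
alpha_critic, alpha_actor, alpha_err = 0.01, 0.01, 0.01

# Linear stand-ins for the three multilayer perceptrons (an assumption for brevity).
W_critic = np.zeros(state_dim)                # V(s): state value
W_actor  = np.zeros((action_dim, state_dim))  # pi(s): deterministic action
W_abserr = np.zeros(state_dim)                # estimate of E[|delta|](s), the Critic's absolute error

def ou_noise(prev, theta=0.15, sigma=0.2):
    """One Euler step of an Ornstein-Uhlenbeck process (temporally correlated, mean-reverting noise)."""
    return prev + theta * (0.0 - prev) + sigma * rng.standard_normal(action_dim)

def toy_env_step(state, action):
    """Placeholder environment: random next state, simple reward (assumption, not the combat simulator)."""
    return rng.standard_normal(state_dim), float(-np.linalg.norm(action))

state = rng.standard_normal(state_dim)
noise = np.zeros(action_dim)
for t in range(1000):
    noise = ou_noise(noise)                    # OU noise instead of independent Gaussian noise
    action = W_actor @ state + noise
    next_state, reward = toy_env_step(state, action)

    # Temporal-difference error of the Critic.
    delta = reward + gamma * (W_critic @ next_state) - W_critic @ state
    W_critic += alpha_critic * delta * state   # move the Critic toward the TD target

    # Third model tracks the Critic's expected absolute error.
    abserr_pred = W_abserr @ state
    W_abserr += alpha_err * (abs(delta) - abserr_pred) * state

    # Original CACLA updates the Actor toward the executed action whenever delta > 0.
    # One plausible reading of the corrected rule (an assumption here): only update
    # when the TD error exceeds the Critic's estimated absolute error.
    if delta > abserr_pred:
        W_actor += alpha_actor * np.outer(action - W_actor @ state, state)

    state = next_state
```

The OU process is what gives the exploration its temporal correlation: consecutive noise samples drift smoothly rather than jumping independently, which is what the abstract credits for the improved exploration over Gaussian noise.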
