Reinforcement Learning with Interacting Continually Running Fully Recurrent Networks

We describe an on-line learning algorithm for attacking the fundamental credit assignment problem in non-stationary reactive environments. Reinforcement and pain are treated as special types of input to an agent living in the environment. The agent's only goal is to maximize cumulative reinforcement and to minimize cumulative pain. This simple goal may require the production of complicated action sequences. Supervised learning techniques for recurrent networks serve to construct a differentiable model of the environmental dynamics which includes a model of future reinforcement. While this model is being adapted, it is concurrently used for learning goal-directed behavior. The method extends work done by Munro, Robinson and Fallside, Werbos, Widrow, and Jordan.
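The core idea, a differentiable model of reinforcement trained by supervised learning and concurrently used to obtain gradients for a controller, can be illustrated with a deliberately tiny sketch. Everything below is a hypothetical toy setup, not the paper's recurrent architecture: scalar state and action, a linear controller in place of a recurrent network, and a quadratic reinforcement model, with exploration noise standing in for probabilistic output units.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical environment (unknown to the agent): the reinforcement
# r = -(a - 0.5*s)^2 is maximized by the action a = 0.5*s.
def reinforcement(s, a):
    return -(a - 0.5 * s) ** 2

# Differentiable model of reinforcement, quadratic in the action so that
# its gradient with respect to the action is informative.
wm = rng.normal(scale=0.1, size=5)            # model weights
def model_r(s, a):
    return wm[0]*s + wm[1]*a + wm[2]*a*s + wm[3]*a*a + wm[4]

wc = 0.0                                      # controller: a = wc*s (+ noise)
lr_m, lr_c = 0.05, 0.02

for step in range(5000):
    s = rng.uniform(-1.0, 1.0)
    a = wc * s + rng.normal(scale=0.3)        # exploration noise

    # On-line supervised model update: minimize the squared prediction
    # error (model_r(s, a) - r)^2 on the observed triple.
    err = model_r(s, a) - reinforcement(s, a)
    wm -= lr_m * 2.0 * err * np.array([s, a, a*s, a*a, 1.0])

    # Concurrent controller update: ascend the MODEL's reinforcement
    # gradient, chained through the action: (d r_hat/d a) * (d a/d wc).
    dr_da = wm[1] + wm[2]*s + 2.0*wm[3]*a
    wc += lr_c * dr_da * s

# wc should end up near the optimal gain 0.5, even though the controller
# never sees the true reinforcement gradient, only the model's.
```

The point of the sketch is the interleaving: the model is never fully trained before use, yet its (imperfect) gradient already steers the controller, which matches the abstract's claim that the model is adapted and exploited concurrently.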