Adaptive confidence and adaptive curiosity

Much of the recent research on adaptive neuro-control and reinforcement learning focuses on systems with adaptive 'world models'. Previous approaches, however, do not address the problem of modelling the reliability of the world model's predictions in uncertain environments. Furthermore, previous approaches usually train the world model to predict future environmental inputs from the system's previous inputs and control outputs by some ad-hoc method (like random search). This paper introduces ways of modelling the reliability of the outputs of adaptive predictors, and it describes more sophisticated and sometimes more efficient methods for their adaptive construction by on-line state space exploration. For instance, a 4-network reinforcement learning system is described which tries to maximize the expectation of the temporal derivative of the adaptive assumed reliability of future predictions. The system is 'curious' in the sense that it actively tries to provoke situations in which it has learned to expect to learn something about the environment. An experiment with an artificial non-deterministic environment demonstrates that the method can be faster than the conventional model-building strategy.

Much of the recent research on adaptive neuro-control and reinforcement learning focuses on systems with sub-modules that learn to predict inputs from the environment. These sub-modules are often called 'adaptive world models'; they are useful for a whole variety of control tasks. For instance, Werbos' and Jordan's architectures for neuro-control [16][3] contain an adaptive world model in the form of a back-propagation module (the model network) which is trained to predict the next input, given the current input and the current output of an adaptive control network. The model network makes it possible to compute error gradients for the controller's outputs (a sketch of this pathway appears below). This is essential, since typical adaptive neuro-control tasks provide no teacher who supplies desired controller outputs; there is only a desired environmental input. Extensions of this approach (e.g. [11]) rely on the same basic principles. Sutton's 'DYNA' systems [13] use adaptive world models to limit the number of 'real-world experiences' necessary to solve certain reinforcement learning tasks.

There are at least two important problems with all of these approaches that have not been addressed so far:

1. Previous model-building control systems are not well suited for uncertain, non-deterministic environments. In particular, they do not model the reliability of the predictions of the adaptive world models. Therefore, if credit assignment for the controller is based on the assumption of a correct world model, unexpected results may be obtained.

2. Previous model-building control …
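The gradient pathway just described, a controller whose action feeds a differentiable model network, with the error on the desired environmental input flowing back through the frozen model into the controller, can be made concrete with a short sketch. This is a minimal JAX illustration under my own assumptions (tiny one-layer tanh networks, quadratic loss); all names and shapes are hypothetical and not taken from the cited architectures:

```python
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)

# Hypothetical one-layer controller C and world model M.
theta = {"W": jax.random.normal(k1, (2, 4)) * 0.1}  # controller parameters
phi   = {"W": jax.random.normal(k2, (4, 6)) * 0.1}  # model parameters

def controller(theta, x):
    # Maps the current environmental input to a control action.
    return jnp.tanh(theta["W"] @ x)

def world_model(phi, x, a):
    # Trained (elsewhere) to predict the next input from (input, action).
    return jnp.tanh(phi["W"] @ jnp.concatenate([x, a]))

def control_loss(theta, phi, x, x_desired):
    # No teacher provides a desired action; only a desired next input
    # exists. Differentiating through the frozen world model yields
    # error gradients for the controller's outputs.
    a = controller(theta, x)
    x_pred = world_model(phi, x, a)
    return jnp.sum((x_pred - x_desired) ** 2)

x, x_desired = jnp.ones(4), jnp.zeros(4)
grads = jax.grad(control_loss)(theta, phi, x, x_desired)  # d loss / d theta
```

Note that the sketch silently assumes a correct world model; as problem 1 above points out, credit assignment through an unreliable model can yield unexpected results.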
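The abstract's central mechanism, rewarding the temporal derivative of the assumed reliability, can be sketched in the same style. The following is a hypothetical illustration, not the paper's 4-network architecture: a confidence module learns to predict the world model's squared error, and the curiosity reward is the drop in that predicted error after a learning step. The linear confidence model, the SGD step, and all shapes are my assumptions:

```python
import jax
import jax.numpy as jnp

psi = {"w": jax.random.normal(jax.random.PRNGKey(1), (6,)) * 0.1}

def predicted_error(psi, x, a):
    # Confidence module: predicts the world model's squared prediction
    # error in situation (x, a). Low predicted error = high assumed
    # reliability.
    return jnp.dot(psi["w"], jnp.concatenate([x, a]))

def confidence_loss(psi, x, a, observed_err):
    # Train the confidence module on the errors the world model
    # actually made.
    return (predicted_error(psi, x, a) - observed_err) ** 2

def curiosity_reward(psi_old, psi_new, x, a):
    # Internal reinforcement: the temporal derivative of the assumed
    # reliability, i.e. how much the predicted error dropped after
    # learning. It is high exactly where the system expects to learn
    # something about the environment.
    return predicted_error(psi_old, x, a) - predicted_error(psi_new, x, a)

# One illustrative update step.
x, a, observed_err = jnp.ones(4), jnp.ones(2), 0.7
g = jax.grad(confidence_loss)(psi, x, a, observed_err)
psi_new = {"w": psi["w"] - 0.01 * g["w"]}
r = curiosity_reward(psi, psi_new, x, a)
```

In the full system described by the abstract, a reinforcement learning controller is trained to maximize the expectation of this internal reward, which is what makes it actively provoke situations in which it expects to learn something.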

[1]  Richard S. Sutton et al. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 1983.

[2]  Paul J. Werbos et al. Building and Understanding Adaptive Systems: A Statistical/Numerical Approach to Factory Automation and Brain Research. IEEE Transactions on Systems, Man, and Cybernetics, 1987.

[3]  B. Widrow et al. The truck backer-upper: an example of self-learning in neural networks. International Joint Conference on Neural Networks, 1989.

[4]  C. Watkins. Learning from delayed rewards. PhD thesis, King's College, Cambridge, 1989.

[5]  Jürgen Schmidhuber et al. Reinforcement Learning in Markovian and Non-Markovian Environments. Advances in Neural Information Processing Systems (NIPS), 1990.

[6]  Jürgen Schmidhuber et al. Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem (Dynamic neural networks and the fundamental spatio-temporal learning problem). Dissertation, Technische Universität München, 1990.

[7]  J. Jameson et al. A neurocontroller based on model feedback and the adaptive heuristic critic. International Joint Conference on Neural Networks, 1990.

[8]  Jürgen Schmidhuber. A Possibility for Implementing Curiosity and Boredom in Model-Building Neural Controllers. Proceedings of the International Conference on Simulation of Adaptive Behavior, 1991.

[9]  Sebastian Thrun et al. On Planning and Exploration in Non-Discrete Environments, 1991.

[10]  Richard S. Sutton et al. Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin, 1990.

[11]  Jürgen Schmidhuber. Adaptive Decomposition of Time, 1991.

[12]  Long-Ji Lin et al. Self-improving reactive agents: case studies of reinforcement learning frameworks, 1991.

[13]  Jürgen Schmidhuber et al. Learning to Generate Artificial Fovea Trajectories for Target Detection. International Journal of Neural Systems, 1991.

[14]  Michael I. Jordan et al. Forward Models: Supervised Learning with a Distal Teacher. Cognitive Science, 1992.