Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty

This paper presents an action-selection technique for reinforcement learning in stationary Markovian environments. The technique can be used in direct algorithms such as Q-learning, or in indirect (model-based) algorithms such as adaptive dynamic programming. It rests on two principles. The first is to define a local measure of uncertainty using the theory of bandit problems. We show that such a measure suffers from several drawbacks; in particular, applying it directly yields weak algorithms that are easily misled by particular configurations of the environment. The second principle is introduced to eliminate this drawback: the local uncertainty measures are treated as rewards and back-propagated by the same dynamic programming or temporal-difference mechanisms used for the true rewards. This reproduces global-scale reasoning about uncertainty while relying only on local measures of it. Numerical simulations clearly demonstrate the effectiveness of these proposals.
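
The following is a minimal sketch of the second principle in a tabular Q-learning setting. It is not the paper's algorithm: the visit-count bonus stands in for the bandit-theoretic uncertainty measure, the weight theta and the env interface (reset/actions/step) are assumptions, and all names are hypothetical. The point it illustrates is that the exploration values E are learned with the same TD rule as Q, so uncertainty located several steps away is back-propagated into the current action choice.

```python
from collections import defaultdict


def local_uncertainty(count):
    """Local exploration bonus for a state-action pair.

    Stand-in for the bandit-derived uncertainty measure discussed in the
    paper: the bonus shrinks as the pair is tried more often.
    """
    return 1.0 / (1.0 + count) ** 0.5


def dual_q_learning(env, n_episodes=500, alpha=0.1, gamma=0.95, theta=1.0):
    """Q-learning with back-propagated uncertainty (illustrative sketch).

    Two tabular value functions are learned in parallel:
      Q[s][a] -- usual action values, driven by environment rewards
      E[s][a] -- exploration values, driven by local uncertainty bonuses,
                 back-propagated with the same TD update rule
    Actions are chosen greedily w.r.t. Q + theta * E, so the agent is drawn
    toward regions whose downstream uncertainty is high, not only toward
    locally untried actions.
    """
    Q = defaultdict(lambda: defaultdict(float))
    E = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))

    for _ in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            actions = env.actions(state)
            # Greedy with respect to the combined exploitation + exploration value.
            action = max(actions, key=lambda a: Q[state][a] + theta * E[state][a])
            next_state, reward, done = env.step(action)
            counts[state][action] += 1

            if done:
                best_q = best_e = 0.0
            else:
                next_actions = env.actions(next_state)
                best_q = max((Q[next_state][a] for a in next_actions), default=0.0)
                best_e = max((E[next_state][a] for a in next_actions), default=0.0)

            # Standard TD update driven by the environment reward.
            Q[state][action] += alpha * (reward + gamma * best_q - Q[state][action])
            # Same TD rule, but the "reward" is the local uncertainty bonus.
            bonus = local_uncertainty(counts[state][action])
            E[state][action] += alpha * (bonus + gamma * best_e - E[state][action])

            state = next_state

    return Q, E
```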
