In a recent paper the authors proposed a new approach to reinforcement learning based on multiple estimation models. Simple situations involving the use of direct schemes in learning automata, and indirect (estimation-based) schemes in feed-forward networks, were presented. The simulation results demonstrated that the proposed schemes are an order of magnitude faster than the linear reward-inaction scheme of learning automata, and comparable in speed to the indirect scheme based on the pursuit algorithm, while being substantially more robust than the latter. This makes them attractive in practical applications, which are significantly more complex owing to interacting decision makers. The main objective of this paper is twofold: (i) to provide reasons for the observed robustness of the proposed scheme, and (ii) to demonstrate through simulation studies that the scheme performs even better than the pursuit algorithm in more complex situations involving feedback (i.e., Markov Decision Processes, MDPs). A simulation study of Blackjack is also included in the last section.
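The abstract contrasts the direct linear reward-inaction (L_R-I) scheme with the indirect, estimation-based pursuit scheme. As a rough illustration only (not the authors' multiple-model method), a minimal sketch of both updates for a stationary two-action environment might look like the following; the reward probabilities, learning rate, and function names are hypothetical:

```python
import random

def lri_step(p, action, reward, lam=0.1):
    """Linear reward-inaction (L_R-I): move probability mass toward the
    chosen action only when the response is favorable; on a penalty the
    action probabilities are left unchanged (hence 'inaction')."""
    if reward:
        p = [(1.0 - lam) * q for q in p]  # shrink all probabilities
        p[action] += lam                  # then reward the chosen action
    return p

def pursuit_step(p, d_hat, counts, action, reward, lam=0.1):
    """Pursuit (indirect scheme): maintain running estimates d_hat of each
    action's reward probability and push the probability vector toward the
    action whose estimate is currently largest, regardless of the outcome."""
    counts[action] += 1
    d_hat[action] += (reward - d_hat[action]) / counts[action]  # running mean
    best = max(range(len(p)), key=lambda i: d_hat[i])
    p = [(1.0 - lam) * q for q in p]
    p[best] += lam
    return p

def run(update, steps=5000, seed=0):
    """Drive one scheme against a hypothetical two-action Bernoulli
    environment with reward probabilities 0.2 and 0.8."""
    rng = random.Random(seed)
    reward_probs = [0.2, 0.8]  # assumed environment, for illustration
    p = [0.5, 0.5]
    d_hat, counts = [0.0, 0.0], [0, 0]
    for _ in range(steps):
        action = 0 if rng.random() < p[0] else 1
        r = 1 if rng.random() < reward_probs[action] else 0
        if update is lri_step:
            p = lri_step(p, action, r)
        else:
            p = pursuit_step(p, d_hat, counts, action, r)
    return p
```

Both updates preserve the probability simplex, since scaling by (1 − λ) and adding λ to one component keeps the components summing to one. The fragility the paper attributes to estimation-based schemes shows up here in `pursuit_step`: an unlucky early estimate can make the scheme chase the wrong action.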