Labeling Q-learning in hidden state environments

Recently, reinforcement learning (RL) methods have been applied to learning problems in environments with embedded hidden states. However, conventional RL methods are limited to Markov decision process problems. Several algorithms have been proposed to overcome hidden states, but they require a large amount of memory to store the past observation sequences that represent historical state transitions. The aim of this work is to extend our previously proposed algorithm for environments with hidden states, called labeling Q-learning (LQ-learning), which augments incompletely observed perceptions by labeling. In LQ-learning, the agent has a perception structure consisting of pairs of observations and labels. From these pairs, the agent can more precisely distinguish hidden states that look the same but are actually different from each other. Labeling is carried out by labeling functions. Many labeling functions are conceivable; here we introduce labeling functions based only on the last and the current observations. This extended LQ-learning is applied to grid-world problems containing hidden states, and the simulation results demonstrate the effectiveness of LQ-learning.
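
To make the mechanism concrete, the following is a minimal sketch of the idea described above: the agent's internal state is the pair (observation, label), and a labeling function assigns labels using only the last and the current observations, so that perceptually identical hidden states can be separated. All names, the specific labeling rule, and the hyperparameter values are illustrative assumptions, not the authors' exact algorithm.

```python
import random
from collections import defaultdict


class LQAgent:
    """Sketch of labeling Q-learning over (observation, label) pairs."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.q = defaultdict(float)  # keyed by ((obs, label), action)
        self.prev_obs = None         # the labeling function sees only
        self.label = 0               # the last and current observations

    def relabel(self, obs):
        # Hypothetical labeling function: bump the label when the same
        # observation recurs, a cue that the agent may be in a distinct
        # hidden state that merely looks like the previous one.
        if obs == self.prev_obs:
            self.label = (self.label + 1) % 4  # small finite label set
        else:
            self.label = 0
        self.prev_obs = obs
        return (obs, self.label)  # the agent's internal state

    def act(self, state):
        # Epsilon-greedy action selection over labeled states.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update, applied to labeled states.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td_error = reward + self.gamma * best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td_error
```

Because labeling is purely internal, the surrounding learning loop is unchanged from ordinary Q-learning: at each step the environment's raw observation is passed through `relabel`, and the resulting (observation, label) pair is used as the state for action selection and the value update.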
