Mixed Reinforcement Learning for Partially Observable Markov Decision Process

Reinforcement learning has been widely used to solve problems in which the environment provides only sparse feedback. Q-learning solves fully observable Markov decision processes quite well. For partially observable Markov decision processes (POMDPs), a recurrent neural network (RNN) can be used to approximate Q values, but the learning time for such problems is typically very long. In this paper, Mixed Reinforcement Learning is presented to find an optimal policy for POMDPs in a shorter learning time. The method uses both a Q-value table and an RNN: the Q-value table stores Q values for fully observable states, while the RNN approximates Q values for hidden states. An observability degree is calculated for each state as the agent explores the environment; if the observability degree is less than a threshold, the state is treated as a hidden state. Experimental results on the lighting grid-world problem show that the proposed method enables an agent to acquire a policy as good as the policy acquired using an RNN alone, with better learning performance.
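
As a rough sketch of the mechanism described above, the following Python code routes Q-value estimation between a table and a small recurrent approximator. The class name, the transition-consistency heuristic used for the observability degree, and the simplified recurrent update are illustrative assumptions, since the abstract does not specify these details; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

class MixedQAgent:
    """Hypothetical mixed tabular/recurrent Q-learning agent (illustrative only)."""

    def __init__(self, n_actions, obs_dim=8, hidden_size=16,
                 alpha=0.1, gamma=0.95, obs_threshold=0.8):
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.obs_threshold = obs_threshold
        self.q_table = defaultdict(lambda: np.zeros(n_actions))
        # Successor counts per (state, action), used to estimate how
        # consistently a state's outcomes behave (assumed heuristic).
        self.next_counts = defaultdict(lambda: defaultdict(int))
        # Minimal Elman-style recurrent approximator for hidden states.
        rng = np.random.default_rng(0)
        self.W_in = rng.normal(scale=0.1, size=(hidden_size, obs_dim))
        self.W_rec = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        self.W_out = rng.normal(scale=0.1, size=(n_actions, hidden_size))
        self.h = np.zeros(hidden_size)

    def observability_degree(self, state):
        # Assumed heuristic: fraction of probability mass on the most
        # frequent successor, taken over the actions tried so far.
        # 1.0 means outcomes so far look deterministic (fully observable).
        degs = []
        for a in range(self.n_actions):
            counts = self.next_counts[(state, a)]
            total = sum(counts.values())
            if total:
                degs.append(max(counts.values()) / total)
        return min(degs) if degs else 1.0

    def rnn_q(self, obs_vec):
        # One recurrent step; Q estimates for the hidden-state branch.
        self.h = np.tanh(self.W_in @ obs_vec + self.W_rec @ self.h)
        return self.W_out @ self.h

    def q_values(self, state, obs_vec):
        if self.observability_degree(state) >= self.obs_threshold:
            return self.q_table[state]      # tabular branch
        return self.rnn_q(obs_vec)          # recurrent branch

    def update(self, state, obs_vec, action, reward, next_state, next_obs_vec):
        self.next_counts[(state, action)][next_state] += 1
        observable = self.observability_degree(state) >= self.obs_threshold
        if observable:
            q_sa = self.q_table[state]
            target = reward + self.gamma * np.max(self.q_values(next_state, next_obs_vec))
            q_sa[action] += self.alpha * (target - q_sa[action])
        else:
            # Simplified TD update on the output weights only; recurrent-state
            # bookkeeping across the bootstrap step is deliberately crude here.
            q_est = self.rnn_q(obs_vec)
            h_cur = self.h.copy()
            target = reward + self.gamma * np.max(self.q_values(next_state, next_obs_vec))
            self.W_out[action] += self.alpha * (target - q_est[action]) * h_cur
```

In use, an epsilon-greedy policy over q_values(state, obs_vec) would drive exploration, with update() called after each transition so that frequently aliased states drift below the threshold and are handled by the recurrent branch.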
