Mixed Reinforcement Learning for Partially Observable Markov Decision Process

Reinforcement learning has been widely used to solve problems in which the environment provides only sparse feedback. Q-learning solves fully observable Markov decision processes quite well. For partially observable Markov decision processes (POMDPs), a recurrent neural network (RNN) can be used to approximate Q values, but the learning time for such problems is typically very long. In this paper, Mixed Reinforcement Learning is presented to find an optimal policy for POMDPs in a shorter learning time. The method uses both a Q-value table and an RNN: the Q-value table stores Q values for fully observable states, while the RNN approximates Q values for hidden states. An observability degree is calculated for each state as the agent explores the environment; if the observability degree is less than a threshold, the state is treated as a hidden state. Experimental results on the lighting grid-world problem show that the proposed method enables an agent to acquire a policy as good as the policy acquired using an RNN alone, with better learning performance.
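
As a rough sketch of the mechanism described above, the following Python code routes Q-value estimation between a table and a small recurrent approximator. The class name, the transition-consistency heuristic used for the observability degree, and the simplified recurrent update are illustrative assumptions, since the abstract does not specify these details; this is a minimal sketch, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

class MixedQAgent:
    """Hypothetical mixed tabular/recurrent Q-learning agent (illustrative only)."""

    def __init__(self, n_actions, obs_dim=8, hidden_size=16,
                 alpha=0.1, gamma=0.95, obs_threshold=0.8):
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.obs_threshold = obs_threshold
        self.q_table = defaultdict(lambda: np.zeros(n_actions))
        # Successor counts per (state, action), used to estimate how
        # consistently a state's outcomes behave (assumed heuristic).
        self.next_counts = defaultdict(lambda: defaultdict(int))
        # Minimal Elman-style recurrent approximator for hidden states.
        rng = np.random.default_rng(0)
        self.W_in = rng.normal(scale=0.1, size=(hidden_size, obs_dim))
        self.W_rec = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
        self.W_out = rng.normal(scale=0.1, size=(n_actions, hidden_size))
        self.h = np.zeros(hidden_size)

    def observability_degree(self, state):
        # Assumed heuristic: fraction of probability mass on the most
        # frequent successor, taken over the actions tried so far.
        # 1.0 means outcomes so far look deterministic (fully observable).
        degs = []
        for a in range(self.n_actions):
            counts = self.next_counts[(state, a)]
            total = sum(counts.values())
            if total:
                degs.append(max(counts.values()) / total)
        return min(degs) if degs else 1.0

    def rnn_q(self, obs_vec):
        # One recurrent step; Q estimates for the hidden-state branch.
        self.h = np.tanh(self.W_in @ obs_vec + self.W_rec @ self.h)
        return self.W_out @ self.h

    def q_values(self, state, obs_vec):
        if self.observability_degree(state) >= self.obs_threshold:
            return self.q_table[state]      # tabular branch
        return self.rnn_q(obs_vec)          # recurrent branch

    def update(self, state, obs_vec, action, reward, next_state, next_obs_vec):
        self.next_counts[(state, action)][next_state] += 1
        observable = self.observability_degree(state) >= self.obs_threshold
        if observable:
            q_sa = self.q_table[state]
            target = reward + self.gamma * np.max(self.q_values(next_state, next_obs_vec))
            q_sa[action] += self.alpha * (target - q_sa[action])
        else:
            # Simplified TD update on the output weights only; recurrent-state
            # bookkeeping across the bootstrap step is deliberately crude here.
            q_est = self.rnn_q(obs_vec)
            h_cur = self.h.copy()
            target = reward + self.gamma * np.max(self.q_values(next_state, next_obs_vec))
            self.W_out[action] += self.alpha * (target - q_est[action]) * h_cur
```

In use, an epsilon-greedy policy over q_values(state, obs_vec) would drive exploration, with update() called after each transition so that frequently aliased states drift below the threshold and are handled by the recurrent branch.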
