Paper information - Efficient experience reuse in non-Markovian environments

Efficient experience reuse in non-Markovian environments

Learning time is always a critical issue in Reinforcement Learning, especially when Recurrent Neural Networks are used to predict Q values in non-Markovian environments. Experience reuse has received much attention due to its ability to reduce learning time. In this paper, we propose a new method to efficiently reuse experience. Our method generates new episodes from recorded episodes using an action-pair merger. Recorded episodes and new episodes are replayed after each learning epoch. We compare our method with standard online learning and with learning using experience replay in a vision-based robot problem. The results show the potential of this approach.
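As a rough illustration of the replay step mentioned in the abstract, the sketch below stores whole episodes and replays them after a learning epoch. It is a minimal sketch under stated assumptions, not the paper's method: the `Step` layout, the `EpisodeMemory` class, its `capacity` limit, and the `update` callback are all hypothetical names introduced here, and the action-pair merger that generates new episodes is not implemented because the abstract does not describe it.

```python
import random
from typing import Callable, List, Tuple

# One step of an episode: (observation, action, reward).
# This layout is an assumption made for illustration only.
Step = Tuple[int, int, float]
Episode = List[Step]


class EpisodeMemory:
    """Hypothetical store that keeps whole episodes for later replay."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.episodes: List[Episode] = []

    def record(self, episode: Episode) -> None:
        # Store a copy so later changes by the caller do not alter the memory.
        self.episodes.append(list(episode))
        # Drop the oldest episode once the memory is over capacity.
        if len(self.episodes) > self.capacity:
            self.episodes.pop(0)

    def replay(self, update: Callable[[Episode], None], rng: random.Random) -> int:
        """Pass every stored episode to `update` once, in shuffled episode order.

        Steps inside an episode keep their original order.
        Returns the number of episodes replayed.
        """
        order = list(self.episodes)
        rng.shuffle(order)
        for episode in order:
            update(episode)
        return len(order)
```

Episodes are kept intact here because a recurrent network's hidden state depends on the preceding steps, so shuffling individual transitions would break that history. In a fuller version, episodes produced by a merger could be passed to `record` in the same way as recorded ones.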
