Efficient experience reuse in non-Markovian environments

Learning time is always a critical issue in Reinforcement Learning, especially when Recurrent Neural Networks are used to predict Q values in non-Markovian environments. Experience reuse has been received much attention due to its ability to reduce learning time. In this paper, we propose a new method to efficiently reuse experience. Our method generates new episodes from recorded episodes using an action-pair merger. Recorded episodes and new episodes are replayed after each learning epoch. We compare our method with standard online learning, and learning using experience replay in a vision based robot problem. The results show the potential of this approach.

[1]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[2]  Jürgen Schmidhuber,et al.  Learning to forget: continual prediction with LSTM , 1999 .

[3]  Jürgen Schmidhuber,et al.  Learning Precise Timing with LSTM Recurrent Networks , 2003, J. Mach. Learn. Res..

[4]  Jürgen Schmidhuber,et al.  Reinforcement Learning in Markovian and Non-Markovian Environments , 1990, NIPS.

[5]  Risto Miikkulainen,et al.  Efficient Non-linear Control Through Neuroevolution , 2006, ECML.

[6]  Michael L. Littman,et al.  Memoryless policies: theoretical limitations and practical results , 1994 .

[7]  Mohamed S. Kamel,et al.  Reinforcement learning using a recurrent neural network , 1994, Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94).

[8]  Longxin Lin Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching , 2004, Machine Learning.

[9]  Takashi Komeda,et al.  REINFORCEMENT LEARNING FOR POMDP USING STATE CLASSIFICATION , 2008, MLMTA.

[10]  Pierre Geurts,et al.  Tree-Based Batch Mode Reinforcement Learning , 2005, J. Mach. Learn. Res..

[11]  Kenji Doya,et al.  Reinforcement Learning in Continuous Time and Space , 2000, Neural Computation.

[12]  Peter Stone,et al.  Batch reinforcement learning in a complex domain , 2007, AAMAS '07.

[13]  Chris Watkins,et al.  Learning from delayed rewards , 1989 .

[14]  Jürgen Schmidhuber,et al.  Quasi-online reinforcement learning for robots , 2006, Proceedings 2006 IEEE International Conference on Robotics and Automation, 2006. ICRA 2006..

[15]  Richard S. Sutton,et al.  Reinforcement Learning: An Introduction , 1998, IEEE Trans. Neural Networks.

[16]  Douglas C. Hittle,et al.  Robust Reinforcement Learning Control Using Integral Quadratic Constraints for Recurrent Neural Networks , 2007, IEEE Transactions on Neural Networks.

[17]  Longxin Lin,et al.  Reinforcement Learning in Non-Markov Environments , 1992 .

[18]  Andrew W. Moore,et al.  Reinforcement Learning: A Survey , 1996, J. Artif. Intell. Res..

[19]  I. Noda,et al.  Using suitable action selection rule in reinforcement learning , 2003, SMC'03 Conference Proceedings. 2003 IEEE International Conference on Systems, Man and Cybernetics. Conference Theme - System Security and Assurance (Cat. No.03CH37483).

[20]  Jürgen Schmidhuber,et al.  Training Recurrent Networks by Evolino , 2007, Neural Computation.

[21]  Andrew Y. Ng,et al.  Policy Invariance Under Reward Transformations: Theory and Application to Reward Shaping , 1999, ICML.

[22]  Bram Bakker,et al.  Reinforcement Learning with Long Short-Term Memory , 2001, NIPS.

[23]  Thomas Martinetz,et al.  Improving Optimality of Neural Rewards Regression for Data-Efficient Batch Near-Optimal Policy Identification , 2007, ICANN.

[24]  Samuel W. Hasinoff,et al.  Reinforcement Learning for Problems with Hidden State , 2003 .

[25]  Hajime Kita,et al.  Recurrent neural networks for reinforcement learning: architecture, learning algorithms and internal representation , 1998, 1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227).

[26]  Junku Yuh,et al.  Application of SONQL for real-time learning of robot behaviors , 2007, Robotics Auton. Syst..

[27]  Jürgen Schmidhuber,et al.  Solving Deep Memory POMDPs with Recurrent Policy Gradients , 2007, ICANN.

[28]  Jürgen Schmidhuber,et al.  A robot that reinforcement-learns to identify and memorize important previous observations , 2003, Proceedings 2003 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2003) (Cat. No.03CH37453).

[29]  Long-Ji Lin,et al.  Reinforcement learning for robots using neural networks , 1992 .