Probability Redistribution using Time Hopping for Reinforcement Learning

A method is proposed that uses the Time Hopping technique as a tool for probability redistribution. Applied to reinforcement learning in a simulation, it can reshape the state probability distribution of the underlying Markov decision process as desired. This is achieved by appropriately modifying the target selection strategy of Time Hopping. Experiments on a robot maze reinforcement learning problem show that the method improves exploration efficiency by reshaping the state probability distribution into an almost uniform one.
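A minimal sketch of the idea, under stated assumptions: Time Hopping lets the simulated agent jump ("hop") to an arbitrary state instead of continuing its current trajectory, and choosing hop targets with probability inversely proportional to their visit counts pushes the state visitation distribution toward uniform. The names used here (select_hop_target, env.set_state, agent.update, hop_prob) are illustrative assumptions, not the paper's actual interface.

```python
import random
from collections import defaultdict

def select_hop_target(visit_counts, states):
    """Pick a hop target, favouring rarely visited states (assumed target selection strategy)."""
    # Weight each state by 1 / (1 + visits) so under-visited states are chosen most often.
    weights = [1.0 / (1.0 + visit_counts[s]) for s in states]
    return random.choices(states, weights=weights, k=1)[0]

def train_with_time_hopping(env, agent, states, episodes=1000, hop_prob=0.2):
    """Learning loop with occasional hops that redistribute the state probability."""
    visit_counts = defaultdict(int)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            visit_counts[state] += 1
            if random.random() < hop_prob:
                # Time Hopping: jump the simulation to an under-visited state.
                state = select_hop_target(visit_counts, states)
                env.set_state(state)  # assumes the simulator allows setting its state directly
                continue
            action = agent.act(state)
            next_state, reward, done = env.step(action)
            agent.update(state, action, reward, next_state)
            state = next_state
    return visit_counts
```

With a flattening target selection of this kind, states that an undirected policy would rarely reach are visited roughly as often as the common ones, which is the exploration-efficiency effect reported in the abstract.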
