Cyclic error correction based Q-learning for mobile robot navigation

As in feedback control systems, reinforcement learning can capture notions of optimal behavior from natural interaction experience. In this setting, the temporal difference error of the generated experience measures how well the learner is responding to the system; in particular, the sequential difference of the accumulated temporal difference error can indicate learning performance. In this paper, we exploit this closed-loop error-correction property by mapping a representative error signal onto the step-size component. The proposed cyclic step-size better controls how new estimates are blended into the value function over time; the updated estimates guide the action selection process, which in turn shapes the value distribution. To steer the agent toward more promising actions, we further propose an ensemble action selector that incorporates the idea of the ensemble wisdom of the weak. Experimental results on a gridworld mobile robot navigation task demonstrate the validity, fast learning, and easy plug-in implementation of the derived algorithm, suggesting increased applicability to real-life problems.
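The abstract outlines two components: an error-driven (cyclic) step size for the Q-learning update and an ensemble action selector. The following is a minimal illustrative sketch, in Python, of how such components could fit together in tabular Q-learning. It assumes a simple running TD-error trace and a majority vote over epsilon-greedy voters; these, and all names such as ErrorAdaptiveQLearner, td_trace, and n_voters, are placeholder assumptions, not the paper's actual cyclic mapping or ensemble scheme.

    import numpy as np

    class ErrorAdaptiveQLearner:
        """Tabular Q-learning with an error-driven step size and a simple
        majority-vote ensemble action selector. Illustrative sketch only."""

        def __init__(self, n_states, n_actions, gamma=0.95, base_alpha=0.1,
                     n_voters=5, epsilon=0.2, seed=0):
            self.Q = np.zeros((n_states, n_actions))
            self.gamma = gamma
            self.base_alpha = base_alpha
            self.n_voters = n_voters
            self.epsilon = epsilon
            self.td_trace = 0.0  # running magnitude of recent TD errors
            self.rng = np.random.default_rng(seed)

        def select_action(self, state):
            # Ensemble of weak selectors: each votes epsilon-greedily on
            # Q(state, .); the most frequently voted action is executed.
            votes = np.zeros(self.Q.shape[1], dtype=int)
            for _ in range(self.n_voters):
                if self.rng.random() < self.epsilon:
                    votes[self.rng.integers(self.Q.shape[1])] += 1
                else:
                    votes[int(np.argmax(self.Q[state]))] += 1
            return int(np.argmax(votes))

        def update(self, s, a, r, s_next):
            td_error = r + self.gamma * np.max(self.Q[s_next]) - self.Q[s, a]
            # Map a running TD-error magnitude onto the step size: when recent
            # errors are large, alpha shrinks and updates are blended in more
            # cautiously; as errors decay, alpha approaches base_alpha.
            self.td_trace = 0.9 * self.td_trace + 0.1 * abs(td_error)
            alpha = self.base_alpha / (1.0 + self.td_trace)
            self.Q[s, a] += alpha * td_error
            return td_error

With this particular shaping the step size is damped while the recent TD-error magnitude is large; other mappings (for example, growing the step size with the error to correct faster) are equally plausible readings of the abstract, and the paper's own mapping should be taken from its method section.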
