Action Selection Methods in a Robotic Reinforcement Learning Scenario

Reinforcement learning allows an agent to learn a new task while autonomously exploring its environment. To do so, the agent must choose, in each state, one action to perform from among those available. A common problem for a reinforcement learning agent, however, is finding a proper balance between exploration and exploitation of actions in order to achieve optimal behavior. This paper compares several approaches to the exploration/exploitation dilemma in reinforcement learning and, moreover, implements an exemplary reinforcement learning task in the domain of domestic robotics to evaluate the performance of different exploration policies. We perform the domestic task using ε-greedy, softmax, VDBE, and VDBE-Softmax with online and offline temporal-difference learning. The obtained results show that the agent collects larger rewards more quickly when using the VDBE-Softmax exploration strategy with both Q-learning and SARSA.
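
As a rough illustration of how the compared action selection rules operate (a minimal sketch over a tabular Q-function, not the paper's implementation), the Python snippet below shows ε-greedy, softmax (Boltzmann), and VDBE-Softmax selection, plus the VDBE rule for adapting a state-dependent ε from the observed value difference. All parameter values (epsilon, tau, sigma, delta) are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon take a uniformly random action, otherwise the greedy one.
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def softmax(q_values, tau=0.5):
    # Boltzmann exploration: sample an action with probability proportional to exp(Q / tau).
    prefs = np.exp((np.asarray(q_values) - np.max(q_values)) / tau)  # shift for numerical stability
    probs = prefs / prefs.sum()
    return int(rng.choice(len(q_values), p=probs))

def vdbe_update(epsilon_s, value_diff, sigma=1.0, delta=0.1):
    # VDBE: adapt the per-state epsilon from the magnitude of the change in Q
    # caused by the last update (larger changes -> more exploration).
    f = (1.0 - np.exp(-abs(value_diff) / sigma)) / (1.0 + np.exp(-abs(value_diff) / sigma))
    return delta * f + (1.0 - delta) * epsilon_s

def vdbe_softmax(q_values, epsilon_s, tau=0.5):
    # VDBE-Softmax: with probability epsilon(s) explore via softmax, otherwise act greedily.
    if rng.random() < epsilon_s:
        return softmax(q_values, tau)
    return int(np.argmax(q_values))

In this sketch the pure-exploitation case is just np.argmax over the Q-values of the current state; the strategies differ only in when and how they deviate from that greedy choice, which is exactly the dimension compared in the experiments.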
