Towards a common implementation of reinforcement learning for multiple robotic tasks

Abstract Mobile robots are increasingly employed to perform complex tasks in dynamic environments. Such tasks can be either explicitly programmed by an engineer or learned through an automatic learning method, which improves the adaptability of the robot and reduces the effort of setting it up. In this context, reinforcement learning (RL) methods are recognized as a promising tool for a machine to learn autonomously how to perform tasks that are specified in a relatively simple manner. However, the dependency of these methods on the particular task to be learned is a well-known problem that has strongly restricted practical implementations in robotics so far. Breaking this barrier would have a significant impact on these and other intelligent systems; in particular, a core method that requires little tuning effort to be applicable to diverse tasks would boost their learning autonomy and self-adaptation capabilities. In this paper we present such a practical core implementation of RL, which enables learning of multiple robotic tasks with minimal or no per-task tuning. Based on value-iteration methods, we introduce a novel approach for action selection, called Q-biased softmax regression (QBIASSR), that exploits the structure of the state space by attending to the physical variables involved (e.g., distances to obstacles, robot pose), so that sets of already-experienced states accelerate decision-making in unexplored or rarely explored states. Intensive experiments with both real and simulated robots, carried out with the software framework also introduced here, show that our implementation can learn different robotic tasks without tuning the learning method. They also suggest that the combination of true online SARSA(λ) (TOSL) with QBIASSR can outperform existing core RL algorithms in low-dimensional robotic tasks. All of these results are promising steps towards robotic agents that autonomously learn much more complex tasks.
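To make the two ingredients named in the abstract concrete, the following Python sketch illustrates them under stated assumptions; it is not the authors' implementation. TrueOnlineSarsaLambda follows the published tabular true online SARSA(λ) update of van Seijen and Sutton (dutch eligibility traces plus a stored previous value estimate), and qbias_softmax applies a softmax over Q-values biased by the mean Q-row of related, already-experienced states. The function and class names, the flat related_states list (the paper derives the bias from reduced combinations of the physical state variables), and the temperature parameter tau are illustrative assumptions.

import numpy as np


def qbias_softmax(Q, s, related_states, tau=1.0, rng=None):
    """Softmax action selection over bias-corrected Q-values (QBIASSR sketch).

    Q: array of shape (n_states, n_actions).
    related_states: indices of experienced states that share values of some
    physical variables with state s (a simplifying assumption of this sketch).
    """
    rng = rng or np.random.default_rng()
    # Bias the current Q-row with the average Q-row of related states,
    # so experience gathered elsewhere guides rarely visited states.
    bias = Q[related_states].mean(axis=0) if len(related_states) else 0.0
    prefs = (Q[s] + bias) / tau
    prefs -= prefs.max()                      # numerical stability
    p = np.exp(prefs)
    p /= p.sum()
    return rng.choice(Q.shape[1], p=p)


class TrueOnlineSarsaLambda:
    """Tabular true online SARSA(lambda) with dutch eligibility traces."""

    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, lam=0.9):
        self.Q = np.zeros((n_states, n_actions))
        self.e = np.zeros((n_states, n_actions))  # eligibility traces
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.q_old = 0.0                          # value estimate of previous step

    def update(self, s, a, r, s2, a2, done=False):
        q_sa = self.Q[s, a]
        q_next = 0.0 if done else self.Q[s2, a2]
        delta = r + self.gamma * q_next - q_sa
        # Dutch-trace update (tabular special case of the linear algorithm).
        self.e *= self.gamma * self.lam
        self.e[s, a] += 1.0 - self.alpha * self.e[s, a]
        # True online update: corrects the conventional SARSA(lambda) step
        # so that it matches the online lambda-return exactly.
        self.Q += self.alpha * (delta + q_sa - self.q_old) * self.e
        self.Q[s, a] -= self.alpha * (q_sa - self.q_old)
        self.q_old = 0.0 if done else self.Q[s2, a2]
        if done:
            self.e[:] = 0.0

In a control loop the two pieces combine naturally: the next action a2 is drawn with qbias_softmax, then update(s, a, r, s2, a2, done) is applied, which corresponds to the TOSL+QBIASSR pairing the experiments evaluate.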
