Behavior policy learning: Learning multi-stage tasks via solution sketches and model-based controllers

Multi-stage tasks are a challenge for reinforcement learning methods and require either specific task knowledge (e.g., task segmentation) or a large amount of interaction to be learned. In this paper, we propose Behavior Policy Learning (BPL), which effectively combines 1) a few solution sketches, that is, demonstrations that contain only states and no actions, 2) model-based controllers, and 3) simulations to solve multi-stage tasks without strong knowledge about the underlying task. Our main intuition is that solution sketches alone provide strong data for learning a high-level trajectory by imitation, and that model-based controllers can be used to follow this trajectory (which we call a behavior) effectively. Finally, we utilize robotic simulations to further improve the policy and make it robust in a Sim2Real style. We evaluate our method in simulation with a robotic manipulator that has to perform two tasks with variations: 1) grasp a box and place it in a basket, and 2) re-place a book on a different shelf of a bookcase. We also validate the Sim2Real capabilities of our method by performing, for the first task, real-world experiments and realistic simulated experiments in which the objects are tracked through an RGB-D camera.
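Since the abstract only sketches the pipeline, the following is a minimal illustration of the two stages it describes: fitting a high-level next-state predictor from state-only demonstrations, and handing its predictions to a low-level model-based controller for tracking. This is not the authors' implementation; the names `fit_behavior`, `track_behavior`, and the `low_level_control` callback are hypothetical, and the linear ridge-regression model stands in for whatever imitation learner the paper actually uses.

```python
# Hypothetical sketch of the BPL idea from the abstract, not the paper's code.
import numpy as np

def fit_behavior(sketches, ridge=1e-3):
    """Fit a linear next-state predictor s_{t+1} = s_t @ W by ridge
    regression from state-only demonstrations ("solution sketches").

    sketches: list of (T, d) arrays, one state trajectory per demo.
    """
    X = np.vstack([traj[:-1] for traj in sketches])  # states s_t
    Y = np.vstack([traj[1:] for traj in sketches])   # next states s_{t+1}
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)               # W, shape (d, d)

def track_behavior(s0, W, low_level_control, steps=100):
    """Roll out the learned behavior: at each step, predict the next
    desired state and let a model-based controller (e.g., an
    inverse-dynamics or operational-space controller) track it."""
    s = s0
    trajectory = [s0]
    for _ in range(steps):
        s_desired = s @ W                    # high-level "behavior" target
        s = low_level_control(s, s_desired)  # controller moves toward target
        trajectory.append(s)
    return np.array(trajectory)
```

In this reading, the demonstrations never need action labels: the imitation stage only predicts where the state should go next, and the model-based controller supplies the actions, which is what lets BPL work from sketches alone before simulation-based fine-tuning.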
