SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards

We present sequential windowed inverse reinforcement learning (SWIRL), a policy search algorithm that is a hybrid of the exploration and demonstration paradigms for robot learning. We apply unsupervised learning to a small number of initial expert demonstrations to structure subsequent autonomous exploration. SWIRL approximates a long-horizon task as a sequence of local reward functions and subtask transition conditions. Over this approximation, SWIRL applies Q-learning to compute a policy that maximizes rewards. Experiments suggest that SWIRL requires significantly fewer rollouts than pure reinforcement learning and fewer expert demonstrations than behavioral cloning to learn a policy. We evaluate SWIRL in two simulated control tasks: parallel parking and a two-link pendulum. On the parallel parking task, SWIRL achieves the maximum reward with 85% fewer rollouts than Q-learning and one-eighth of the demonstrations needed by behavioral cloning. We also report physical experiments on tensioning and cutting deformable sheets using a da Vinci surgical robot. On the deformable tensioning task, SWIRL achieves a 36% relative improvement in reward compared with a baseline of behavioral cloning with segmentation.
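To make the structure described above concrete, the following is a minimal sketch, not the authors' implementation: it replaces SWIRL's unsupervised segmentation and inverse reinforcement learning steps with simple placeholders (fixed-length windowing and distance-to-goal rewards), and runs tabular Q-learning over an augmented state that tracks the current subtask. The environment, the function names segment_demo and fit_local_rewards, and all parameters are hypothetical and chosen only for illustration.

```python
"""Illustrative sketch of the three-stage structure from the abstract:
(1) split demonstrations into subtasks, (2) fit a local reward per subtask,
(3) Q-learning over (state, subtask index) with transition conditions.
All components are simplified placeholders, not SWIRL's actual methods."""
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: segment each demonstration (placeholder: k equal-length windows;
# SWIRL instead learns segments from the demonstrations).
def segment_demo(demo, k=3):
    return np.array_split(demo, k)

# Stage 2: fit a local reward per subtask (placeholder: negative distance
# to the mean end-state of that segment across demonstrations).
def fit_local_rewards(demos, k=3):
    goals = []
    for j in range(k):
        ends = [segment_demo(d, k)[j][-1] for d in demos]
        goals.append(np.mean(ends))
    return [lambda s, g=g: -abs(s - g) for g in goals], goals

# Toy 1-D chain task: states 0..N-1, actions move left/right.
N, K = 20, 3
demos = [np.linspace(0, N - 1, N) + rng.normal(0, 0.3, N) for _ in range(5)]
rewards, goals = fit_local_rewards(demos, K)

# Stage 3: Q-learning over the augmented state (chain state, subtask index).
Q = np.zeros((N, K, 2))  # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.99, 0.1
for episode in range(2000):
    s, seg = 0, 0
    for t in range(100):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s, seg]))
        s2 = int(np.clip(s + (1 if a == 1 else -1), 0, N - 1))
        r = rewards[seg](s2)
        seg2 = seg
        # Transition condition: advance to the next subtask near its goal.
        if abs(s2 - goals[seg]) < 1.0 and seg < K - 1:
            seg2 = seg + 1
        Q[s, seg, a] += alpha * (r + gamma * Q[s2, seg2].max() - Q[s, seg, a])
        s, seg = s2, seg2

print("greedy actions in final subtask:", np.argmax(Q[:, K - 1, :], axis=1))
```

The augmented state is the key design point suggested by the abstract: because the reward is local to the current subtask, conditioning the Q-function on the subtask index keeps the overall problem Markovian even though the task-level reward is delayed.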
