SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards

We present sequential windowed inverse reinforcement learning (SWIRL), a policy search algorithm that is a hybrid of the exploration and demonstration paradigms for robot learning. We apply unsupervised learning to a small number of initial expert demonstrations to structure subsequent autonomous exploration. SWIRL approximates a long-horizon task as a sequence of local reward functions and subtask transition conditions. Over this approximation, SWIRL applies Q-learning to compute a policy that maximizes rewards. Experiments suggest that SWIRL requires significantly fewer rollouts than pure reinforcement learning and fewer expert demonstrations than behavioral cloning to learn a policy. We evaluate SWIRL in two simulated control tasks: parallel parking and a two-link pendulum. On the parallel parking task, SWIRL achieves the maximum reward with 85% fewer rollouts than Q-learning and one-eighth of the demonstrations needed by behavioral cloning. We also report physical experiments on tensioning and cutting deformable sheets using a da Vinci surgical robot. On the deformable tensioning task, SWIRL achieves a 36% relative improvement in reward compared with a baseline of behavioral cloning with segmentation.
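To make the structure described above concrete, the following is a minimal sketch, not the authors' implementation: it replaces SWIRL's unsupervised segmentation and inverse reinforcement learning steps with simple placeholders (fixed-length windowing and distance-to-goal rewards), and runs tabular Q-learning over an augmented state that tracks the current subtask. The environment, the function names segment_demo and fit_local_rewards, and all parameters are hypothetical and chosen only for illustration.

```python
"""Illustrative sketch of the three-stage structure from the abstract:
(1) split demonstrations into subtasks, (2) fit a local reward per subtask,
(3) Q-learning over (state, subtask index) with transition conditions.
All components are simplified placeholders, not SWIRL's actual methods."""
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: segment each demonstration (placeholder: k equal-length windows;
# SWIRL instead learns segments from the demonstrations).
def segment_demo(demo, k=3):
    return np.array_split(demo, k)

# Stage 2: fit a local reward per subtask (placeholder: negative distance
# to the mean end-state of that segment across demonstrations).
def fit_local_rewards(demos, k=3):
    goals = []
    for j in range(k):
        ends = [segment_demo(d, k)[j][-1] for d in demos]
        goals.append(np.mean(ends))
    return [lambda s, g=g: -abs(s - g) for g in goals], goals

# Toy 1-D chain task: states 0..N-1, actions move left/right.
N, K = 20, 3
demos = [np.linspace(0, N - 1, N) + rng.normal(0, 0.3, N) for _ in range(5)]
rewards, goals = fit_local_rewards(demos, K)

# Stage 3: Q-learning over the augmented state (chain state, subtask index).
Q = np.zeros((N, K, 2))  # actions: 0 = left, 1 = right
alpha, gamma, eps = 0.5, 0.99, 0.1
for episode in range(2000):
    s, seg = 0, 0
    for t in range(100):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[s, seg]))
        s2 = int(np.clip(s + (1 if a == 1 else -1), 0, N - 1))
        r = rewards[seg](s2)
        seg2 = seg
        # Transition condition: advance to the next subtask near its goal.
        if abs(s2 - goals[seg]) < 1.0 and seg < K - 1:
            seg2 = seg + 1
        Q[s, seg, a] += alpha * (r + gamma * Q[s2, seg2].max() - Q[s, seg, a])
        s, seg = s2, seg2

print("greedy actions in final subtask:", np.argmax(Q[:, K - 1, :], axis=1))
```

The augmented state is the key design point suggested by the abstract: because the reward is local to the current subtask, conditioning the Q-function on the subtask index keeps the overall problem Markovian even though the task-level reward is delayed.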
