Transporter Networks: Rearranging the Visual World for Robotic Manipulation

Robotic manipulation can be formulated as inducing a sequence of spatial displacements, where the space being moved can encompass an object, part of an object, or an end effector. In this work, we propose the Transporter Network, a simple model architecture that rearranges deep features to infer spatial displacements from visual input, which can parameterize robot actions. It makes no assumptions about objectness (e.g., canonical poses, models, or keypoints), it exploits spatial symmetries, and it is orders of magnitude more sample-efficient than our benchmarked alternatives at learning vision-based manipulation tasks: from stacking a pyramid of blocks, to assembling kits with unseen objects; from manipulating deformable ropes, to pushing piles of small objects with closed-loop feedback. Our method can represent complex multi-modal policy distributions and generalizes to multi-step sequential tasks, as well as 6DoF pick-and-place. Experiments on 10 simulated tasks show that it learns faster and generalizes better than a variety of end-to-end baselines, including policies that use ground-truth object poses. We validate our method with hardware in the real world. Experiment videos and code will be made available at this https URL.
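
To make the "rearranging deep features" idea concrete, here is a minimal sketch of the two-step pick-and-place inference, not the authors' implementation: `attention_net`, `key_net`, and `query_net` are hypothetical stand-ins for fully convolutional feature extractors, and all names are illustrative.

```python
# Minimal sketch of Transporter-style inference (illustrative, not the paper's code).
import numpy as np
from scipy.signal import correlate

def infer_pick_place(obs, attention_net, key_net, query_net, crop=64):
    # 1) Pick: predict dense per-pixel pick affordances; the argmax is the pick pixel.
    pick_logits = attention_net(obs)                    # (H, W)
    pick = np.unravel_index(np.argmax(pick_logits), pick_logits.shape)

    # 2) Place: "transport" the deep features under the pick by cross-correlating
    #    them against features of the whole scene; the argmax of the resulting
    #    correlation map is the place pixel.
    key = key_net(obs)                                  # (H, W, C)
    query = query_net(obs)                              # (H, W, C)
    H, W, C = key.shape
    h = crop // 2
    r = int(np.clip(pick[0], h, H - h))                 # keep the crop in bounds
    c = int(np.clip(pick[1], h, W - h))
    kernel = query[r - h:r + h, c - h:c + h, :]         # feature crop around the pick
    place_logits = np.zeros((H, W))
    for ch in range(C):                                 # sum channel-wise correlations
        place_logits += correlate(key[..., ch], kernel[..., ch], mode="same")
    place = np.unravel_index(np.argmax(place_logits), place_logits.shape)
    return pick, place
```

Because the place map is produced by cross-correlating a feature crop over the scene, the prediction is equivariant to translations of the placement target; the full model additionally discretizes rotations by correlating rotated copies of the crop, which this sketch omits.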
