Where2Act: From Pixels to Actions for Articulated 3D Objects

One of the fundamental goals of visual perception is to enable agents to interact meaningfully with their environment. In this paper, we take a step towards that long-term goal: we extract highly localized actionable information for elementary actions, such as pushing or pulling, on articulated objects with movable parts. For example, given a drawer, our network predicts that applying a pulling force on the handle opens the drawer. We propose, discuss, and evaluate novel network architectures that, given image and depth data, predict the set of actions possible at each pixel and the regions of articulated parts that are likely to move under the applied force. We further propose a learning-from-interaction framework with an online data-sampling strategy that allows us to train the network in simulation (SAPIEN) and generalize across object categories. More importantly, our learned models also transfer to real-world data. Code and data are available on the project website.
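To make the per-pixel prediction target concrete, the sketch below shows one possible shape of such an actionability head. It is a minimal illustration under assumptions made here for readability, not the paper's actual architecture: the shared-MLP point encoder, the feature sizes, and the number of primitive action types are all placeholders.

# Minimal sketch of a per-point "actionability" prediction head.
# Assumptions (not from the abstract): a toy shared-MLP encoder over xyz points
# and six primitive action types; a real system would use a stronger
# point-cloud or image+depth backbone.
import torch
import torch.nn as nn

NUM_ACTION_TYPES = 6  # assumed number of primitive interactions (e.g., push/pull variants)

class PerPointActionability(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Placeholder point-feature encoder: a shared MLP applied to each 3D point.
        self.encoder = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        # One score per point per action type, interpreted as the likelihood
        # that applying that action at this location moves an articulated part.
        self.head = nn.Linear(feat_dim, NUM_ACTION_TYPES)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (batch, num_points, 3) -> scores: (batch, num_points, NUM_ACTION_TYPES)
        feats = self.encoder(points)
        return torch.sigmoid(self.head(feats))

if __name__ == "__main__":
    model = PerPointActionability()
    cloud = torch.rand(2, 1024, 3)   # two synthetic point clouds
    scores = model(cloud)            # per-point, per-action success likelihoods
    print(scores.shape)              # torch.Size([2, 1024, 6])

In a learning-from-interaction setup such as the one described above, the supervision for these scores would come from sampled interactions in simulation: an attempted push or pull at a point is labeled a success if the corresponding part actually moves.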
