O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning

In contrast to the vast literature on modeling, perceiving, and understanding agent-object interaction (e.g., human-object, hand-object, robot-object) in computer vision and robotics, very few past works have studied object-object interaction, which also plays an important role in robotic manipulation and planning. Our daily lives offer a rich space of object-object interaction scenarios, such as placing an object on a messy tabletop, fitting an object inside a drawer, and pushing an object with a tool. In this paper, we propose a unified affordance learning framework for learning object-object interaction across various tasks. By constructing four object-object interaction task environments using physical simulation (SAPIEN) and thousands of ShapeNet models with rich geometric diversity, we conduct large-scale object-object affordance learning without the need for human annotations or demonstrations. At the core of our technical contribution, we propose an object-kernel point convolution network to reason about detailed interaction between two objects. Experiments on large-scale synthetic data and real-world data demonstrate the effectiveness of the proposed approach. Please refer to the project webpage for code, data, videos, and additional materials.
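To make the idea of an object-kernel point convolution concrete, the following minimal PyTorch sketch shows one plausible instantiation: per-point features of the acting object (e.g., the item being placed or pushed) serve as a convolution kernel evaluated against every point of the scene object, yielding a per-point interaction feature from which affordance scores could be predicted. The module name ObjectKernelConv, the dense all-pairs aggregation, and the feature dimensions are illustrative assumptions, not the paper's actual architecture.

# Hypothetical sketch of an "object-kernel point convolution": the acting
# object's point features act as the kernel that is swept over the scene
# object's points. All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class ObjectKernelConv(nn.Module):
    """For every scene point, aggregate acting-object features weighted by
    relative position, producing a per-point interaction feature."""

    def __init__(self, feat_dim: int = 128, hidden_dim: int = 128):
        super().__init__()
        # Maps (relative offset, acting-object point feature) -> kernel response.
        self.kernel_mlp = nn.Sequential(
            nn.Linear(3 + feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, feat_dim),
        )
        # Fuses the scene point's own feature with the aggregated kernel response.
        self.fuse_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, scene_xyz, scene_feat, act_xyz, act_feat):
        # scene_xyz: (B, N, 3), scene_feat: (B, N, F) -- scene object points
        # act_xyz:   (B, M, 3), act_feat:   (B, M, F) -- acting object points
        B, N, _ = scene_xyz.shape
        M = act_xyz.shape[1]
        # Relative offsets between every scene point and every acting-object point.
        rel = scene_xyz.unsqueeze(2) - act_xyz.unsqueeze(1)        # (B, N, M, 3)
        act = act_feat.unsqueeze(1).expand(B, N, M, -1)            # (B, N, M, F)
        kernel = self.kernel_mlp(torch.cat([rel, act], dim=-1))    # (B, N, M, F)
        # Max-pool the acting object's kernel response at each scene point.
        response = kernel.max(dim=2).values                        # (B, N, F)
        return self.fuse_mlp(torch.cat([scene_feat, response], dim=-1))


if __name__ == "__main__":
    conv = ObjectKernelConv(feat_dim=128)
    out = conv(torch.rand(2, 1024, 3), torch.rand(2, 1024, 128),
               torch.rand(2, 256, 3), torch.rand(2, 256, 128))
    print(out.shape)  # torch.Size([2, 1024, 128])

In practice the dense N x M aggregation above would likely be restricted to local neighborhoods (as in standard point convolutions) to keep memory tractable; the sketch keeps the all-pairs form only for brevity.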
