AffordanceNet: An End-to-End Deep Learning Approach for Object Affordance Detection

We propose AffordanceNet, a new deep learning approach that simultaneously detects multiple objects and their affordances from RGB images. AffordanceNet has two branches: an object detection branch that localizes and classifies each object, and an affordance detection branch that assigns each pixel of the object to its most probable affordance label. The framework employs three key components to handle the multiclass problem in the affordance mask effectively: a sequence of deconvolutional layers, a robust resizing strategy, and a multi-task loss function. Experimental results on public datasets show that AffordanceNet outperforms recent state-of-the-art methods by a fair margin, while its end-to-end architecture allows inference at 150 ms per image. This makes AffordanceNet well suited to real-time robotic applications. Furthermore, we demonstrate the effectiveness of AffordanceNet in different testing environments and in real robotic applications. The source code is available at https://github.com/nqanh/affordance-net.
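
The affordance branch and the multi-task loss described above can be illustrated with a minimal sketch. It is not the released implementation: the function names, the toy label set (background, grasp, contain), and the unweighted sum of the three task losses are all assumptions made for illustration. It shows only the two ideas the abstract states, namely that each pixel receives its most probable affordance label and that the mask is a multiclass (not binary) problem trained jointly with detection.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def affordance_label_map(score_map):
    """Assign each pixel its most probable affordance label.

    score_map[y][x] is a list of per-class scores for that pixel
    (index 0 is assumed to be background).
    """
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in score_map]

def mask_loss(score_map, target_map):
    """Mean per-pixel multiclass cross-entropy over the mask."""
    losses = [-math.log(softmax(px)[t])
              for row, trow in zip(score_map, target_map)
              for px, t in zip(row, trow)]
    return sum(losses) / len(losses)

def multi_task_loss(loss_cls, loss_box, loss_mask):
    """Unweighted sum of per-task losses (equal weighting is an assumption)."""
    return loss_cls + loss_box + loss_mask

# Toy 2x2 mask with three hypothetical labels: 0=background, 1=grasp, 2=contain
scores = [[[2.0, 0.1, 0.1], [0.2, 3.0, 0.1]],
          [[0.1, 0.2, 2.5], [0.3, 1.5, 0.2]]]
labels = affordance_label_map(scores)
print(labels)  # [[0, 1], [2, 1]]
```

In a real network the per-pixel scores would come from the deconvolutional layers of the affordance branch, and the classification and box-regression losses would come from the object detection branch; here they are plain numbers passed in by the caller.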
