A Multi-scale CNN for Affordance Segmentation in RGB Images

Given a single RGB image, our goal is to label every pixel with an affordance type. By affordance, we mean an object’s capability to readily support a certain human action, without requiring precursor actions. We focus on segmenting five affordance types in indoor scenes: ‘walkable’, ‘sittable’, ‘lyable’, ‘reachable’, and ‘movable’. Our approach uses a deep architecture consisting of several multi-scale convolutional neural networks (CNNs) that extract mid-level visual cues and combine them for affordance segmentation. The mid-level cues include a depth map, surface normals, and a segmentation into four surface types: floor, structure, furniture, and props. For evaluation, we augmented the NYUv2 dataset with new ground-truth annotations of the five affordance types. We are not aware of prior work that starts from pixels, infers mid-level cues, and combines them in a feed-forward fashion to predict dense affordance maps from a single RGB image.
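The abstract does not give layer-level details, so the following is only a minimal PyTorch sketch of the described feed-forward pipeline under stated assumptions: cue sub-networks predict depth (1 channel), surface normals (3 channels), and a 4-way surface segmentation from RGB, and a fusion network combines these cues with the image into a dense 5-class affordance map. The names CueNet, AffordanceNet, and conv_block, as well as all channel widths and the encoder-decoder depth, are illustrative assumptions, not the authors’ architecture.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 convolution + ReLU; padding preserves spatial resolution.
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.ReLU(inplace=True))

class CueNet(nn.Module):
    # Small encoder-decoder predicting one mid-level cue from RGB
    # (hypothetical stand-in for one of the paper's multi-scale CNNs).
    def __init__(self, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(3, 32),
            nn.MaxPool2d(2),                 # coarser scale
            conv_block(32, 64),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, out_channels, 1),  # per-pixel cue prediction
        )

    def forward(self, x):
        return self.net(x)

class AffordanceNet(nn.Module):
    # Feed-forward fusion of RGB and the three mid-level cues.
    NUM_AFFORDANCES = 5  # walkable, sittable, lyable, reachable, movable

    def __init__(self):
        super().__init__()
        self.depth_net = CueNet(out_channels=1)    # depth map
        self.normal_net = CueNet(out_channels=3)   # surface normals
        self.surface_net = CueNet(out_channels=4)  # floor/structure/furniture/props
        self.fusion = nn.Sequential(
            conv_block(3 + 1 + 3 + 4, 64),
            conv_block(64, 64),
            nn.Conv2d(64, self.NUM_AFFORDANCES, 1),
        )

    def forward(self, rgb):
        cues = [self.depth_net(rgb), self.normal_net(rgb), self.surface_net(rgb)]
        fused = torch.cat([rgb] + cues, dim=1)  # stack image + cues channel-wise
        return self.fusion(fused)               # per-pixel affordance logits

if __name__ == "__main__":
    model = AffordanceNet()
    logits = model(torch.randn(1, 3, 64, 64))   # dummy RGB image
    print(logits.shape)                         # torch.Size([1, 5, 64, 64])

In a setup like this, one would presumably supervise the cue sub-networks with NYUv2 depth, normals, and surface labels, and supervise the fusion output with the new affordance annotations; the abstract does not specify the training procedure, so this is an assumption.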
