Deep Affordance-Grounded Sensorimotor Object Recognition

Cognitive neuroscience has established that human object perception is a complex process in which object appearance information is combined with evidence about so-called object affordances, namely the types of actions that humans typically perform when interacting with the objects. This fact has recently motivated the sensorimotor approach to the challenging task of automatic object recognition, where both information sources are fused to improve robustness. In this work, the aforementioned paradigm is adopted and current limitations of sensorimotor object recognition research are overcome. Specifically, deep learning is introduced to the problem for the first time, through a number of novel neurobiologically and neurophysiologically inspired architectures that use state-of-the-art neural networks to fuse the available information sources in multiple ways. The proposed methods are evaluated on a large RGB-D corpus, which was collected specifically for the task of sensorimotor object recognition and is made publicly available. Experimental results demonstrate the utility of affordance information for object recognition: its inclusion yields up to a 29% relative reduction in error.
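To make the fusion idea concrete, the following is a minimal sketch of one plausible realization: a two-stream network with late fusion, where an appearance stream and an affordance stream are embedded separately and concatenated before classification. This is an illustrative assumption, not the paper's actual architecture; all module names, layer sizes, and feature dimensions (e.g., appearance_dim, affordance_dim) are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class TwoStreamFusionNet(nn.Module):
    """Hypothetical late-fusion sketch: an appearance stream (e.g., CNN
    features of the object image) and an affordance stream (e.g., encoded
    hand-object interaction cues) are embedded separately, concatenated,
    and classified. Dimensions below are illustrative, not the paper's."""

    def __init__(self, appearance_dim=4096, affordance_dim=256,
                 hidden_dim=512, num_classes=10):
        super().__init__()
        # Appearance branch: projects precomputed visual features.
        self.appearance_branch = nn.Sequential(
            nn.Linear(appearance_dim, hidden_dim), nn.ReLU())
        # Affordance branch: projects precomputed action/affordance features.
        self.affordance_branch = nn.Sequential(
            nn.Linear(affordance_dim, hidden_dim), nn.ReLU())
        # Late fusion by concatenation, followed by a linear classifier.
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, appearance_feats, affordance_feats):
        a = self.appearance_branch(appearance_feats)
        b = self.affordance_branch(affordance_feats)
        return self.classifier(torch.cat([a, b], dim=1))

# Usage with random stand-in features for a batch of 4 samples.
model = TwoStreamFusionNet()
logits = model(torch.randn(4, 4096), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 10])
```

Concatenation is only one fusion option; the same skeleton accommodates earlier fusion (merging intermediate feature maps) or score-level fusion (averaging per-stream class posteriors), which is the sense in which information sources can be fused "in multiple ways".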
