Recognizing visual composite in real images

Automatically discovering and recognizing the main structured visual pattern of an image is a challenging problem. The most difficulties are how to find the component objects and how to recognize the interaction among these objects. The component objects of the structured visual pattern have consistent 3D spatial co-occurrence layout across images, which manifest themselves as a predictable pattern called visual composite. In this paper, we propose a visual composite recognition model to automatically discover and recognize the visual composite of an image. Our model firstly learns 3D spatial co-occurrence statistics among objects to discover the potential structured visual pattern of an image so that it captures the component objects of visual composite. Secondly, we construct a feedforward architecture using the proposed factored three-way interaction machine to recognize the visual composite, which casts the recognition problem as a structured prediction task. It predicts the visual composite by maximizing the probability of the correct structured label given the component objects and their 3D spatial context. Experiments conducted on a six-class sports dataset and a phrasal recognition dataset respectively demonstrate the encouraging performance of our model in discovery precision and recognition accuracy compared with competing approaches.

[1]  David A. McAllester,et al.  Object Detection with Discriminatively Trained Part Based Models , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Ankush Gupta,et al.  From Image Annotation to Image Description , 2012, ICONIP.

[3]  Greg Mori,et al.  From Subcategories to Visual Composites: A Multi-level Framework for Object Detection , 2013, 2013 IEEE International Conference on Computer Vision.

[4]  Geoffrey E. Hinton,et al.  Gated Softmax Classification , 2010, NIPS.

[5]  Tsuhan Chen,et al.  Automatic discovery of groups of objects for scene understanding , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[6]  Alexei A. Efros,et al.  Recovering Occlusion Boundaries from an Image , 2011, International Journal of Computer Vision.

[7]  Tsuhan Chen,et al.  Efficient Kernels for identifying unbounded-order spatial features , 2009, CVPR.

[8]  Ali Farhadi,et al.  Recognition using visual phrases , 2011, CVPR 2011.

[9]  Marcus Rohrbach,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[10]  Deva Ramanan,et al.  Detecting Actions, Poses, and Objects with Relational Phraselets , 2012, ECCV.

[11]  Shuai Jiang,et al.  Main objects interaction activity recognition in real images , 2015, Neural Computing and Applications.

[12]  Ronan Collobert,et al.  Simple Image Description Generator via a Linear Phrase-Based Approach , 2014, ICLR.

[13]  Tsuhan Chen,et al.  Efficient Kernels for identifying unbounded-order spatial features , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[14]  Joshua B. Tenenbaum,et al.  Learning to share visual appearance for multiclass object detection , 2011, CVPR 2011.

[15]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[16]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Larry S. Davis,et al.  Observing Human-Object Interactions: Using Spatial and Functional Compatibility for Recognition , 2009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Wei Xu,et al.  Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN) , 2014, ICLR.

[19]  Fei-FeiLi,et al.  Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses , 2012 .

[20]  Charless C. Fowlkes,et al.  Discriminative Models for Multi-Class Object Layout , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[21]  Samy Bengio,et al.  Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Fei-Fei Li,et al.  Recognizing Human-Object Interactions in Still Images by Modeling the Mutual Context of Objects and Human Poses , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Fei-Fei Li,et al.  Grouplet: A structured image representation for recognizing human and object interactions , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[24]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.