Towards Context-Aware Interaction Recognition for Visual Relationship Detection

Recognizing how objects interact with each other is a crucial task in visual recognition. If we define the context of the interaction to be the objects involved, then most current methods can be categorized as either: (i) training a single classifier on the combination of the interaction and its context; or (ii) aiming to recognize the interaction independently of its explicit context. Both methods suffer limitations: the former scales poorly with the number of combinations and fails to generalize to unseen combinations, while the latter often leads to poor interaction recognition performance due to the difficulty of designing a contextindependent interaction classifier.,,To mitigate those drawbacks, this paper proposes an alternative, context-aware interaction recognition framework. The key to our method is to explicitly construct an interaction classifier which combines the context, and the interaction. The context is encoded via word2vec into a semantic space, and is used to derive a classification result for the interaction. The proposed method still builds one classifier for one interaction (as per type (ii) above), but the classifier built is adaptive to context via weights which are context dependent. The benefit of using the semantic space is that it naturally leads to zero-shot generalizations in which semantically similar contexts (subject-object pairs) can be recognized as suitable contexts for an interaction, even if they were not observed in the training set. Our method also scales with the number of interaction-context pairs since our model parameters do not increase with the number of interactions. Thus our method avoids the limitation of both approaches. We demonstrate experimentally that the proposed framework leads to improved performance for all investigated interaction representations and datasets.

[1]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Ali Farhadi,et al.  Recognition using visual phrases , 2011, CVPR 2011.

[3]  Larry S. Davis,et al.  Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers , 2008, ECCV.

[4]  Massimiliano Pontil,et al.  Regularized multi--task learning , 2004, KDD.

[5]  Peng Wang,et al.  Ask Me Anything: Free-Form Visual Question Answering Based on Knowledge from External Sources , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Mario Fritz,et al.  Ask Your Neurons: A Neural-Based Approach to Answering Questions about Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[7]  Xiaogang Wang,et al.  Scene Graph Generation from Objects, Phrases and Region Captions , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[8]  Fei-Fei Li,et al.  Deep visual-semantic alignments for generating image descriptions , 2015, CVPR.

[9]  Leonidas J. Guibas,et al.  Human action recognition by learning bases of action attributes and parts , 2011, 2011 International Conference on Computer Vision.

[10]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11]  Samy Bengio,et al.  Learning semantic relationships for better action retrieval in images , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Svetlana Lazebnik,et al.  Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering , 2016, ECCV.

[13]  Jiaxuan Wang,et al.  HICO: A Benchmark for Recognizing Human-Object Interactions in Images , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[14]  Ali Farhadi,et al.  VisKE: Visual knowledge extraction and question answering by visual verification of relation phrases , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Svetlana Lazebnik,et al.  Phrase Localization and Visual Relationship Detection with Comprehensive Linguistic Cues , 2016, ArXiv.

[16]  Trevor Darrell,et al.  Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[17]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Andrea Vedaldi,et al.  Dynamic Image Networks for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Richard Socher,et al.  Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Trevor Darrell,et al.  Modeling Relationships in Referential Expressions with Compositional Modular Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Michael S. Bernstein,et al.  Visual Relationship Detection with Language Priors , 2016, ECCV.

[22]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[23]  Franco Turini,et al.  Time-Annotated Sequences for Medical Data Mining , 2007 .

[24]  Andrew Y. Ng,et al.  Transfer learning for text classification , 2005, NIPS.

[25]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[26]  Ross B. Girshick,et al.  Fast R-CNN , 2015, 1504.08083.

[27]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[28]  Kilian Q. Weinberger,et al.  Large Margin Multi-Task Metric Learning , 2010, NIPS.

[29]  Lin Ma,et al.  Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[30]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[31]  Barbara Caputo,et al.  Learning to Learn, from Transfer Learning to Domain Adaptation: A Unifying Perspective , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[32]  Chunhua Shen,et al.  What Value Do Explicit High Level Concepts Have in Vision to Language Problems? , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Bowen Zhang,et al.  Real-Time Action Recognition with Enhanced Motion Vector CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Xiaogang Wang,et al.  ViP-CNN: Visual Phrase Guided Convolutional Neural Network , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Richard S. Zemel,et al.  Image Question Answering: A Visual Semantic Embedding Model and a New Dataset , 2015, ArXiv.

[36]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Eric P. Xing,et al.  Deep Variation-Structured Reinforcement Learning for Visual Relationship and Attribute Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Oriol Vinyals,et al.  Matching Networks for One Shot Learning , 2016, NIPS.

[39]  Ramesh Nallapati,et al.  A Comparative Study of Methods for Transductive Transfer Learning , 2007, Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007).

[40]  Shih-Fu Chang,et al.  Visual Translation Embedding Network for Visual Relation Detection , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).