Is Alice chasing or being chased?: Determining subject and object of activities in videos

Recent progress in video description has shown promising results by combining object/action recognition with natural language processing techniques. However, even the simplest form of generated sentence, the SVO (Subject/Verb/Object) triplet, can be misleading because it lacks any analysis of role relationships. When the system detects the keywords "person", "baby" and "feed", we do not want it to generate "a person feeding a baby" when the actual scene shows the baby trying to share the food. In this paper, we explore role relationships between objects/persons and their use in generating more meaningful video descriptions. More specifically, we confine ourselves to the following problem: identifying subject and object roles in two-person activities. We argue that the subject and object roles have consistent properties across different activities. To that end, we cast this problem as a domain adaptation problem. We propose a novel YouTube SVO dataset for evaluating methods developed for this problem, and compare the performance of the proposed method against several baseline methods.
