Paper information - Is Alice chasing or being chased?: Determining subject and object of activities in videos

Recent progress in video description has shown promising results by combining object/action recognition and natural language processing techniques. However, even the simplest form of generated sentence, the SVO (Subject/Verb/Object) triplet, can be misleading because it lacks any analysis of role relationships. When the system detects the keywords "person", "baby" and "feed", we do not want it to generate "a person feeding a baby" when the actual scene shows the baby trying to share its food. In this paper, we explore role relationships between objects/persons and their use in generating more meaningful video descriptions. More specifically, we confine ourselves to the following problem: identifying subject and object roles in two-person activities. We argue that the subject and object roles have consistent properties across different activities. To that end, we cast this problem as a domain adaptation problem. A novel YouTube SVO dataset is proposed for evaluating methods developed for this problem. The performance of the proposed method is compared against several baseline methods.
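The role ambiguity described above can be made concrete with a minimal sketch. The `SVO` class and the `candidate_triplets` helper are hypothetical names introduced here for illustration; they are not from the paper, and the sketch does not implement the proposed domain adaptation method. It only shows that the same detected keywords admit two role assignments, which is the choice the method has to make.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class SVO:
    """A Subject/Verb/Object triplet (illustrative structure, not the paper's)."""
    subject: str
    verb: str
    obj: str

    def sentence(self) -> str:
        return f"a {self.subject} {self.verb} a {self.obj}"


def candidate_triplets(entities, verb):
    """Enumerate every ordered (subject, object) pair for one detected verb."""
    return [SVO(s, verb, o) for s, o in permutations(entities, 2)]


# Keyword detection alone yields the entities and the verb, not their roles.
candidates = candidate_triplets(["person", "baby"], "feeding")
for c in candidates:
    print(c.sentence())
```

Both candidates are grammatical, so picking between them requires evidence about who acts on whom, which keyword detection by itself does not provide.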
