Weakly-Supervised Action Recognition and Localization via Knowledge Transfer

Action recognition and localization have attracted much attention in the past decade. However, training models for untrimmed videos typically requires large-scale temporal annotations of action instances, which is impractical in many real-world applications. To alleviate this problem, we propose KTUntrimmedNet, a novel weakly-supervised action recognition framework for untrimmed videos that uses only video-level annotations and transfers information from publicly available trimmed videos to assist model learning. A two-stage method is designed to make the transfer effective. First, the trimmed and untrimmed videos are clustered to find similar classes between them, so as to avoid negative information transfer from the trimmed data. Second, we design an invariant module to find features common to trimmed and untrimmed videos, which improves performance. Extensive experiments on the standard benchmark datasets THUMOS14 and ActivityNet1.3 demonstrate the efficacy of our proposed method compared with existing state-of-the-art methods.
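
The first stage, finding trimmed-video classes that resemble the untrimmed ones, can be illustrated with a small sketch. The abstract does not specify the clustering algorithm, the features, or any threshold, so everything below is an assumption: the sketch swaps in a simpler stand-in (cosine similarity between per-class mean feature vectors, with a hypothetical cutoff of 0.5) rather than the clustering used in KTUntrimmedNet. The function names, class names, and toy feature vectors are all hypothetical.

```python
import math


def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_similar_classes(trimmed, untrimmed, threshold=0.5):
    """Keep trimmed classes whose centroid is close to some untrimmed centroid.

    `trimmed` and `untrimmed` map a class name to a list of feature vectors.
    The threshold is an assumed, illustrative value.
    """
    untrimmed_centroids = [centroid(v) for v in untrimmed.values()]
    selected = []
    for name, vectors in trimmed.items():
        c = centroid(vectors)
        best = max(cosine(c, u) for u in untrimmed_centroids)
        if best >= threshold:
            selected.append(name)
    return selected


# Toy 3-D features; real video features would be far higher-dimensional.
trimmed_features = {
    "class_a": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "class_b": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "class_c": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
untrimmed_features = {
    "class_x": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.0]],
    "class_y": [[0.0, 1.0, 0.1], [0.1, 1.0, 0.0]],
}

selected = select_similar_classes(trimmed_features, untrimmed_features)
print(selected)
```

In this toy setup, "class_c" has no close counterpart among the untrimmed classes and is filtered out, which mirrors the stated goal of avoiding negative transfer from unrelated trimmed data.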
