Localizing the Common Action Among a Few Videos

This paper strives to localize the temporal extent of an action in a long untrimmed video. Whereas existing work leverages many training examples annotated with the start, the end, and/or the class of the action, we propose few-shot common action localization: the start and end of an action in a long untrimmed video are determined from just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture that aligns representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module that simultaneously complements the representations of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module that weighs the importance of the different support videos. Evaluation of few-shot common action localization on untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.
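To make the three-module design concrete, the sketch below composes a mutual enhancement module, a pairwise matching module, and a progressive alignment module on top of pre-extracted 3D-ConvNet features. This is a minimal illustration, not the authors' implementation: the feature dimension, the use of cross-attention for enhancement, cosine similarity for matching, the number of fusion steps, and the per-segment output head are all assumptions made for clarity.

```python
# Illustrative sketch of the described architecture, assuming C3D-style features of
# dimension D for both branches. Module internals are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualEnhancement(nn.Module):
    """Cross-attention letting support and query features complement each other."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, support, query):
        # support: (K, D) pooled features of K trimmed support videos
        # query:   (T, D) features of T temporal segments of the untrimmed query
        s, q = support.unsqueeze(0), query.unsqueeze(0)
        s_enh, _ = self.attn(s, q, q)   # support attends to query segments
        q_enh, _ = self.attn(q, s, s)   # query attends to support videos
        return s_enh.squeeze(0), q_enh.squeeze(0)


class PairwiseMatching(nn.Module):
    """Weighs each support video by its similarity to the query representation."""
    def forward(self, support, query):
        sim = F.cosine_similarity(support, query.mean(0, keepdim=True), dim=-1)  # (K,)
        weights = sim.softmax(dim=0)
        return (weights.unsqueeze(-1) * support).sum(0)  # weighted support prototype, (D,)


class ProgressiveAlignment(nn.Module):
    """Iteratively fuses the weighted support prototype into the query branch."""
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, prototype, query):
        for _ in range(self.steps):
            expanded = prototype.expand_as(query)  # (T, D)
            query = query + self.fuse(torch.cat([query, expanded], dim=-1))
        return query  # aligned query segment features


class CommonActionLocalizer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.enhance = MutualEnhancement(dim)
        self.match = PairwiseMatching()
        self.align = ProgressiveAlignment(dim)
        self.head = nn.Linear(dim, 2)  # per-segment localization scores (placeholder head)

    def forward(self, support_feats, query_feats):
        s, q = self.enhance(support_feats, query_feats)
        prototype = self.match(s, q)
        q = self.align(prototype, q)
        return self.head(q)  # (T, 2)


# Usage with random tensors standing in for 3D-ConvNet outputs:
support = torch.randn(5, 512)   # 5 trimmed support videos
query = torch.randn(64, 512)    # 64 segments of the untrimmed query video
print(CommonActionLocalizer()(support, query).shape)  # torch.Size([64, 2])
```

The key property the sketch captures is that no class label is used anywhere: the query segments are scored purely by how well they align with the few support videos.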
