Localizing the Common Action Among a Few Videos

This paper strives to localize the temporal extent of an action in a long untrimmed video. Whereas existing work leverages many training examples annotated with the start, the end, and/or the class of the action, we propose few-shot common action localization: the start and end of an action in a long untrimmed video are determined from just a handful of trimmed video examples containing the same action, without knowing their common class label. To address this task, we introduce a new 3D convolutional network architecture that aligns representations from the support videos with the relevant query video segments. The network contains: (i) a mutual enhancement module that simultaneously complements the representations of the few trimmed support videos and the untrimmed query video; (ii) a progressive alignment module that iteratively fuses the support videos into the query branch; and (iii) a pairwise matching module that weighs the importance of the different support videos. Evaluation of few-shot common action localization on untrimmed videos containing a single or multiple action instances demonstrates the effectiveness and general applicability of our proposal.
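To make the three-module design concrete, the sketch below composes a mutual enhancement module, a pairwise matching module, and a progressive alignment module on top of pre-extracted 3D-ConvNet features. This is a minimal illustration, not the authors' implementation: the feature dimension, the use of cross-attention for enhancement, cosine similarity for matching, the number of fusion steps, and the per-segment output head are all assumptions made for clarity.

```python
# Illustrative sketch of the described architecture, assuming C3D-style features of
# dimension D for both branches. Module internals are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MutualEnhancement(nn.Module):
    """Cross-attention letting support and query features complement each other."""
    def __init__(self, dim):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, support, query):
        # support: (K, D) pooled features of K trimmed support videos
        # query:   (T, D) features of T temporal segments of the untrimmed query
        s, q = support.unsqueeze(0), query.unsqueeze(0)
        s_enh, _ = self.attn(s, q, q)   # support attends to query segments
        q_enh, _ = self.attn(q, s, s)   # query attends to support videos
        return s_enh.squeeze(0), q_enh.squeeze(0)


class PairwiseMatching(nn.Module):
    """Weighs each support video by its similarity to the query representation."""
    def forward(self, support, query):
        sim = F.cosine_similarity(support, query.mean(0, keepdim=True), dim=-1)  # (K,)
        weights = sim.softmax(dim=0)
        return (weights.unsqueeze(-1) * support).sum(0)  # weighted support prototype, (D,)


class ProgressiveAlignment(nn.Module):
    """Iteratively fuses the weighted support prototype into the query branch."""
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, prototype, query):
        for _ in range(self.steps):
            expanded = prototype.expand_as(query)  # (T, D)
            query = query + self.fuse(torch.cat([query, expanded], dim=-1))
        return query  # aligned query segment features


class CommonActionLocalizer(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.enhance = MutualEnhancement(dim)
        self.match = PairwiseMatching()
        self.align = ProgressiveAlignment(dim)
        self.head = nn.Linear(dim, 2)  # per-segment localization scores (placeholder head)

    def forward(self, support_feats, query_feats):
        s, q = self.enhance(support_feats, query_feats)
        prototype = self.match(s, q)
        q = self.align(prototype, q)
        return self.head(q)  # (T, 2)


# Usage with random tensors standing in for 3D-ConvNet outputs:
support = torch.randn(5, 512)   # 5 trimmed support videos
query = torch.randn(64, 512)    # 64 segments of the untrimmed query video
print(CommonActionLocalizer()(support, query).shape)  # torch.Size([64, 2])
```

The key property the sketch captures is that no class label is used anywhere: the query segments are scored purely by how well they align with the few support videos.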
