Learning Spatiotemporal Features via Video and Text Pair Discrimination

Current video representations heavily rely on learning from manually annotated video datasets, which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information, such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient, weakly supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a video and its associated text. Specifically, we adopt noise-contrastive estimation to tackle the computational cost imposed by the huge number of pair-instance classes, and we design a practical curriculum learning strategy. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate their effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization for fine-tuning on downstream tasks, yielding a remarkable performance gain for action recognition on UCF101 and HMDB51 compared with existing state-of-the-art self-supervised training methods. In addition, our CPD model achieves a new state of the art for zero-shot action recognition on UCF101 by directly exploiting the learnt visual-textual embeddings. The code will be made available at this https URL.
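
The abstract does not spell out the exact form of the pair-discrimination objective, so the following is only a minimal sketch: it assumes an InfoNCE-style noise-contrastive loss over paired video and text embeddings with in-batch negatives (the paper itself uses NCE over a much larger set of pair-instance classes). The names `cpd_loss` and `temperature` are illustrative, not taken from the paper's code.

```python
# Minimal sketch of a cross-modal pair-discrimination objective.
# Assumption: InfoNCE-style noise-contrastive estimation with in-batch
# negatives; the actual CPD implementation may differ (e.g., memory-bank NCE).
import torch
import torch.nn.functional as F


def cpd_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """video_emb, text_emb: (batch, dim) embeddings of paired clips and texts."""
    # L2-normalize so that dot products become cosine similarities.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # (batch, batch) similarity matrix: diagonal entries are the true
    # video-text pairs, off-diagonal entries serve as "noise" (negative) pairs.
    logits = v @ t.T / temperature
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: pick the matching text for each video and the
    # matching video for each text.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

Under this reading, the learnt visual and textual embeddings live in a shared space, which is what allows the linear-probe evaluation on Kinetics and the zero-shot transfer to UCF101 by comparing clip embeddings against embedded class-name text.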
