论文信息 - Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning

Text-to-Clip Video Retrieval with Early Fusion and Re-Captioning

We propose a novel method capable of retrieving clips from untrimmed videos based on natural language queries. This cross-modal retrieval task plays a key role in visual-semantic understanding, and requires localizing clips in time and computing their similarity to the query sentence. Current methods generate sentence and video embeddings and then compare them using a late fusion approach, but this ignores the word order in queries and prevents more fine-grained comparisons. Motivated by the need for fine-grained multi-modal feature fusion, we propose a novel early fusion embedding approach that combines video and language information at the word level. Furthermore, we use the inverse task of dense video captioning as a side-task to improve the learned embedding. Our full model combines these components with an efficient proposal pipeline that performs accurate localization of potential video clips. We present a comprehensive experimental validation on two large-scale text-to-clip datasets (Charades-STA and DiDeMo) and attain state-of-the-art retrieval results with our model.

[1] Marcus Rohrbach,et al. A Multi-scale Multiple Instance Video Description Network , 2015, ArXiv.

[2] Zhou Su,et al. Weakly Supervised Dense Video Captioning , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Kate Saenko,et al. R-C3D: Region Convolutional 3D Network for Temporal Activity Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[4] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[5] Jeffrey Dean,et al. Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[6] Eyke Hüllermeier,et al. Label ranking by learning pairwise preferences , 2008, Artif. Intell..

[7] Shih-Fu Chang,et al. Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8] Kate Saenko,et al. Joint Event Detection and Description in Continuous Video Streams , 2018, 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW).

[9] Bernard Ghanem,et al. DAPs: Deep Action Proposals for Action Understanding , 2016, ECCV.

[10] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[11] Samy Bengio,et al. Show and tell: A neural image caption generator , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Suman Saha,et al. AMTnet: Action-Micro-Tube Regression by End-to-end Trainable Deep Architecture , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[13] Shih-Fu Chang,et al. CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[14] Kate Saenko,et al. Top-Down Visual Saliency Guided by Captions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15] Richard Socher,et al. Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.

[16] Subhashini Venugopalan,et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[17] Kate Saenko,et al. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.

[18] Sanja Fidler,et al. Order-Embeddings of Images and Language , 2015, ICLR.

[19] Christopher Joseph Pal,et al. Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[20] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21] Ming Shao,et al. A Multi-stream Bi-directional Recurrent Neural Network for Fine-Grained Action Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22] Cordelia Schmid,et al. Action Tubelet Detector for Spatio-Temporal Action Localization , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23] Vineet Gandhi,et al. Learning Unsupervised Visual Grounding Through Semantic Self-Supervision , 2018, IJCAI.

[24] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[25] Trevor Darrell,et al. Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26] Suman Saha,et al. Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos , 2016, BMVC.

[27] Xiao Lin,et al. Leveraging Visual Question Answering for Image-Caption Ranking , 2016, ECCV.

[28] Liwei Wang,et al. Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29] Juan Carlos Niebles,et al. Dense-Captioning Events in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[30] Bingbing Ni,et al. Temporal Action Localization with Pyramid of Score Distribution Features , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Trevor Darrell,et al. Natural Language Object Retrieval , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32] Ramakant Nevatia,et al. TALL: Temporal Activity Localization via Language Query , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[33] Kaiming He,et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34] Limin Wang,et al. Temporal Action Detection with Structured Segment Networks , 2017, International Journal of Computer Vision.

[35] David A. Shamma,et al. YFCC100M , 2015, Commun. ACM.

[36] Xiaoou Tang,et al. Action Recognition and Detection by Combining Motion and Appearance Features , 2014 .

[37] Bowen Zhou,et al. A Structured Self-attentive Sentence Embedding , 2017, ICLR.

[38] Lin Ma,et al. Multimodal Convolutional Neural Networks for Matching Image and Sentence , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39] Rui Hou,et al. Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[40] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[41] Li Fei-Fei,et al. DenseCap: Fully Convolutional Localization Networks for Dense Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Cordelia Schmid,et al. Learning to Track for Spatio-Temporal Action Localization , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[43] Jitendra Malik,et al. Finding action tubes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[45] Saurabh Singh,et al. Where to Look: Focus Regions for Visual Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Trevor Darrell,et al. Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[47] Anton van den Hengel,et al. Visual Question Answering as a Meta Learning Task , 2017, ECCV.

[48] Antonio Torralba,et al. See, Hear, and Read: Deep Aligned Representations , 2017, ArXiv.

[49] Jongwook Choi,et al. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[51] Andrew Zisserman,et al. Objects that Sound , 2017, ECCV.

[52] Nir Ailon,et al. Deep Metric Learning Using Triplet Network , 2014, SIMBAD.

[53] Stan Sclaroff,et al. Learning Activity Progression in LSTMs for Activity Detection and Early Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[54] Li Fei-Fei,et al. End-to-End Learning of Action Detection from Frame Glimpses in Videos , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55] Fei-Fei Li,et al. Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[56] Shijian Lu,et al. TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[57] Gang Yu,et al. Fast action proposals for human action detection and search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Donghyun Kim,et al. Excitation Backprop for RNNs , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[59] Bernard Ghanem,et al. SST: Single-Stream Temporal Action Proposals , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60] Lei Zhang,et al. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[61] Trevor Darrell,et al. Grounding of Textual Phrases in Images by Reconstruction , 2015, ECCV.

[62] Geoffrey Zweig,et al. From captions to visual concepts and back , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[63] Wei Liu,et al. SSD: Single Shot MultiBox Detector , 2015, ECCV.

[64] Hugo Larochelle,et al. Modulating early visual processing by language , 2017, NIPS.

[65] James Philbin,et al. FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[66] Xu Zhao,et al. Single Shot Temporal Action Detection , 2017, ACM Multimedia.