Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings

We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space, that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities. We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the generic task of cross-modal video retrieval on the MSR-VTT dataset.

[1]  Ivan Laptev,et al.  HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[2]  Albert Gordo,et al.  Beyond Instance-Level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Naokazu Yokoya,et al.  Learning Joint Representations of Videos and Sentences with Web Image Search , 2016, ECCV Workshops.

[5]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Tao Mei,et al.  MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Yang Song,et al.  Learning Fine-Grained Image Similarity with Deep Ranking , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[8]  Gunhee Kim,et al.  A Joint Sequence Fusion Model for Video Question Answering and Retrieval , 2018, ECCV.

[9]  Susanne Westphal,et al.  The “Something Something” Video Database for Learning and Evaluating Visual Common Sense , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10]  Trevor Darrell,et al.  YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition , 2013, 2013 IEEE International Conference on Computer Vision.

[11]  James M. Rehg,et al.  Action2Vec: A Crossmodal Embedding Approach to Action Learning , 2019, ArXiv.

[12]  Svetlana Lazebnik,et al.  Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections , 2014, ECCV.

[13]  Dima Damen,et al.  Learning Visual Actions Using Multiple Verb-Only Labels , 2019, BMVC.

[14]  Albert Gordo,et al.  Deep Image Retrieval: Learning Global Representations for Image Search , 2016, ECCV.

[15]  Nir Ailon,et al.  Deep Metric Learning Using Triplet Network , 2014, SIMBAD.

[16]  Wei-Lun Chao,et al.  An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild , 2016, ECCV.

[17]  James Philbin,et al.  FaceNet: A unified embedding for face recognition and clustering , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[18]  Xirong Li,et al.  Dual Encoding for Zero-Example Video Retrieval , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Bernt Schiele,et al.  Recognizing Fine-Grained and Composite Activities Using Hand-Centric Features and Script Data , 2015, International Journal of Computer Vision.

[20]  Amit K. Roy-Chowdhury,et al.  Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval , 2018, ICMR.

[21]  Ruslan Salakhutdinov,et al.  Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.

[22]  Leonid Sigal,et al.  Learning Language-Visual Embedding for Movie Understanding with Natural-Language , 2016, ArXiv.

[23]  Ivan Laptev,et al.  Learning a Text-Video Embedding from Incomplete and Heterogeneous Data , 2018, ArXiv.

[24]  Kaiqi Huang,et al.  Beyond Triplet Loss: A Deep Quadruplet Network for Person Re-identification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Dima Damen,et al.  Scaling Egocentric Vision: The EPIC-KITCHENS Dataset , 2018, ArXiv.

[26]  Svetlana Lazebnik,et al.  Enhancing Video Summarization via Vision-Language Embedding , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Venkatesh Saligrama,et al.  Zero-Shot Learning via Semantic Similarity Embedding , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  David J. Fleet,et al.  ON THE EFFECTIVENESS OF TASK GRANULARITY FOR TRANSFER LEARNING , 2018, 1804.09235.

[29]  Ali Farhadi,et al.  Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.

[30]  Ondrej Chum,et al.  CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples , 2016, ECCV.

[31]  Josef Sivic,et al.  NetVLAD: CNN Architecture for Weakly Supervised Place Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[32]  Xun Wang,et al.  Dual Dense Encoding for Zero-Example Video Retrieval , 2018, ArXiv.

[33]  Liwei Wang,et al.  Learning Two-Branch Neural Networks for Image-Text Matching Tasks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[34]  Ivan Laptev,et al.  Learnable pooling with Context Gating for video classification , 2017, ArXiv.

[35]  Jordi Torres,et al.  Cross-modal Embeddings for Video and Audio Retrieval , 2018, ECCV Workshops.

[36]  Kihyuk Sohn,et al.  Improved Deep Metric Learning with Multi-class N-pair Loss Objective , 2016, NIPS.

[37]  Xirong Li,et al.  Predicting Visual Features From Text for Image and Video Caption Retrieval , 2017, IEEE Transactions on Multimedia.

[38]  Naila Murray,et al.  Re-ID done right: towards good practices for person re-identification , 2018, ArXiv.

[39]  Wei Chen,et al.  Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework , 2015, AAAI.

[40]  Lucas Beyer,et al.  In Defense of the Triplet Loss for Person Re-Identification , 2017, ArXiv.

[41]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.