Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval

Given a gallery of uncaptioned video sequences, this paper considers the task of retrieving videos based on their relevance to an unseen text query. To compensate for the lack of annotations, we rely instead on a related video gallery composed of video-caption pairs, termed the source gallery, albeit with a domain gap between its videos and those in the target gallery. We thus introduce the problem of Unsupervised Domain Adaptation for Cross-modal Video Retrieval, along with a new benchmark on fine-grained actions. We propose a novel iterative domain alignment method by means of pseudo-labelling target videos and cross-domain (i.e. source-target) ranking. Our approach adapts the embedding space to the target gallery, consistently outperforming source-only as well as marginal and conditional alignment methods. ar X iv :2 11 0. 12 81 2v 1 [ cs .C V ] 2 5 O ct 2 02 1 Domain Adaptation in Multi-View Embedding for Cross-Modal Video Retrieval

[1]  Ming-Hsuan Yang,et al.  Unsupervised Domain Adaptation for Spatio-Temporal Action Localization , 2020, BMVC.

[2]  Chuan Chen,et al.  Learning Semantic Representations for Unsupervised Domain Adaptation , 2018, ICML.

[3]  Hazel Doughty,et al.  Rescaling Egocentric Vision , 2020, ArXiv.

[4]  Silvio Savarese,et al.  Learning Transferrable Representations for Unsupervised Domain Adaptation , 2016, NIPS.

[5]  Shizhe Chen,et al.  Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Kate Saenko,et al.  Deep CORAL: Correlation Alignment for Deep Domain Adaptation , 2016, ECCV Workshops.

[7]  Gaurav Sharma,et al.  Shuffle and Attend: Video Domain Adaptation , 2020, ECCV.

[8]  Jo Yew Tham,et al.  Learning Attribute Representations with Localization for Flexible Fashion Search , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  François Laviolette,et al.  Domain-Adversarial Training of Neural Networks , 2015, J. Mach. Learn. Res..

[10]  Dapeng Chen,et al.  Mutual Mean-Teaching: Pseudo Label Refinery for Unsupervised Domain Adaptation on Person Re-identification , 2020, ICLR.

[11]  Shaogang Gong,et al.  Instance-Guided Context Rendering for Cross-Domain Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[12]  Sicheng Zhao,et al.  Spatio-temporal Contrastive Domain Adaptation for Action Recognition , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[14]  Ivan Laptev,et al.  Learning a Text-Video Embedding from Incomplete and Heterogeneous Data , 2018, ArXiv.

[15]  Trevor Darrell,et al.  Adversarial Discriminative Domain Adaptation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Tatsuya Harada,et al.  Asymmetric Tri-training for Unsupervised Domain Adaptation , 2017, ICML.

[17]  Xirong Li,et al.  Dual Encoding for Video Retrieval by Text , 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  K. S. Venkatesh,et al.  Deep Domain Adaptation in Action Space , 2018, BMVC.

[19]  Samuel Albanie,et al.  Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Junzhou Huang,et al.  Progressive Feature Alignment for Unsupervised Domain Adaptation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Liang Zheng,et al.  Rethinking Triplet Loss for Domain Adaptation , 2020, IEEE Transactions on Circuits and Systems for Video Technology.

[22]  Le Wang,et al.  Ladder Loss for Coherent Visual-Semantic Embedding , 2019, AAAI.

[23]  Yuxin Peng,et al.  Unsupervised Cross-Media Retrieval Using Domain Adaptation With Scene Graph , 2020, IEEE Transactions on Circuits and Systems for Video Technology.

[24]  Yi Yang,et al.  Contrastive Adaptation Network for Unsupervised Domain Adaptation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Chen Sun,et al.  Multi-modal Transformer for Video Retrieval , 2020, ECCV.

[26]  Ghassan AlRegib,et al.  Action Segmentation with Mixed Temporal Domain Adaptation , 2020, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV).

[27]  Chong-Wah Ngo,et al.  Transferrable Prototypical Networks for Unsupervised Domain Adaptation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[28]  Michael I. Jordan,et al.  Learning Transferable Features with Deep Adaptation Networks , 2015, ICML.

[29]  James M. Rehg,et al.  Action2Vec: A Crossmodal Embedding Approach to Action Learning , 2019, ArXiv.

[30]  Gabriela Csurka,et al.  Domain Adaptation with a Domain Specific Class Means Classifier , 2014, ECCV Workshops.

[31]  Dima Damen,et al.  EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[32]  Yuchen Zhang,et al.  Bridging Theory and Algorithm for Domain Adaptation , 2019, ICML.

[33]  Ruxin Chen,et al.  Temporal Attentive Alignment for Large-Scale Video Domain Adaptation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[34]  Michael I. Jordan,et al.  Conditional Adversarial Domain Adaptation , 2017, NeurIPS.

[35]  Barbara Caputo,et al.  Frustratingly Easy NBNN Domain Adaptation , 2013, 2013 IEEE International Conference on Computer Vision.

[36]  Ivan Laptev,et al.  HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37]  Ivan Laptev,et al.  Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Amit K. Roy-Chowdhury,et al.  Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval , 2018, ICMR.

[39]  Andrew Zisserman,et al.  End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40]  Linchao Zhu,et al.  T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[41]  Slawomir Bak,et al.  Domain Adaptation through Synthesis for Unsupervised Person Re-identification , 2018, ECCV.

[42]  Nicu Sebe,et al.  Unsupervised Domain Adaptation Using Feature-Whitening and Consensus Loss , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[43]  Chunhua Shen,et al.  Self-Training With Progressive Augmentation for Unsupervised Cross-Domain Person Re-Identification , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[44]  Venkatesh Saligrama,et al.  Zero-Shot Learning via Semantic Similarity Embedding , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[45]  Jun Zhu,et al.  Cluster Alignment With a Teacher for Unsupervised Domain Adaptation , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[46]  Yang Liu,et al.  Use What You Have: Video retrieval using representations from collaborative experts , 2019, BMVC.

[47]  Yin Li,et al.  Learning Deep Structure-Preserving Image-Text Embeddings , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[48]  Dong Xu,et al.  Collaborative and Adversarial Network for Unsupervised Domain Adaptation , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[49]  Yi Yang,et al.  Image-Image Domain Adaptation with Preserved Self-Similarity and Domain-Dissimilarity for Person Re-identification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[50]  D. Damen,et al.  Multi-Modal Domain Adaptation for Fine-Grained Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Yu Wu,et al.  Exploit the Unknown Gradually: One-Shot Video-Based Person Re-identification by Stepwise Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[52]  Zhe Gan,et al.  Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Sid Ying-Ze Bao,et al.  Action Segmentation With Joint Self-Supervised Temporal Domain Adaptation , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[54]  Liang Zheng,et al.  Unsupervised Person Re-identification: Clustering and Fine-tuning , 2017 .

[55]  D. Damen,et al.  On Semantic Similarity in Video Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Yannis Avrithis,et al.  Label Propagation for Deep Semi-Supervised Learning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Albert Gordo,et al.  Beyond Instance-Level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Kate Saenko,et al.  Return of Frustratingly Easy Domain Adaptation , 2015, AAAI.

[59]  Nicolas Courty,et al.  DeepJDOT: Deep Joint distribution optimal transport for unsupervised domain adaptation , 2018, ECCV.

[60]  Emanuele Alberti,et al.  Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment , 2021, ArXiv.

[61]  Feng Zhu,et al.  Structured Domain Adaptation for Unsupervised Person Re-identification , 2020, ArXiv.

[62]  Yunchao Wei,et al.  Self-Similarity Grouping: A Simple Unsupervised Cross Domain Adaptation Approach for Person Re-Identification , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[63]  Dima Damen,et al.  Fine-Grained Action Retrieval Through Multiple Parts-of-Speech Embeddings , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).