Fan Yang | Hezheng Lin | Dong Shen | Xing Cheng | Xiangyu Wu
[1] Chen Sun, et al. Multi-modal Transformer for Video Retrieval, 2020, ECCV.
[2] Svetlana Lazebnik, et al. A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics, 2014.
[3] Ioannis Patras, et al. Query and Keyframe Representations for Ad-hoc Video Search, 2017, ICMR.
[4] Gang Wang, et al. Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Andrew Zisserman, et al. Video Google: A Text Retrieval Approach to Object Matching in Videos, 2003, Proceedings Ninth IEEE International Conference on Computer Vision (ICCV).
[6] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.
[7] Linchao Zhu, et al. T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[9] Trevor Darrell, et al. Localizing Moments in Video with Natural Language, 2017 IEEE International Conference on Computer Vision (ICCV).
[10] Lei Zhang, et al. VinVL: Making Visual Representations Matter in Vision-Language Models, 2021, arXiv.
[11] Chang Zhou, et al. CogView: Mastering Text-to-Image Generation via Transformers, 2021, NeurIPS.
[12] Gunhee Kim, et al. A Joint Sequence Fusion Model for Video Question Answering and Retrieval, 2018, ECCV.
[13] Nan Duan, et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, 2021, Neurocomputing.
[14] Duy-Dinh Le, et al. NII-HITACHI-UIT at TRECVID 2017, 2017, TRECVID.
[15] Yang Liu, et al. Use What You Have: Video Retrieval Using Representations from Collaborative Experts, 2019, BMVC.
[16] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Andrew Zisserman, et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval, 2021, arXiv.
[18] Qiang Yang, et al. An Overview of Multi-task Learning, 2018.
[19] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation, 2011, ACL.
[20] Yi Yang, et al. ActBERT: Learning Global-Local Video-Text Representations, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Sebastian Ruder, et al. An Overview of Multi-Task Learning in Deep Neural Networks, 2017, arXiv.
[22] Enhua Wu, et al. Squeeze-and-Excitation Networks, 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.
[23] Dan Klein, et al. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network, 2003, NAACL.
[24] Adwait Ratnaparkhi, et al. A Maximum Entropy Model for Part-Of-Speech Tagging, 1996, EMNLP.
[25] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.
[26] Christopher Joseph Pal, et al. Movie Description, 2016, International Journal of Computer Vision.
[27] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.
[28] Yueting Zhuang, et al. Hierarchical Cross-Modal Graph Consistency Learning for Video-Text Retrieval, 2021, SIGIR.
[29] Bernt Schiele, et al. The Long-Short Story of Movie Description, 2015, GCPR.
[30] Amit K. Roy-Chowdhury, et al. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval, 2018, ICMR.
[31] David J. Fleet, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives, 2017, BMVC.
[32] Bernard Ghanem, et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Pengfei Xiong, et al. CLIP2Video: Mastering Video-Text Retrieval via Image CLIP, 2021, arXiv.
[34] Zhe Gan, et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Aleksandr Petiushko, et al. MDMMT: Multidomain Multimodal Transformer for Video Retrieval, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[36] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[37] Shengsheng Qian, et al. HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[38] Shizhe Chen, et al. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[40] Cordelia Schmid, et al. ViViT: A Video Vision Transformer, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[41] Zhe Zhao, et al. Modeling Task Relationships in Multi-task Learning with Multi-gate Mixture-of-Experts, 2018, KDD.
[42] Lukasz Kaiser, et al. Attention Is All You Need, 2017, NIPS.