暂无分享,去创建一个
Pengfei Xiong | Luhui Xu | Han Fang | Yu Chen | Luhui Xu | Han Fang | Pengfei Xiong | Yu Chen
[1] Shengsheng Qian,et al. HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[2] Shizhe Chen,et al. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Bowen Zhang,et al. Cross-Modal and Hierarchical Modeling of Video and Text , 2018, ECCV.
[4] Andrew Zisserman,et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[5] Bernt Schiele,et al. The Long-Short Story of Movie Description , 2015, GCPR.
[6] Jitendra Malik,et al. SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[7] Heng Wang,et al. Large-Scale Weakly-Supervised Pre-Training for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Nan Duan,et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation , 2020, ArXiv.
[9] Aleksandr Petiushko,et al. MDMMT: Multidomain Multimodal Transformer for Video Retrieval , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).
[10] Georg Heigold,et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2021, ICLR.
[11] Ivan Laptev,et al. Learning a Text-Video Embedding from Incomplete and Heterogeneous Data , 2018, ArXiv.
[12] Hugo Terashima-Mar'in,et al. A Straightforward Framework For Video Retrieval Using CLIP , 2021, MCPR.
[13] Limin Wang,et al. Learning Spatiotemporal Features via Video and Text Pair Discrimination , 2020, ArXiv.
[14] Rico Sennrich,et al. Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.
[15] Rami Ben-Ari,et al. Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning , 2020, AAAI.
[16] William B. Dolan,et al. Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.
[17] Ilya Sutskever,et al. Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.
[18] Cordelia Schmid,et al. ViViT: A Video Vision Transformer , 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).
[19] Amit K. Roy-Chowdhury,et al. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval , 2018, ICMR.
[20] Du Q. Huynh,et al. Hallucinating IDT Descriptors and I3D Optical Flow Features for Action Recognition With CNNs , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[21] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[22] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[23] David A. Ross,et al. Learning Video Representations from Textual Web Supervision , 2020, ArXiv.
[24] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Ilya Sutskever,et al. Language Models are Unsupervised Multitask Learners , 2019 .
[26] Andrew Zisserman,et al. Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval , 2021, ArXiv.
[27] Florian Metze,et al. Support-set bottlenecks for video-text representation learning , 2020, ICLR.
[28] Xirong Li,et al. Dual Encoding for Zero-Example Video Retrieval , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Yi Yang,et al. ActBERT: Learning Global-Local Video-Text Representations , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Xin Wang,et al. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[31] Lihi Zelnik-Manor,et al. An Image is Worth 16x16 Words, What is a Video Worth? , 2021, ArXiv.
[32] Chen Sun,et al. Multi-modal Transformer for Video Retrieval , 2020, ECCV.
[33] Xin Wang,et al. S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Networks , 2018, BMVC.
[34] Gunhee Kim,et al. A Joint Sequence Fusion Model for Video Question Answering and Retrieval , 2018, ECCV.
[35] Nan Duan,et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval , 2021, Neurocomputing.
[36] Yang Liu,et al. Use What You Have: Video retrieval using representations from collaborative experts , 2019, BMVC.
[37] Ruslan Salakhutdinov,et al. Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models , 2014, ArXiv.
[38] Trevor Darrell,et al. Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[39] Gedas Bertasius,et al. Is Space-Time Attention All You Need for Video Understanding? , 2021, ICML.
[40] Bernard Ghanem,et al. ActivityNet: A large-scale video benchmark for human activity understanding , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Zhe Gan,et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).