暂无分享,去创建一个
[1] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.
[2] Ramakant Nevatia,et al. TALL: Temporal Activity Localization via Language Query , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[3] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Gunhee Kim,et al. A Joint Sequence Fusion Model for Video Question Answering and Retrieval , 2018, ECCV.
[5] Yoshua Bengio,et al. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches , 2014, SSST@EMNLP.
[6] Tamir Hazan,et al. A Simple Baseline for Audio-Visual Scene-Aware Dialog , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[7] Yang Liu,et al. Use What You Have: Video retrieval using representations from collaborative experts , 2019, BMVC.
[8] Amit K. Roy-Chowdhury,et al. Learning Joint Embedding with Multimodal Cues for Cross-Modal Video-Text Retrieval , 2018, ICMR.
[9] Subhashini Venugopalan,et al. Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.
[10] Ariel Shamir,et al. Towards Coherent Visual Storytelling with Ordered Image Attention , 2021, ArXiv.
[11] Larry S. Davis,et al. MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Florian Metze,et al. Support-set bottlenecks for video-text representation learning , 2020, ICLR.
[13] Xirong Li,et al. Dual Encoding for Zero-Example Video Retrieval , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[14] Daniel Jurafsky,et al. A Hierarchical Neural Autoencoder for Paragraphs and Documents , 2015, ACL.
[15] Chuang Gan,et al. Weakly Supervised Dense Event Captioning in Videos , 2018, NeurIPS.
[16] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[17] Kate Saenko,et al. Multilevel Language and Vision Integration for Text-to-Clip Retrieval , 2018, AAAI.
[18] Kate Saenko,et al. wMAN: Weakly-supervised Moment Alignment Network for Text-based Video Segment Retrieval , 2019, ArXiv.
[19] Yueming Lu,et al. A Multi-modal Clustering Method for Web Videos , 2012, ISCTCS.
[20] Bernt Schiele,et al. A dataset for Movie Description , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Shizhe Chen,et al. Fine-Grained Video-Text Retrieval With Hierarchical Graph Reasoning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Xin Wang,et al. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[24] Trevor Darrell,et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.
[25] Yao Zhao,et al. Multimodal Fusion for Video Search Reranking , 2010, IEEE Transactions on Knowledge and Data Engineering.
[26] Trevor Darrell,et al. Sequence to Sequence -- Video to Text , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[27] Juan Carlos Niebles,et al. Dense-Captioning Events in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[28] Matthieu Cord,et al. MUTAN: Multimodal Tucker Fusion for Visual Question Answering , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[29] Tamir Hazan,et al. High-Order Attention Models for Visual Question Answering , 2017, NIPS.
[30] Ivan Laptev,et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[31] Amit K. Roy-Chowdhury,et al. Weakly Supervised Video Moment Retrieval From Text Queries , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[32] Tamir Hazan,et al. Factor Graph Attention , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Gang Hua,et al. Hierarchical Multimodal LSTM for Dense Visual-Semantic Embedding , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[35] Hrishikesh B. Aradhye,et al. Video2Text: Learning to Annotate Video Content , 2009, 2009 IEEE International Conference on Data Mining Workshops.
[36] Zhe Gan,et al. Less is More: CLIPBERT for Video-and-Language Learning via Sparse Sampling , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[37] Nan Duan,et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation , 2020, ArXiv.
[38] Yi Yang,et al. ActBERT: Learning Global-Local Video-Text Representations , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Yu-Gang Jiang,et al. Semantic Proposal for Activity Localization in Videos via Sentence Query , 2019, AAAI.
[40] Bowen Zhang,et al. Cross-Modal and Hierarchical Modeling of Video and Text , 2018, ECCV.
[41] Larry S. Davis,et al. WSLLN:Weakly Supervised Natural Language Localization Networks , 2019, EMNLP.
[42] Kate Saenko,et al. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering , 2015, ECCV.
[43] Trevor Darrell,et al. Localizing Moments in Video with Natural Language , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[44] Yi Yang,et al. Hierarchical Recurrent Neural Encoder for Video Representation with Application to Captioning , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Byoung-Tak Zhang,et al. Bilinear Attention Networks , 2018, NeurIPS.
[46] Tao Mei,et al. To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression , 2018, AAAI.
[47] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.
[48] Chenliang Xu,et al. Towards Automatic Learning of Procedures From Web Instructional Videos , 2017, AAAI.
[49] Chen Sun,et al. Multi-modal Transformer for Video Retrieval , 2020, ECCV.
[50] Zhe Gan,et al. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training , 2020, EMNLP.