CLIP4Caption: CLIP for Video Caption
Fengyun Rao | Dian Li | Xiu Li | Zhenhua Liu | Mingkang Tang | Zhanyu Wang
[1] Bing Li, et al. Multimodal Semantic Attention Network for Video Captioning, 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME).
[2] Bing Li, et al. Object Relational Graph With Teacher-Recommended Learning for Video Captioning, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Shin'ichi Satoh, et al. Consensus-based Sequence Training for Video Captioning, 2017, ArXiv.
[4] Benjamin Bustos, et al. Bridging Vision and Language from the Video-to-Text Perspective: A Comprehensive Review, 2021, ArXiv.
[5] Benjamin Bustos, et al. Improving Video Captioning with Temporal Composition of a Visual-Syntactic Embedding, 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
[6] Delving Deeper into the Decoder for Video Captioning, 2020, ECAI.
[7] Christopher Joseph Pal, et al. Describing Videos by Exploiting Temporal Structure, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[8] Jia Chen, et al. Generating Video Descriptions With Latent Topic Guidance, 2019, IEEE Transactions on Multimedia.
[9] Yehao Li, et al. Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network, 2021, AAAI.
[10] Benjamin Bustos, et al. Attentive Visual Semantic Specialized Network for Video Captioning, 2021, 2020 25th International Conference on Pattern Recognition (ICPR).
[11] Wei Liu, et al. Controllable Video Captioning With POS Sequence Guidance Based on Gated Fusion Network, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[12] Trevor Darrell, et al. YouTube2Text: Recognizing and Describing Arbitrary Activities Using Semantic Hierarchies and Zero-Shot Recognition, 2013, 2013 IEEE International Conference on Computer Vision.
[13] Alon Lavie, et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, 2005, IEEvaluation@ACL.
[14] Jean Carletta, et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, ACL 2005.
[15] Wei Xu, et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.
[17] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.
[18] Trevor Darrell, et al. Sequence to Sequence -- Video to Text, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).
[19] Wei Liu, et al. Reconstruction Network for Video Captioning, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[20] C. Lawrence Zitnick, et al. CIDEr: Consensus-based image description evaluation, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Jianfeng Gao, et al. Unified Vision-Language Pre-Training for Image Captioning and VQA, 2020, AAAI.
[22] Tao Mei, et al. Jointly Modeling Embedding and Translation to Bridge Video and Language, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Bernt Schiele, et al. Translating Video Content to Natural Language Descriptions, 2013, 2013 IEEE International Conference on Computer Vision.
[24] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.
[25] Tao Mei, et al. Video Captioning with Transferred Semantic Attributes, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Nan Duan, et al. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval, 2021, Neurocomputing.
[27] Jun Xu, et al. Auto-captions on GIF: A Large-scale Video-sentence Dataset for Vision-language Pre-training, 2020, ACM Multimedia.
[28] Luc Van Gool, et al. Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, 2016, ECCV.
[29] Jianfeng Gao, et al. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, 2020, ECCV.
[30] Xiaolin Hu, et al. A Semantics-Assisted Video Captioning Model Trained With Scheduled Sampling, 2019, Frontiers in Robotics and AI.
[31] Ivan Laptev, et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[32] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.
[33] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] Lejian Liao, et al. REVnet: Bring Reviewing Into Video Captioning for a Better Description, 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME).
[35] Hamid R. Arabnia, et al. Automatic Image and Video Caption Generation With Deep Learning: A Concise Review and Algorithmic Overlap, 2020, IEEE Access.
[36] Wei Chen, et al. Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework, 2015, AAAI.
[37] Chin-Yew Lin, et al. ROUGE: A Package for Automatic Evaluation of Summaries, 2004, ACL 2004.
[38] Basura Fernando, et al. SPICE: Semantic Propositional Image Caption Evaluation, 2016, ECCV.
[39] Tao Mei, et al. Jointly Localizing and Describing Events for Dense Video Captioning, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[40] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[41] Nan Duan, et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, 2020, ArXiv.