Zhe Gan | Jingjing Liu | Lei Zhang | Yu Cheng | Luowei Zhou
[1] Juan Carlos Niebles, et al. Dense-Captioning Events in Videos, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[2] Nan Duan, et al. UniViLM: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation, 2020, ArXiv.
[3] Efstratios Gavves, et al. Self-Supervised Video Representation Learning with Odd-One-Out Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Andrew Zisserman, et al. Self-supervised Co-training for Video Representation Learning, 2020, NeurIPS.
[5] Mohit Bansal, et al. MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning, 2020, ACL.
[6] Cordelia Schmid, et al. AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[7] Christopher Joseph Pal, et al. Using Descriptive Video Services to Create a Large Data Source for Video Annotation Research, 2015, ArXiv.
[8] Andru Putra Twinanda, et al. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos, 2016, IEEE Transactions on Medical Imaging.
[9] Bernard Ghanem, et al. ActivityNet: A large-scale video benchmark for human activity understanding, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[10] Bernard Ghanem, et al. Self-Supervised Learning by Cross-Modal Audio-Video Clustering, 2019, NeurIPS.
[11] James Glass, et al. AVLnet: Learning Audio-Visual Language Representations from Instructional Videos, 2021, Interspeech 2021.
[12] Wei Xu, et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[13] Christopher Joseph Pal, et al. Movie Description, 2016, International Journal of Computer Vision.
[14] Sanja Fidler, et al. MovieQA: Understanding Stories in Movies through Question-Answering, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Serge J. Belongie, et al. Spatiotemporal Contrastive Video Representation Learning, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Georg Heigold, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, 2021, ICLR.
[17] Yale Song, et al. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Ramakant Nevatia, et al. TALL: Temporal Activity Localization via Language Query, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[20] Andrew Zisserman, et al. Self-Supervised MultiModal Versatile Networks, 2020, NeurIPS.
[21] Radu Soricut, et al. Multimodal Pretraining for Dense Video Captioning, 2020, AACL.
[22] Marc'Aurelio Ranzato, et al. Video (language) modeling: a baseline for generative models of natural videos, 2014, ArXiv.
[23] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.
[24] Thomas Brox, et al. COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning, 2020, NeurIPS.
[25] Sebastian Ramos, et al. The Cityscapes Dataset for Semantic Urban Scene Understanding, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Thomas Serre, et al. HMDB: A large video database for human motion recognition, 2011, 2011 International Conference on Computer Vision.
[27] Yansong Tang, et al. COIN: A Large-Scale Dataset for Comprehensive Instructional Video Analysis, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Bernt Schiele, et al. A Dataset for Movie Description, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Yoon Kim, et al. Convolutional Neural Networks for Sentence Classification, 2014, EMNLP.
[30] Radu Soricut, et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, 2018, ACL.
[31] Yu Cheng, et al. UNITER: UNiversal Image-TExt Representation Learning, 2019, ECCV.
[32] Kaiming He, et al. Momentum Contrast for Unsupervised Visual Representation Learning, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[33] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.
[34] Henry C. Lin, et al. JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS): A Surgical Activity Dataset for Human Motion Modeling, 2014.
[35] Oriol Vinyals, et al. Representation Learning with Contrastive Predictive Coding, 2018, ArXiv.
[36] Ilya Sutskever, et al. Learning Transferable Visual Models From Natural Language Supervision, 2021, ICML.
[37] Yale Song, et al. TGIF: A New Dataset and Benchmark on Animated GIF Description, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Geoffrey E. Hinton, et al. A Simple Framework for Contrastive Learning of Visual Representations, 2020, ICML.
[39] Ivan Laptev, et al. HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[40] Radu Soricut, et al. A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions, 2019, CoNLL.
[41] Tao Mei, et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[42] Jun Yu, et al. ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering, 2019, AAAI.
[43] Yi Yang, et al. ActBERT: Learning Global-Local Video-Text Representations, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[44] Dima Damen, et al. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset, 2018, ArXiv.
[45] Cordelia Schmid, et al. Learning Video Representations using Contrastive Bidirectional Transformer, 2019.
[46] Quoc V. Le, et al. Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, 2021, ICML.
[47] Yang Zhang, et al. BioMegatron: Larger Biomedical Domain Language Model, 2020, EMNLP.
[48] Florian Metze, et al. How2: A Large-scale Dataset for Multimodal Language Understanding, 2018, NIPS 2018.
[49] Ivan Laptev, et al. Cross-Task Weakly Supervised Learning From Instructional Videos, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[50] Andreas Geiger, et al. Are we ready for autonomous driving? The KITTI vision benchmark suite, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[51] Ivan Laptev, et al. Unsupervised Learning from Narrated Instruction Videos, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[52] Gunhee Kim, et al. A Joint Sequence Fusion Model for Video Question Answering and Retrieval, 2018, ECCV.
[53] Luowei Zhou, et al. End-to-End Dense Video Captioning with Masked Transformer, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[54] Andrew Zisserman, et al. End-to-End Learning of Visual Representations From Uncurated Instructional Videos, 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[55] Cordelia Schmid, et al. VideoBERT: A Joint Model for Video and Language Representation Learning, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[56] Nitish Srivastava, et al. Unsupervised Learning of Video Representations using LSTMs, 2015, ICML.
[57] Bernt Schiele, et al. A database for fine grained activity detection of cooking activities, 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.
[58] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Stefan Lee, et al. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, 2019, NeurIPS.
[60] Licheng Yu, et al. TVQA: Localized, Compositional Video Question Answering, 2018, EMNLP.
[61] Mubarak Shah, et al. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild, 2012, ArXiv.
[62] Mohit Bansal, et al. Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA, 2020, ACL.
[63] Ming-Hsuan Yang, et al. Unsupervised Representation Learning by Sorting Sequences, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[64] Jianfeng Gao, et al. Unified Vision-Language Pre-Training for Image Captioning and VQA, 2020, AAAI.
[65] Vedanuj Goswami, et al. Are we pretraining it right? Digging deeper into visio-linguistic pretraining, 2020, ArXiv.
[66] Doug Downey, et al. Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, 2020, ACL.
[67] Chenliang Xu, et al. Towards Automatic Learning of Procedures From Web Instructional Videos, 2017, AAAI.
[68] Zhe Gan, et al. HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training, 2020, EMNLP.
[69] Bernard Ghanem, et al. Temporal Localization of Moments in Video Collections with Natural Language, 2019, ArXiv.
[70] Trevor Darrell, et al. Localizing Moments in Video with Natural Language, 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[71] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.
[72] Yueting Zhuang, et al. Video Question Answering via Gradually Refined Attention over Appearance and Motion, 2017, ACM Multimedia.
[73] William B. Dolan, et al. Collecting Highly Parallel Data for Paraphrase Evaluation, 2011, ACL.
[74] Mohit Bansal, et al. TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval, 2020, ECCV.