Xiangang Li | Wei Zou | Dongwei Jiang | Wubo Li
[1] Kyomin Jung,et al. DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator , 2020, ArXiv.
[2] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Lama Nachman,et al. Context, Attention and Audio Feature Explorations for Audio Visual Scene-Aware Dialog , 2018, ArXiv.
[4] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[5] Yueting Zhuang,et al. Video Question Answering via Gradually Refined Attention over Appearance and Motion , 2017, ACM Multimedia.
[6] Alon Lavie,et al. Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.
[7] Doyen Sahoo,et al. Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems , 2019, ACL.
[8] Florian Metze,et al. How2: A Large-scale Dataset for Multimodal Language Understanding , 2018, NIPS 2018.
[9] Licheng Yu,et al. TVQA+: Spatio-Temporal Grounding for Video Question Answering , 2019, ACL.
[10] Ruslan Salakhutdinov,et al. Multimodal Transformer for Unaligned Multimodal Language Sequences , 2019, ACL.
[11] Tao Mei,et al. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Ali Farhadi,et al. Bidirectional Attention Flow for Machine Comprehension , 2016, ICLR.
[13] Yutaka Satoh,et al. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[14] Chin-Yew Lin,et al. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics , 2004, ACL.
[15] Anna Rumshisky,et al. Similarity-Based Reconstruction Loss for Meaning Representation , 2018, EMNLP.
[16] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[17] William B. Dolan,et al. Collecting Highly Parallel Data for Paraphrase Evaluation , 2011, ACL.
[18] Tim K. Marks,et al. Audio Visual Scene-aware dialog (AVSD) Track for Natural Language Generation in DSTC7 , 2019.
[19] Rob Fergus,et al. Visualizing and Understanding Convolutional Networks , 2013, ECCV.
[20] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.
[21] Nitish Srivastava,et al. Dropout: a simple way to prevent neural networks from overfitting , 2014, J. Mach. Learn. Res.
[22] Yun-Nung Chen,et al. Reactive Multi-Stage Feature Fusion for Multimodal Dialogue Modeling , 2019, ArXiv.
[23] Anoop Cherian,et al. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[24] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[25] Florian Metze,et al. CMU Sinbad’s Submission for the DSTC7 AVSD Challenge , 2019.
[26] John R. Hershey,et al. Attention-Based Multimodal Fusion for Video Description , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[27] Tien Dat Nguyen,et al. From FiLM to Video: Multi-turn Question Answering with Multi-modal Context , 2018, ArXiv.
[28] Aren Jansen,et al. CNN architectures for large-scale audio classification , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[29] Richard Socher,et al. Ask Me Anything: Dynamic Memory Networks for Natural Language Processing , 2015, ICML.
[30] Sanja Fidler,et al. MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[31] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).