BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues
暂无分享,去创建一个
Doyen Sahoo | Nancy F. Chen | Hung Le | Steven C.H. Hoi | S. Hoi | Hung Le | Doyen Sahoo
[1] Jan Kautz,et al. STEP: Spatio-Temporal Progressive Learning for Video Action Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[2] Chong-Wah Ngo,et al. Learning Spatio-Temporal Representation With Local and Global Diffusion , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Stefan Ultes,et al. MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling , 2018, EMNLP.
[4] Ramakant Nevatia,et al. Motion-Appearance Co-memory Networks for Video Question Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[5] Truyen Tran,et al. Hierarchical Conditional Relation Networks for Video Question Answering , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[6] Tao Mei,et al. Structured Two-Stream Attention Network for Video Question Answering , 2019, AAAI.
[7] Andrew Zisserman,et al. Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[8] Anoop Cherian,et al. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[9] Pavlo Molchanov,et al. Multilayer and Multimodal Fusion of Deep Neural Networks for Video Classification , 2016, ACM Multimedia.
[10] Navdeep Jaitly,et al. Pointer Networks , 2015, NIPS.
[11] Geoffrey E. Hinton,et al. Layer Normalization , 2016, ArXiv.
[12] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[13] Truyen Tran,et al. Neural Reasoning, Fast and Slow, for Video Question Answering , 2020, 2020 International Joint Conference on Neural Networks (IJCNN).
[14] Svetha Venkatesh,et al. Learning to Reason with Relational Video Representation for Question Answering , 2019, ArXiv.
[15] Sergey Ioffe,et al. Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] Yale Song,et al. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Yutaka Satoh,et al. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[18] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] Chao Li,et al. Collaborative Spatiotemporal Feature Learning for Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[20] Chuang Gan,et al. Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering , 2019, AAAI.
[21] Ali Farhadi,et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding , 2016, ECCV.
[22] Florian Metze,et al. CMU Sinbad’s Submission for the DSTC7 AVSD Challenge , 2019 .
[23] Anoop Cherian,et al. Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog , 2019, INTERSPEECH.
[24] Wei Liu,et al. Spatio-Temporal Dynamics and Semantic Attribute Enriched Visual Encoding for Video Captioning , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Zhuowen Tu,et al. Aggregated Residual Transformations for Deep Neural Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Runhao Zeng,et al. Location-Aware Graph Convolutional Networks for Video Question Answering , 2020, AAAI.
[27] Anoop Cherian,et al. Audio Visual Scene-Aware Dialog , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Jian Zhang,et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.
[29] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.
[31] Lihong Li,et al. Neural Approaches to Conversational AI , 2019, Found. Trends Inf. Retr..
[32] Licheng Yu,et al. TVQA: Localized, Compositional Video Question Answering , 2018, EMNLP.
[33] Trevor Darrell,et al. Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding , 2016, EMNLP.
[34] Jian Sun,et al. Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Jongwook Choi,et al. End-to-End Concept Word Detection for Video Captioning, Retrieval, and Question Answering , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[37] Yash Goyal,et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] Shu Zhang,et al. Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[39] Yahong Han,et al. Reasoning with Heterogeneous Graph Alignment for Video Question Answering , 2020, AAAI.
[40] Yue Gao,et al. Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering , 2020, AAAI.
[41] Lorenzo Torresani,et al. Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).
[42] Doyen Sahoo,et al. Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems , 2019, ACL.
[43] Kuldip K. Paliwal,et al. Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..
[44] Sanja Fidler,et al. MovieQA: Understanding Stories in Movies through Question-Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[45] Qingming Huang,et al. Spatiotemporal CNN for Video Object Segmentation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Tien Dat Nguyen,et al. From FiLM to Video: Multi-turn Question Answering with Multi-modal Context , 2018, ArXiv.
[47] Tamir Hazan,et al. Factor Graph Attention , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[48] Andrew Zisserman,et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[49] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[50] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.
[51] Tim K. Marks,et al. Audio Visual Scene-aware dialog (AVSD) Track for Natural Language Generation in DSTC7 , 2019 .
[52] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[53] Yoshua Bengio,et al. Understanding the difficulty of training deep feedforward neural networks , 2010, AISTATS.
[54] Steven C. H. Hoi,et al. End-to-End Multimodal Dialog Systems with Hierarchical Multimodal Attention on Video Features , 2018 .
[55] Richard S. Zemel,et al. Exploring Models and Data for Image Question Answering , 2015, NIPS.
[56] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[57] Wei Xu,et al. Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[58] Bo Du,et al. Fast Spatio-Temporal Residual Network for Video Super-Resolution , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[59] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.