Peng Gao | Anoop Cherian | Takaaki Hori | Chiori Hori | Tim K. Marks | Shijie Geng | Jonathan Le Roux | Ankit P. Shah
[1] Ali Farhadi, et al. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, ECCV, 2016.
[2] Lukasz Kaiser, et al. Attention Is All You Need, NIPS, 2017.
[3] Anoop Cherian, et al. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features, ICASSP, 2019.
[4] Margaret Mitchell, et al. VQA: Visual Question Answering, International Journal of Computer Vision, 2015.
[5] Anoop Cherian, et al. Audio Visual Scene-Aware Dialog, CVPR, 2019.
[6] Peng Gao, et al. Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers, AAAI, 2021.
[7] Jie Zhou, et al. Bridging Text and Video: A Universal Multimodal Transformer for Video-Audio Scene-Aware Dialog, 2021.
[8] Yoshua Bengio, et al. Attention-Based Models for Speech Recognition, NIPS, 2015.
[9] Esa Rahtu, et al. A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer, BMVC, 2020.
[10] Yash Goyal, et al. Yin and Yang: Balancing and Answering Binary Visual Questions, CVPR, 2016.
[11] Yash Goyal, et al. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering, CVPR, 2017.
[12] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate, ICLR, 2014.
[13] Jeffrey Pennington, et al. GloVe: Global Vectors for Word Representation, EMNLP, 2014.
[14] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.
[15] Anoop Cherian, et al. Overview of the Eighth Dialog System Technology Challenge: DSTC8, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
[16] Sanja Fidler, et al. MovieQA: Understanding Stories in Movies through Question-Answering, CVPR, 2016.
[17] Andrew Zisserman, et al. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset, CVPR, 2017.
[18] Florian Metze, et al. CMU Sinbad's Submission for the DSTC7 AVSD Challenge, 2019.
[19] Geoffrey E. Hinton, et al. Layer Normalization, arXiv, 2016.
[20] Chiori Hori, et al. Overview of the Seventh Dialog System Technology Challenge: DSTC7, Computer Speech & Language, 2020.
[21] Aren Jansen, et al. CNN Architectures for Large-Scale Audio Classification, ICASSP, 2017.
[22] Anoop Cherian, et al. Joint Student-Teacher Learning for Audio-Visual Scene-Aware Dialog, INTERSPEECH, 2019.
[23] Doyen Sahoo, et al. Multimodal Transformer Networks for End-to-End Video-Grounded Dialogue Systems, ACL, 2019.
[24] John R. Hershey, et al. Attention-Based Multimodal Fusion for Video Description, ICCV, 2017.
[25] Kaiming He, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015.