Saying the Unseen: Video Descriptions via Dialog Agents
[1] Kate Saenko,et al. Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering, 2015, ECCV.
[2] Leslie G. Ungerleider,et al. Mechanisms of visual attention in the human cortex, 2000, Annual Review of Neuroscience.
[3] Yu Wu,et al. Exploring Heterogeneous Clues for Weakly-Supervised Audio-Visual Video Parsing , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[4] Jianfeng Gao,et al. Unified Vision-Language Pre-Training for Image Captioning and VQA , 2020, AAAI.
[5] Tae-Hyun Oh,et al. Learning to Localize Sound Source in Visual Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[6] Richard Socher,et al. Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] 知秀 柴田. Understand It in 5 Minutes!? Skimming Famous Papers: Jacob Devlin et al.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2020.
[8] Salim Roukos,et al. Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.
[9] Andrew Zisserman,et al. Objects that Sound , 2017, ECCV.
[10] Qi Wu,et al. Asking the Difficult Questions: Goal-Oriented Visual Question Generation via Intermediate Rewards , 2017, ArXiv.
[11] Hui Wang,et al. Iterative Context-Aware Graph Inference for Visual Dialog , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[12] Wei Liu,et al. Bidirectional Attentive Fusion with Context Gating for Dense Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[13] Yu Cheng,et al. Multi-step Reasoning via Recurrent Dual Attention for Visual Dialog , 2019, ACL.
[14] Tao Mei,et al. Video Captioning with Transferred Semantic Attributes , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] Chuang Gan,et al. Self-supervised Audio-visual Co-segmentation , 2019, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[16] Vaibhava Goel,et al. Self-Critical Sequence Training for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] Jeffrey Pennington,et al. GloVe: Global Vectors for Word Representation , 2014, EMNLP.
[18] Joelle Pineau,et al. Spoken Dialogue Management Using Probabilistic Reasoning , 2000, ACL.
[19] Luowei Zhou,et al. End-to-End Dense Video Captioning with Masked Transformer , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[20] Bolei Zhou,et al. Scene Parsing through ADE20K Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Marilyn A. Walker,et al. Reinforcement Learning for Spoken Dialogue Systems , 1999, NIPS.
[22] Shiliang Pu,et al. Counterfactual Samples Synthesizing for Robust Visual Question Answering , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[23] Rogério Schmidt Feris,et al. Learning to Separate Object Sounds by Watching Unlabeled Video , 2018, ECCV.
[24] José M. F. Moura,et al. Visual Dialog , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[25] Alexander J. Smola,et al. Stacked Attention Networks for Image Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Anoop Cherian,et al. End-to-end Audio Visual Scene-aware Dialog Using Multimodal Attention-based Video Features , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[27] Rita Cucchiara,et al. Meshed-Memory Transformer for Image Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[28] Xuelong Li,et al. Deep Multimodal Clustering for Unsupervised Audiovisual Learning , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[29] Tao Mei,et al. X-Linear Attention Networks for Image Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[30] Leslie G. Ungerleider,et al. Increased Activity in Human Visual Cortex during Directed Attention in the Absence of Visual Stimulation , 1999, Neuron.
[31] Andrew Owens,et al. Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning , 2017, International Journal of Computer Vision.
[32] Dhruv Batra,et al. Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions? , 2016, EMNLP.
[33] Yu-Jung Heo,et al. Answerer in Questioner's Mind: Information Theoretic Approach to Goal-Oriented Visual Dialog , 2018, NeurIPS.
[34] Mario Fritz,et al. Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[35] Saurabh Singh,et al. Where to Look: Focus Regions for Visual Question Answering , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[36] Jean Carletta,et al. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization , 2005, ACL 2005.
[37] Richard Socher,et al. Dynamic Memory Networks for Visual and Textual Question Answering , 2016, ICML.
[38] Abhishek Das,et al. Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline , 2020, ECCV.
[39] Hugo Latapie,et al. Learning Audio-Visual Correlations From Variational Cross-Modal Generation , 2021, ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
[40] Tamir Hazan,et al. Factor Graph Attention , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[41] Yan Yan,et al. Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention , 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).
[42] Tao Mei,et al. Boosting Image Captioning with Attributes , 2016, 2017 IEEE International Conference on Computer Vision (ICCV).
[43] Svetlana Lazebnik,et al. Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space , 2017, NIPS.
[44] Yi Yang,et al. Revisiting EmbodiedQA: A Simple Baseline and Beyond , 2019, IEEE Transactions on Image Processing.
[45] Tamir Hazan,et al. A Simple Baseline for Audio-Visual Scene-Aware Dialog , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[46] Philip H. S. Torr,et al. FLIPDIAL: A Generative Model for Two-Way Visual Dialogue , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[47] Yi Yang,et al. Describing Unseen Videos via Multi-Modal Cooperative Dialog Agents , 2020, ECCV.
[48] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.
[49] Svetlana Lazebnik,et al. Two Can Play This Game: Visual Dialog with Discriminative Question Generation and Answering , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[50] Rogério Schmidt Feris,et al. Dialog-based Interactive Image Retrieval , 2018, NeurIPS.
[51] Qi Wu,et al. Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[52] Yi Yang,et al. Decoupled Novel Object Captioner , 2018, ACM Multimedia.
[53] Jatin Ganhotra,et al. Learning End-to-End Goal-Oriented Dialog with Multiple Answers , 2018, EMNLP.
[54] Tat-Seng Chua,et al. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[55] Stefan Lee,et al. Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).
[56] Tae-Hyun Oh,et al. Listen to Look: Action Recognition by Previewing Audio , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[57] Wei Liu,et al. Reconstruction Network for Video Captioning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[58] Hanwang Zhang,et al. Two Causal Principles for Improving Visual Dialog , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[59] C. Lawrence Zitnick,et al. CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[60] Jiasen Lu,et al. Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.
[61] Chenliang Xu,et al. Audio-Visual Event Localization in Unconstrained Videos , 2018, ECCV.
[62] Rongrong Ji,et al. Variational Structured Semantic Inference for Diverse Image Captioning , 2019, NeurIPS.
[63] Verena Rieser,et al. History for Visual Dialog: Do we really need it? , 2020, ACL.
[64] Jiebo Luo,et al. Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[65] Yoshua Bengio,et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.
[66] Alon Lavie,et al. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.
[67] Yahong Han,et al. Explore Multi-Step Reasoning in Video Question Answering , 2018, CoVieW@MM.
[68] Bohyung Han,et al. Visual Reference Resolution using Attention Memory for Visual Dialog , 2017, NIPS.
[69] Matthew Turk,et al. What Should I Ask? Using Conversationally Informative Rewards for Goal-oriented Visual Dialog , 2019, ACL.
[70] Hugo Larochelle,et al. GuessWhat?! Visual Object Discovery through Multi-modal Dialogue , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[71] Margaret Mitchell,et al. VQA: Visual Question Answering , 2015, International Journal of Computer Vision.
[72] Anoop Cherian,et al. Audio Visual Scene-Aware Dialog , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[73] Andrew Owens,et al. Audio-Visual Scene Analysis with Self-Supervised Multisensory Features , 2018, ECCV.
[74] Bolei Zhou,et al. Semantic Understanding of Scenes Through the ADE20K Dataset , 2016, International Journal of Computer Vision.
[75] Chin-Yew Lin,et al. ROUGE: A Package for Automatic Evaluation of Summaries , 2004, ACL 2004.
[76] Basura Fernando,et al. SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.
[77] Yan Yan,et al. Dual Attention Matching for Audio-Visual Event Localization , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).
[78] Chuang Gan,et al. The Sound of Pixels , 2018, ECCV.
[79] Olivier Pietquin,et al. End-to-end optimization of goal-driven and visually grounded dialogue systems , 2017, IJCAI.
[80] Jing Liu,et al. Normalized and Geometry-Aware Self-Attention Network for Image Captioning , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).