Unsupervised Keyword Extraction for Full-sentence VQA

In most existing Visual Question Answering (VQA) methods, the answers are short, often single words, owing to the instructions given to annotators when the datasets were constructed. In this study, we envision a new VQA task set in natural situations, where the answers are more likely to be full sentences than single words. To bridge the gap between this natural VQA and existing VQA studies, we propose a novel unsupervised keyword extraction method for VQA. Our key insight is that a full-sentence answer can be decomposed into two parts: one that contains new information with respect to the question (i.e., the keyword) and one that contains information already present in the question. We design discriminative decoders to enforce this decomposition. We conduct experiments on VQA datasets that contain full-sentence answers and show that our proposed model can correctly extract the keyword without explicit annotations of what the keyword is.
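The decomposition described above can be pictured as a soft, token-level split of the answer representation. The following is a minimal PyTorch sketch of that idea only; it is not the authors' implementation, and every module name, dimension, and the sigmoid gating mechanism (KeywordDecomposer, gate, etc.) is an illustrative assumption.

```python
# Minimal sketch (not the authors' implementation) of the decomposition idea:
# a soft gate splits the answer representation into a "keyword" part (new
# information) and a "question-redundant" part, and two decoders can then be
# trained discriminatively so that only the redundant part reconstructs the
# question. All names, sizes, and the gating mechanism are assumptions.
import torch
import torch.nn as nn

class KeywordDecomposer(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.ans_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.gate = nn.Linear(hidden_dim, 1)          # per-token keyword score
        self.q_decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.q_proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, answer_ids, question_in_ids):
        # Encode the full-sentence answer token by token.
        ans_emb = self.embed(answer_ids)              # (B, Ta, E)
        ans_hid, _ = self.ans_encoder(ans_emb)        # (B, Ta, H)

        # Soft split: alpha close to 1 marks keyword tokens (new information).
        alpha = torch.sigmoid(self.gate(ans_hid))     # (B, Ta, 1)
        keyword_part = (alpha * ans_hid).sum(dim=1)   # (B, H)
        redundant_part = ((1 - alpha) * ans_hid).sum(dim=1)

        # Decode the question from the question-redundant part only
        # (teacher-forced inputs); a discriminative term, not shown here,
        # would penalize the keyword part for also predicting the question.
        q_emb = self.embed(question_in_ids)           # (B, Tq, E)
        dec_out, _ = self.q_decoder(q_emb, redundant_part.unsqueeze(0))
        q_logits = self.q_proj(dec_out)               # (B, Tq, V)
        return q_logits, alpha, keyword_part
```

Under this sketch, the question decoder is reconstructed from the question-redundant part only, and at test time the answer token with the highest gate value alpha would be read off as the extracted keyword; the actual training objectives and architecture may differ from this illustration.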
