RepeatPadding: Balancing words and sentence length for language comprehension in visual question answering

Abstract Visual question answering (VQA) is a challenging AI task that requires not only understanding multi-modal inputs but also reasoning over them to produce a correct answer. Recent works rely on increasingly sophisticated reasoning modules, yet the language representation, which frequently guides the reasoning in VQA, has not been fully explored, leading to insufficient reasoning and unsatisfactory answers. In this work, two language-processing methods, VieAns and RepeatPadding, are proposed to balance the question by cropping and padding it: the question is transformed into different expressions, which pushes the language model to capture more representative features and further boosts the accuracy of the predicted answers. Experiments on the benchmark COCO-QA and VQA2.0 datasets demonstrate the effectiveness of the proposed methods. In particular, the proposed RepeatPadding method generalizes better across different language models.
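The abstract does not spell out the padding procedure, but the name suggests repeating the question tokens instead of appending zero (pad) tokens until a fixed input length is reached, with over-long questions cropped. The sketch below illustrates that reading under those assumptions; the repeat_pad function, the PAD_ID constant, and the maximum length used in the example are hypothetical and not taken from the paper.

    # A minimal sketch of the repeat-padding idea, assuming token-level repetition:
    # instead of filling a short question with zero (pad) tokens, its tokens are
    # repeated until the fixed maximum length is reached; longer questions are cropped.

    from typing import List

    PAD_ID = 0  # hypothetical padding token id, used only for empty input


    def repeat_pad(token_ids: List[int], max_len: int) -> List[int]:
        """Pad a tokenized question to max_len by repeating it, or crop it."""
        if not token_ids:
            return [PAD_ID] * max_len
        if len(token_ids) >= max_len:
            return token_ids[:max_len]      # crop over-long questions
        repeated: List[int] = []
        while len(repeated) < max_len:      # cycle the question tokens
            repeated.extend(token_ids)
        return repeated[:max_len]


    if __name__ == "__main__":
        # e.g. "what color is the cat" -> 5 token ids, padded to length 14
        question = [12, 45, 7, 3, 98]
        print(repeat_pad(question, 14))
        # zero padding, for comparison: [12, 45, 7, 3, 98, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Compared with conventional zero padding, repeating the question keeps every position filled with informative words, which is one way such a transformation could push the language model toward more representative features, as the abstract claims.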
