Spatially Aware Multimodal Transformers for TextVQA

Textual cues are essential for everyday tasks like buying groceries and using public transport. To develop this assistive technology, we study the TextVQA task, i.e., reasoning about text in images to answer a question. Existing approaches are limited in their use of spatial relations and rely on fully-connected transformer-like architectures to implicitly learn the spatial structure of a scene. In contrast, we propose a novel spatially aware self-attention layer in which each visual entity attends only to neighboring entities defined by a spatial graph; further, each head of the multi-head self-attention layer focuses on a different subset of relations. Our approach has two advantages: (1) each head considers local context instead of dispersing attention amongst all visual entities; (2) we avoid learning redundant features. We show that our model improves absolute accuracy on TextVQA by 2.2% overall over an improved state-of-the-art baseline, and by 4.62% on questions that involve spatial reasoning and can be answered correctly using OCR tokens. Similarly, on ST-VQA we improve absolute accuracy by 4.2%. We further show that spatially aware self-attention improves visual grounding.
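To make the mechanism concrete, below is a minimal PyTorch sketch of such a layer: attention scores are masked so that each head only scores entity pairs connected by its assigned subset of spatial relations from the spatial graph. The class name, the `rel_masks` argument, and the mask layout are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of spatially aware multi-head self-attention
# (illustrative assumption, not the authors' released implementation).
import torch
from torch import nn


class SpatiallyAwareSelfAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, rel_masks: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_entities, dim) -- e.g. object + OCR token features.
        # rel_masks: (num_heads, n_entities, n_entities) boolean adjacency;
        # each head receives the mask for its assigned subset of spatial
        # relations (e.g. "left of", "above", "overlaps").
        b, n, _ = x.shape
        # Force self-loops so every row keeps at least one valid key
        # (an all-masked row would make the softmax undefined).
        rel_masks = rel_masks | torch.eye(n, dtype=torch.bool, device=x.device)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, num_heads, n_entities, head_dim).
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Non-neighbors under a head's relation subset get -inf before the
        # softmax, so each head attends only within its spatial neighborhood.
        scores = scores.masked_fill(~rel_masks.unsqueeze(0), float("-inf"))
        attn = scores.softmax(dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, n, -1))
```

One plausible way to use this in a full TextVQA model is to give question tokens all-True mask rows so they attend globally while only the visual entities are restricted by the spatial graph; per-head masking is what lets different heads specialize to different relation types rather than learning redundant global attention patterns.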
