Understanding Advertisements with BERT

We consider a task based on the CVPR 2018 challenge dataset on advertisement (Ad) understanding. The task is to detect the viewer's interpretation of an Ad image, where the interpretation is expressed as text. Recent results have shown that the scene-text embedded in the image is a vital cue for this task. Motivated by this, we fine-tune the base BERT model for sentence-pair classification. Even with scene-text as the only source of visual information, we achieve a hit-or-miss accuracy of 84.95% on the challenge test data. To let BERT process other visual information, we append image captions to the scene-text. This raises the accuracy to 89.69%, an improvement of 4.74 percentage points. This is the best reported result for this task.
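As a rough illustration of the sentence-pair setup described above, the following Python sketch builds a BERT-style paired input from scene-text, appended captions, and a candidate interpretation, and computes a hit-or-miss score. It is only a sketch under stated assumptions: the function names, the ordering of the two segments, the whitespace tokenization (standing in for BERT's WordPiece tokenizer), and the reading of "hit-or-miss" as "the top-scoring candidate is one of the correct interpretations" are illustrative choices that the abstract does not specify.

```python
from typing import List, Sequence, Set, Tuple

CLS, SEP = "[CLS]", "[SEP]"


def build_pair_input(
    scene_text: str, captions: List[str], statement: str
) -> Tuple[List[str], List[int]]:
    """Build a BERT-style sentence-pair input (hypothetical helper).

    Segment A is the scene-text with image captions appended; segment B is
    a candidate interpretation. This ordering is an assumption.
    """
    # Whitespace splitting is a stand-in for WordPiece tokenization.
    segment_a = " ".join([scene_text] + captions).split()
    segment_b = statement.split()
    tokens = [CLS] + segment_a + [SEP] + segment_b + [SEP]
    # Segment id 0 covers [CLS], segment A and the first [SEP];
    # segment id 1 covers segment B and the final [SEP].
    segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)
    return tokens, segment_ids


def hit_or_miss_accuracy(
    scores: Sequence[Sequence[float]], correct: Sequence[Set[int]]
) -> float:
    """Fraction of images whose top-scoring candidate is a correct one.

    scores[i][j] is the classifier score for candidate j of image i, and
    correct[i] is the set of correct candidate indices for image i.
    """
    hits = 0
    for image_scores, image_correct in zip(scores, correct):
        top = max(range(len(image_scores)), key=image_scores.__getitem__)
        hits += top in image_correct
    return hits / len(scores)


if __name__ == "__main__":
    tokens, segment_ids = build_pair_input(
        "just do it", ["a man running"], "I should buy these shoes"
    )
    print(tokens)
    print(segment_ids)
```

In a real pipeline the token list and segment ids would come from a BERT tokenizer, and the per-candidate scores from the fine-tuned classifier's output for each (visual text, candidate) pair.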
