SuperCaptioning: Image Captioning Using Two-dimensional Word Embedding

Language and vision are treated as two different modalities in current image captioning work. However, recent work on the Super Characters method demonstrates the effectiveness of two-dimensional word embedding, which converts a text classification problem into an image classification problem. In this paper, we propose the SuperCaptioning method, which borrows the idea of two-dimensional word embedding from the Super Characters method and processes language and vision information jointly in a single CNN model. Experimental results on the Flickr30k dataset show that the proposed method produces high-quality image captions. An interactive demo will be presented at the workshop.
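The core of two-dimensional word embedding is laying a sentence out as glyphs on a fixed-size canvas so that text becomes an image a CNN can consume. The sketch below illustrates only the grid-layout step of that idea; the 224-pixel canvas size and the square row-major grid are illustrative assumptions, not the paper's exact configuration, and actual glyph rendering is omitted.

```python
import math

def grid_layout(tokens, canvas=224):
    """Assign each token a square cell on a canvas, row-major.

    Returns (cell_size, positions), where positions holds the
    top-left (x, y) corner of each token's cell. This mimics how
    Super Characters-style methods tile glyphs into an image.
    """
    n = len(tokens)
    side = math.ceil(math.sqrt(n))   # smallest square grid that fits n tokens
    cell = canvas // side            # pixel size of each token's cell
    positions = [((i % side) * cell, (i // side) * cell) for i in range(n)]
    return cell, positions

# Example: a five-token caption fits a 3x3 grid of 74-pixel cells.
cell, positions = grid_layout(["a", "dog", "runs", "on", "grass"])
```

In a full pipeline, each token's glyph would be drawn into its cell, and the resulting text image combined with the scene image as input to a single CNN.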
