Sequential image encoding for vision-to-language problems

The combination of visual recognition and language understanding aims to build a shared space between the heterogeneous data of vision and text, as in tasks such as image captioning and visual question answering (VQA). Most existing approaches convert an image into a single semantic visual feature vector with deep convolutional neural networks (CNNs), while preserving the sequential nature of text and representing it with recurrent neural networks (RNNs). The key to analysing multi-source heterogeneous data is to model the inherent correlations between the sources. To reduce the heterogeneity gap between vision and language, in this work we represent the image sequentially, just as the text is represented. We exploit the objects in the visual scene and convert the image into a sequence of detected objects and their locations, analogizing a sequence of objects (a visual language) to a sequence of words (natural language). We take the order of the objects into account and evaluate different permutations and combinations of objects. Experimental results on image captioning and VQA benchmarks support our hypothesis that appropriately arranging the object sequence is beneficial for Vision-to-Language (V2L) problems.
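
To make the "image as a sequence of objects" idea concrete, the sketch below shows one plausible way to embed a detected-object sequence (class labels plus bounding-box locations) and encode it with an LSTM, in the spirit of the approach described above. It is a minimal PyTorch illustration: the module name `ObjectSequenceEncoder`, the feature dimensions, and the way box coordinates are fused with class embeddings are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch (assumed design, not the paper's code): detected objects and their
# bounding-box locations are embedded and encoded as a sequence, analogous to a
# sentence of words.
import torch
import torch.nn as nn

class ObjectSequenceEncoder(nn.Module):
    def __init__(self, num_object_classes, embed_dim=300, hidden_dim=512):
        super().__init__()
        # embed each detected object's class label, like a word embedding
        self.obj_embed = nn.Embedding(num_object_classes, embed_dim)
        # project the normalized bounding box (x1, y1, x2, y2) into the same space
        self.loc_proj = nn.Linear(4, embed_dim)
        # encode the resulting "visual sentence" with an LSTM
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, obj_labels, obj_boxes):
        # obj_labels: (batch, seq_len) int64 class ids from a detector (e.g. Faster R-CNN)
        # obj_boxes:  (batch, seq_len, 4) normalized box coordinates, in the chosen order
        tokens = self.obj_embed(obj_labels) + self.loc_proj(obj_boxes)
        outputs, (h_n, _) = self.rnn(tokens)
        # h_n[-1] is a fixed-length image representation usable for captioning or VQA
        return outputs, h_n[-1]

# Toy usage: a batch of 2 images, each with 5 detected objects in some chosen order.
encoder = ObjectSequenceEncoder(num_object_classes=80)
labels = torch.randint(0, 80, (2, 5))
boxes = torch.rand(2, 5, 4)
seq_feats, img_feat = encoder(labels, boxes)
print(img_feat.shape)  # torch.Size([2, 512])
```

Because the encoder consumes an ordered sequence, reordering `obj_labels` and `obj_boxes` changes the resulting representation, which is what makes it possible to compare different permutations of the detected objects.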
