Visual Image Caption Generation for Service Robotics and Industrial Applications

Image caption generation is a task that generates a sentence from a raw image, which is mimicking the intelligence of human that can acquire knowledge from the view. The difficulty of this task is the combination of multimodal knowledge learning, i.e. recognition of objects, actions, scenes, human, etc. In order to perform semantic understanding for service robotics or other industrial applications, the caption must be enhanced for recognition of the objects in the confined environment. We propose a template-based augmentation method for improving the capability of object recognition while retaining the other capability of the image caption model. This work opens a new era of image caption generation training procedure that the caption dataset and the classification dataset can be combined to train the deep captioning model. We show in our experiments that our improved model outperforms the original model in SPICE metrics by 4 times.

[1]  Yoshua Bengio,et al.  Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[2]  Krystian Mikolajczyk,et al.  Deep correlation for matching images and text , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Yejin Choi,et al.  Baby talk: Understanding and generating simple image descriptions , 2011, CVPR 2011.

[4]  Yoshua Bengio,et al.  Show, Attend and Tell: Neural Image Caption Generation with Visual Attention , 2015, ICML.

[5]  Alexander Schwing,et al.  Fast, Diverse and Accurate Image Captioning Guided by Part-Of-Speech , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Cyrus Rashtchian,et al.  Every Picture Tells a Story: Generating Sentences from Images , 2010, ECCV.

[7]  Jiebo Luo,et al.  Image Captioning with Semantic Attention , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Xinlei Chen,et al.  Microsoft COCO Captions: Data Collection and Evaluation Server , 2015, ArXiv.

[9]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10]  Peter Young,et al.  Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics , 2013, J. Artif. Intell. Res..

[11]  C. V. Jawahar,et al.  Choosing Linguistics over Vision to Describe Images , 2012, AAAI.

[12]  Yasuo Kuniyoshi,et al.  Efficient image annotation for automatic sentence generation , 2012, ACM Multimedia.

[13]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[14]  C. Lawrence Zitnick,et al.  CIDEr: Consensus-based image description evaluation , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Karl Stratos,et al.  Midge: Generating Image Descriptions From Computer Vision Detections , 2012, EACL.

[16]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[17]  Frank Keller,et al.  Image Description using Visual Dependency Representations , 2013, EMNLP.

[18]  Richard Socher,et al.  Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Xu Qian,et al.  Sequential Dual Attention: Coarse-to-Fine-Grained Hierarchical Generation for Image Captioning , 2018, Symmetry.

[20]  Basura Fernando,et al.  SPICE: Semantic Propositional Image Caption Evaluation , 2016, ECCV.

[21]  Qi Wu,et al.  Image Captioning and Visual Question Answering Based on Attributes and External Knowledge , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[22]  Armand Joulin,et al.  Deep Fragment Embeddings for Bidirectional Image Sentence Mapping , 2014, NIPS.

[23]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[24]  Alon Lavie,et al.  Meteor Universal: Language Specific Translation Evaluation for Any Target Language , 2014, WMT@ACL.

[25]  C. Constantinidis,et al.  Bottom-Up and Top-Down Attention , 2014, The Neuroscientist : a review journal bringing neurobiology, neurology and psychiatry.

[26]  Trevor Darrell,et al.  Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Chun C. Lai,et al.  Multisensor Fusion and Integration: Theories, Applications, and its Perspectives , 2011, IEEE Sensors Journal.