More is Better: Precise and Detailed Image Captioning Using Online Positive Recall and Missing Concepts Mining

Recently, great progress in automatic image captioning has been achieved by exploiting semantic concepts detected from the image. However, we argue that the existing concepts-to-caption framework, in which the concept detector is trained on image-caption pairs to minimize the vocabulary discrepancy, suffers from insufficient concepts. The reasons are two-fold: 1) the extreme imbalance between the numbers of positive and negative samples for each concept, and 2) incomplete labeling in the training captions caused by biased annotation and the use of synonyms. In this paper, we propose a method, termed online positive recall and missing concepts mining, to overcome these problems. Our method adaptively re-weights the loss of different samples according to their predictions for online positive recall, and uses a two-stage optimization strategy for missing concepts mining. In this way, more semantic concepts can be detected and higher detection accuracy can be expected. At the caption generation stage, we explore an element-wise selection process that automatically chooses the most suitable concepts at each time step. Our method can thus generate more precise and detailed captions to describe the image. We conduct extensive experiments on the MSCOCO image captioning dataset and the MSCOCO online test server; the results show that our method achieves superior captioning performance compared with other competitive methods.

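The abstract does not give the exact form of the adaptive loss re-weighting, so the following is only a minimal sketch of the idea, not the paper's method: a focal-style weighting (in the spirit of Lin et al.'s focal loss) applied to a multi-label concept detector, where poorly-predicted positives receive larger loss weights so that more positive concepts are recalled during training. The function name reweighted_concept_loss and the hyper-parameter gamma are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reweighted_concept_loss(logits, labels, gamma=2.0):
    """Hypothetical sketch of prediction-adaptive loss re-weighting
    for multi-label concept detection under positive/negative imbalance.

    logits: (batch, num_concepts) raw detector scores
    labels: (batch, num_concepts) binary concept annotations (float)
    gamma:  assumed hyper-parameter controlling how strongly
            poorly-predicted positives are up-weighted
    """
    probs = torch.sigmoid(logits)
    # Element-wise binary cross-entropy, one term per (image, concept) pair.
    bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    # Up-weight positives the detector currently misses (low probability)
    # and down-weight easy negatives: a focal-style weighting, used here
    # only to illustrate the "online positive recall" idea.
    weight = torch.where(labels > 0.5, (1.0 - probs) ** gamma, probs ** gamma)
    return (weight * bce).mean()
```

Setting gamma to 0 recovers plain binary cross-entropy; increasing it focuses training on the positive concepts the detector currently misses, which is the stated goal of the online positive recall step.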