Captioning Images with Novel Objects via Online Vocabulary Expansion

In this study, we introduce a low-cost method for generating descriptions of images containing novel objects. Building a model that can describe images with novel objects is generally expensive for two reasons: (1) a large amount of data must be collected for each new category, and (2) the entire system must be retrained. When humans see only a few instances of a novel object, they can estimate its properties by associating its appearance with known objects. Accordingly, we propose a method that describes images with novel objects without retraining, using word embeddings for the objects estimated from only a small number of their image features. The method can be integrated with general image-captioning models. Experimental results show the effectiveness of our approach.
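The core idea, estimating a word embedding for a novel object from a handful of its image features and adding it to the caption model's vocabulary online, can be sketched as follows. This is a minimal illustration only, assuming a pretrained CNN feature extractor and pretrained word embeddings (e.g., GloVe); the similarity-weighted averaging below and all names (`estimate_embedding`, `known_feats`, etc.) are illustrative assumptions, not necessarily the exact estimator used in the paper.

```python
import numpy as np

def estimate_embedding(novel_feats, known_feats, known_embeds, k=5):
    """Estimate a word embedding for a novel object from a few image
    features, as a similarity-weighted average of the embeddings of the
    k visually closest known categories (one plausible instantiation of
    the idea; an assumption, not the paper's exact method).

    novel_feats:  (m, f) CNN features of the few novel-object images
    known_feats:  (n, f) mean CNN feature per known category
    known_embeds: (n, d) word embedding per known category
    """
    query = novel_feats.mean(axis=0)                 # average the few examples
    query = query / np.linalg.norm(query)
    kf = known_feats / np.linalg.norm(known_feats, axis=1, keepdims=True)
    sims = kf @ query                                # cosine similarity per known class
    top = np.argsort(sims)[-k:]                      # k visually nearest categories
    w = np.exp(sims[top])                            # softmax weights over neighbors
    w /= w.sum()
    return w @ known_embeds[top]                     # (d,) estimated embedding

# Usage with random stand-in data: the returned vector would be appended to
# the captioning model's word-embedding matrix so the decoder can emit the
# new word without any retraining.
rng = np.random.default_rng(0)
known_feats = rng.normal(size=(100, 512))    # 100 known categories
known_embeds = rng.normal(size=(100, 300))   # their 300-d word embeddings
novel_feats = rng.normal(size=(3, 512))      # only 3 images of the novel object
new_vec = estimate_embedding(novel_feats, known_feats, known_embeds)
```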
