Dual Learning for Cross-domain Image Captioning

Recent AI research has witnessed increasing interest in automatically generating textual descriptions of images, a task coined the image captioning problem. Significant progress has been made in domains where plenty of labeled training data (i.e., image-text pairs) is readily available. However, obtaining richly annotated data is a time-consuming and expensive process, creating a substantial barrier to applying image captioning methods in a new domain. In this paper, we propose a cross-domain image captioning approach that uses a novel dual learning mechanism to overcome this barrier. First, we model the alignment between the neural representations of images and those of natural language in the source domain, where sufficient labeled data is available. Second, we adapt the pre-trained model using limited (or unpaired) data in the target domain. In particular, we introduce a dual learning mechanism with a policy gradient method that generates highly rewarded captions. The mechanism simultaneously optimizes two coupled objectives: generating textual descriptions of images and generating plausible images from textual descriptions, with the hope that explicitly exploiting this coupled relation safeguards the performance of image captioning in the target domain. To verify the effectiveness of our model, we use the MSCOCO dataset as the source domain and two other datasets (Oxford-102 and Flickr30k) as the target domains. The experimental results show that our model consistently outperforms previous methods for cross-domain image captioning.
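
To make the coupled objectives concrete, below is a minimal sketch (in PyTorch) of one dual-learning update with a REINFORCE-style policy gradient. Everything in it is an illustrative assumption rather than the paper's implementation: the toy `Captioner` (primal model, image to text) and `Reconstructor` (dual model, text to image) stand in for the real networks, and the dual reward is simplified to a negative image-reconstruction error, whereas a full system would also reward caption fluency and use a stronger text-to-image model.

```python
# Minimal, illustrative sketch of a dual-learning policy-gradient update.
# All names, sizes, and the reward function are assumptions for demonstration.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, IMG_DIM, HID = 1000, 256, 128

class Captioner(nn.Module):
    """Primal model (toy): image feature -> sampled caption tokens."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(IMG_DIM, HID)
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def sample(self, img_feat, max_len=10):
        # Initialize the decoder state from the image; sample tokens step by
        # step, keeping their log-probabilities for the policy gradient.
        h = torch.tanh(self.proj(img_feat)).unsqueeze(0)       # (1, B, HID)
        tok = torch.zeros(img_feat.size(0), dtype=torch.long)  # <bos> id = 0
        ids, logps = [], []
        for _ in range(max_len):
            y, h = self.rnn(self.emb(tok).unsqueeze(1), h)
            dist = torch.distributions.Categorical(logits=self.out(y.squeeze(1)))
            tok = dist.sample()
            ids.append(tok)
            logps.append(dist.log_prob(tok))
        return torch.stack(ids, 1), torch.stack(logps, 1)      # (B, T) each

class Reconstructor(nn.Module):
    """Dual model (toy): caption tokens -> reconstructed image feature."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, IMG_DIM)

    def forward(self, ids):
        _, h = self.rnn(self.emb(ids))
        return self.out(h.squeeze(0))                          # (B, IMG_DIM)

captioner, reconstructor = Captioner(), Reconstructor()
opt = torch.optim.Adam(list(captioner.parameters()) +
                       list(reconstructor.parameters()), lr=1e-4)

img = torch.randn(8, IMG_DIM)   # stand-in for unpaired target-domain images
ids, logps = captioner.sample(img)

# Dual reward: how well the generated caption lets the dual model
# reconstruct the original image (higher is better).
recon = reconstructor(ids)
reward = -F.mse_loss(recon, img, reduction="none").mean(dim=1)  # (B,)
baseline = reward.mean()        # simple baseline for variance reduction

# REINFORCE on the primal model: raise the log-probability of captions with
# above-baseline reward; the dual model trains on the reconstruction loss.
pg_loss = -((reward - baseline).detach() * logps.sum(dim=1)).mean()
recon_loss = F.mse_loss(recon, img)

opt.zero_grad()
(pg_loss + recon_loss).backward()
opt.step()
```

Detaching the reward keeps the policy-gradient term from back-propagating through the reconstruction error a second time, and the mean-reward baseline reduces gradient variance; both are standard REINFORCE practice rather than specifics of this paper.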
