Cross-Domain Image Captioning via Cross-Modal Retrieval and Model Adaptation

In recent years, large-scale datasets of paired images and sentences have enabled remarkable progress in automatically generating descriptions for images, i.e., image captioning. However, collecting a sufficient number of image-sentence pairs in every domain is labour-intensive and time-consuming. It is therefore desirable to transfer an image captioning model trained on an existing domain with paired images and sentences (the source domain) to a new domain with only unpaired data (the target domain). In this paper, we propose a cross-modal-retrieval-aided approach to cross-domain image captioning that leverages a cross-modal retrieval model to generate pseudo image-sentence pairs in the target domain, which in turn facilitate the adaptation of the captioning model. To learn the correlation between images and sentences in the target domain, we propose an iterative cross-modal retrieval process: a retrieval model is first pre-trained on the source-domain data and then applied to the target-domain data to acquire an initial set of pseudo image-sentence pairs. These pseudo pairs are then refined by alternately fine-tuning the retrieval model on the current pseudo pairs and re-generating the pseudo pairs with the updated retrieval model. To adapt the linguistic patterns learned in the source domain to the target domain, we further propose an adaptive image captioning model with a self-attention mechanism, fine-tuned on the refined pseudo image-sentence pairs. Experimental results in settings where MSCOCO serves as the source domain and five different datasets (Flickr30k, TGIF, CUB-200, Oxford-102 and Conceptual) serve as target domains show that our method performs better than or comparably to state-of-the-art methods in most cases. We also extend our method to cross-domain video captioning, with MSR-VTT as the source domain and two other datasets (MSVD and Charades Captions) as target domains, further demonstrating its effectiveness.
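The iterative retrieval-and-refinement procedure described above can be summarized in code. Below is a minimal Python sketch, not the authors' implementation: the function names (train_retrieval, retrieve_pseudo_pairs, build_pseudo_pairs), the num_rounds hyperparameter, and the assumption that the retrieval model exposes a score(image, sentence) similarity method are all hypothetical placeholders.

    # Minimal sketch of the iterative pseudo-labeling loop described above.
    # All names are hypothetical placeholders, not the authors' actual API;
    # the retrieval model is assumed to expose a score(image, sentence)
    # method returning a cross-modal similarity (e.g., cosine similarity
    # between embedded image and sentence).

    def train_retrieval(model, pairs):
        """Fine-tune the retrieval model on (image, sentence) pairs,
        e.g., by minimizing a triplet ranking loss as in VSE++."""
        ...  # training loop omitted in this sketch
        return model

    def retrieve_pseudo_pairs(model, target_images, target_corpus):
        """Pair each target-domain image with its highest-scoring sentence
        from the unpaired target-domain sentence corpus."""
        return [(img, max(target_corpus, key=lambda s: model.score(img, s)))
                for img in target_images]

    def build_pseudo_pairs(model, source_pairs, target_images, target_corpus,
                           num_rounds=3):
        # Step 1: pre-train the retrieval model on paired source-domain data.
        model = train_retrieval(model, source_pairs)
        # Step 2: obtain an initial set of pseudo image-sentence pairs by
        # applying the pre-trained model to the unpaired target-domain data.
        pseudo_pairs = retrieve_pseudo_pairs(model, target_images, target_corpus)
        # Step 3: alternately fine-tune the retrieval model on the current
        # pseudo pairs and re-generate the pseudo pairs with the updated model.
        for _ in range(num_rounds):
            model = train_retrieval(model, pseudo_pairs)
            pseudo_pairs = retrieve_pseudo_pairs(model, target_images,
                                                 target_corpus)
        # The refined pseudo pairs are then used to fine-tune the adaptive
        # self-attention captioning model on the target domain.
        return pseudo_pairs

The fixed number of refinement rounds is an assumption of this sketch; in practice one would stop once the pseudo pairs stabilize or retrieval performance on a held-out set saturates.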
