Instance-based Deep Transfer Learning on Cross-domain Image Captioning

Image captioning achieves impressive results in generating textual descriptions of images by training on large datasets of image-sentence pairs (e.g., MSCOCO). However, applying models trained on such large, publicly available datasets to a new, specific task leads to the problem known as domain shift, caused by differing probability distributions between the source and target domains. In this research, we propose Multimodal Instance-based Deep Transfer Learning (MIBTL) for cross-domain image captioning. Transferring knowledge from a source domain to a target domain is pertinent when target-domain data are scarce; scarce data also lead to overfitting. The instance-based strategy accounts for the influence of individual examples by selecting the most representative data from the source domain as a supplement for training on the target domain. We employ deep hashing to represent each image-text pair as a binary code and use these codes to measure the distance between data points. Experiments cover two transfer conditions: a slight shift (MSCOCO to Flickr30k) and a significant shift (MSCOCO to CUB-200 and Oxford-102). The results show that MIBTL outperforms baseline methods and achieves state-of-the-art performance in cross-domain image captioning.
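
The abstract describes selecting source instances by their distance to target instances in a deep hash code space, but gives no implementation details. Below is a minimal sketch of that selection step, assuming each image-text pair has already been encoded as a 64-bit binary code by a trained cross-modal hashing network; Hamming distance is the standard metric for comparing such codes. All function names, the code length, and the random placeholder codes are illustrative assumptions, not the paper's actual implementation.

import numpy as np

def hamming_distance(codes_a, codes_b):
    # Pairwise Hamming distance between two sets of {0,1} codes.
    # codes_a: (n, k), codes_b: (m, k) -> (n, m) matrix of differing bits.
    return (codes_a[:, None, :] != codes_b[None, :, :]).sum(axis=2)

def select_source_instances(source_codes, target_codes, top_k):
    # Rank source pairs by distance to their nearest target pair and keep
    # the top_k closest as the "most representative" supplement set.
    dist = hamming_distance(source_codes, target_codes)  # (n_src, n_tgt)
    nearest = dist.min(axis=1)                           # closest target per source item
    return np.argsort(nearest)[:top_k]

# Placeholder codes; a trained cross-modal hashing model would supply real ones.
rng = np.random.default_rng(0)
source = rng.integers(0, 2, size=(2000, 64))  # hypothetical 64-bit source codes
target = rng.integers(0, 2, size=(200, 64))   # hypothetical 64-bit target codes
supplement_idx = select_source_instances(source, target, top_k=500)
print(supplement_idx[:10])  # indices of source pairs to add to target training

In the full method, the selected source pairs would presumably be merged with the target-domain pairs before training the captioning model, so that the supplement counteracts both domain shift and overfitting on the small target set.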
