ISIA at the ImageCLEF 2017 Image Caption Task

This paper describes our methods for the caption prediction task of ImageCLEF 2017. All data we use is provided by the organizers; we do not use any external resources. Our framework has three key components: a deep model part, an SVM part, and a caption retrieval part. In the deep model part, we use an end-to-end architecture that couples a convolutional neural network (CNN) encoder with a Long Short-Term Memory (LSTM) decoder to map images to captions. Guided by statistics of the training dataset, we train separate models for different caption lengths. In the SVM part, a Support Vector Machine (SVM) decides which of these models to use when generating the description for a test image; in this way we combine the models from the deep model part. In the caption retrieval part, we use the image feature extracted by the CNN and apply a nearest-neighbor search to retrieve the training image, with its caption, most similar to the test image. The final description aggregates the generated sentence and the retrieved caption. The best of our 10 submitted runs ranks third among the groups that do not use external resources. A sketch of this test-time pipeline follows.
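
The following is a minimal sketch of the three-part pipeline described above, not the authors' code. It assumes precomputed CNN features and stub caption generators: `extract_cnn_features` and `generate_caption` are hypothetical placeholders for the CNN encoder and the length-specific LSTM decoders, and the features, labels, and captions are random stand-ins for real training data.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def extract_cnn_features(image):
    # Placeholder: in the paper, features come from a CNN encoder.
    return np.random.rand(4096)

def generate_caption(model_id, features):
    # Placeholder: one LSTM decoder per caption-length bucket.
    return f"caption generated by model {model_id}"

# --- Training-time components (stand-in data) ----------------------
train_feats = np.random.rand(100, 4096)        # CNN features of training images
length_buckets = np.random.randint(0, 3, 100)  # caption-length bucket per image

# SVM that maps an image feature to the caption-length bucket whose
# LSTM model should decode it.
selector = SVC(kernel="linear").fit(train_feats, length_buckets)

# Nearest-neighbor index over training-image features for caption retrieval.
nn_index = NearestNeighbors(n_neighbors=1).fit(train_feats)
train_captions = [f"training caption {i}" for i in range(100)]

# --- Test-time pipeline ---------------------------------------------
def describe(image):
    feats = extract_cnn_features(image).reshape(1, -1)
    model_id = selector.predict(feats)[0]          # SVM picks the model
    generated = generate_caption(model_id, feats)  # chosen LSTM decodes
    _, idx = nn_index.kneighbors(feats)            # most similar training image
    retrieved = train_captions[idx[0, 0]]
    return generated + " " + retrieved             # aggregate both parts

print(describe(None))
```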
