Common Subspace for Model and Similarity: Phrase Learning for Caption Generation from Images

Generating captions that describe images is a fundamental problem combining computer vision and natural language processing. Recent work focuses on descriptive phrases, such as "a white dog," to explain the visual composites of an input image. Phrases can express not only objects, attributes, and events but also the relations among them, and they reduce visual complexity. A caption for an input image can be generated by connecting estimated phrases with a grammar model. However, because phrases are combinations of multiple words, the number of distinct phrases is far larger than the number of single words. Consequently, phrase estimation suffers from having too few training samples per phrase. In this paper, we propose a novel phrase-learning method: Common Subspace for Model and Similarity (CoSMoS). To overcome the shortage of training samples, CoSMoS learns a subspace in which (a) all feature vectors associated with the same phrase are mapped close to one another, (b) a classifier is learned for each phrase, and (c) training samples are shared among co-occurring phrases. Experimental results demonstrate that our system is more accurate than earlier methods and that its accuracy improves as the amount of web data increases.

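The sketch below is a minimal, illustrative take on the kind of joint objective the abstract describes: image features and per-phrase classifiers live in one common subspace, a similarity term pulls same-phrase features together, a classification term trains the phrase classifiers, and a co-occurrence matrix lets an image's sample also supervise phrases that tend to co-occur with its annotated ones. It is not the authors' implementation; all names (W, U, C, the loss weights) and the specific squared-loss form are assumptions for illustration.

```python
# Minimal sketch of a CoSMoS-style joint update (assumed form, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def cosmos_step(W, U, x, y, C, lr=0.01, lam_sim=1.0, lam_cls=1.0):
    """One stochastic update for a single training image.

    W : (d, k)  projection of image features into the common subspace
    U : (p, k)  one classifier/prototype per phrase, in the same subspace
    x : (d,)    image feature vector
    y : (p,)    binary indicator of phrases annotated for this image
    C : (p, p)  row-normalized phrase co-occurrence (shares samples across phrases)
    """
    z = W.T @ x                               # map the image into the subspace
    # (c) soft targets: the image also supervises co-occurring phrases
    t = np.clip(y + C @ y, 0.0, 1.0)
    scores = U @ z                            # classifier scores for every phrase
    err = scores - t
    # (b) squared-loss classification term
    grad_U = lam_cls * np.outer(err, z)
    grad_z = lam_cls * U.T @ err
    # (a) similarity term: pull z toward the mean prototype of its annotated phrases
    if y.sum() > 0:
        mean_proto = (U * y[:, None]).sum(0) / y.sum()
        grad_z += lam_sim * (z - mean_proto)
    W -= lr * np.outer(x, grad_z)
    U -= lr * grad_U
    return W, U

# Toy usage: 5 images, 8-dim features, 4 phrases, 3-dim common subspace.
d, p, k, n = 8, 4, 3, 5
W = 0.01 * rng.standard_normal((d, k))
U = 0.01 * rng.standard_normal((p, k))
X = rng.standard_normal((n, d))
Y = (rng.random((n, p)) < 0.3).astype(float)
C = rng.random((p, p)); C = C / C.sum(1, keepdims=True)
for x, y in zip(X, Y):
    W, U = cosmos_step(W, U, x, y, C)
```

Because both the metric (similarity) term and the classifiers operate on the same projected vector z, the two objectives regularize each other, which is the intuition behind sharing a single common subspace for model and similarity.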