Assessing semantic information in convolutional neural network representations of images via image annotation

Image annotation, the prediction of multiple descriptive tags for an image, is a challenging task. Most existing algorithms rely on large sets of handcrafted features. Deep convolutional neural networks have recently surpassed human-level performance on image classification benchmarks, and such networks can be used to extract features that are highly predictive of an image's tags. In this study, we analyze the semantic information in features derived from two pre-trained deep network classifiers by evaluating their performance in nearest neighbor-based approaches to tag prediction. The deep features generally outperform the handcrafted features, and we find that the two kinds of features carry complementary information when combined for image annotation.
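
The abstract does not spell out the tag-propagation rule, so the following is a minimal NumPy sketch of one standard nearest-neighbor scheme consistent with it: a test image's tag scores are the votes of its k closest training images under cosine distance, computed from deep features alone and from an equal-weight fusion of deep and handcrafted distances. The feature dimensions, k = 5, the fusion weights, and the synthetic data are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def cosine_distances(queries, references):
    """Pairwise cosine distance between L2-normalized feature rows."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    r = references / np.linalg.norm(references, axis=1, keepdims=True)
    return 1.0 - q @ r.T

def knn_tag_scores(dist, train_tags, k=5):
    """Score each tag for each query as the fraction of its k nearest
    training images carrying that tag; train_tags is a binary
    (n_train, n_tags) matrix."""
    nn = np.argsort(dist, axis=1)[:, :k]   # indices of k nearest neighbors
    return train_tags[nn].mean(axis=1)     # (n_query, n_tags) vote fractions

# Illustrative stand-ins for real features: deep features (e.g., a CNN's
# penultimate layer) and a handcrafted descriptor, with random values here.
rng = np.random.default_rng(0)
deep_train, deep_test = rng.normal(size=(1000, 4096)), rng.normal(size=(10, 4096))
hand_train, hand_test = rng.normal(size=(1000, 512)), rng.normal(size=(10, 512))
train_tags = (rng.random((1000, 260)) < 0.02).astype(float)

# Deep features alone, then a simple equal-weight fusion of both distances.
d_deep = cosine_distances(deep_test, deep_train)
d_hand = cosine_distances(hand_test, hand_train)
scores_deep = knn_tag_scores(d_deep, train_tags)
scores_both = knn_tag_scores(0.5 * d_deep + 0.5 * d_hand, train_tags)

# Predict a fixed number of highest-scoring tags per image, the usual
# evaluation protocol in this literature.
top5 = np.argsort(-scores_both, axis=1)[:, :5]
```

Distance-level fusion is only one way to combine the two feature sets; concatenating the (suitably normalized) feature vectors before the neighbor search is an equally simple alternative, and which the paper used is not stated in the abstract.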
