Relaxing from Vocabulary: Robust Weakly-Supervised Deep Learning for Vocabulary-Free Image Tagging

The development of deep learning has empowered machines with comparable capability of recognizing limited image categories to human beings. However, most existing approaches heavily rely on human-curated training data, which hinders the scalability to large and unlabeled vocabularies in image tagging. In this paper, we propose a weakly-supervised deep learning model which can be trained from the readily available Web images to relax the dependence on human labors and scale up to arbitrary tags (categories). Specifically, based on the assumption that features of true samples in a category tend to be similar and noises tend to be variant, we embed the feature map of the last deep layer into a new affinity representation, and further minimize the discrepancy between the affinity representation and its low-rank approximation. The discrepancy is finally transformed into the objective function to give relevance feedback to back propagation. Experiments show that we can achieve a performance gain of 14.0% in terms of a semantic-based relevance metric in image tagging with 63,043 tags from the WordNet, against the typical deep model trained on the ImageNet 1,000 vocabulary set.

[1]  Tao Mei,et al.  Tagging Personal Photos with Transfer Deep Learning , 2015, WWW.

[2]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[3]  Shie Mannor,et al.  Outlier-Robust PCA: The High-Dimensional Case , 2013, IEEE Transactions on Information Theory.

[4]  Yi Ma,et al.  Robust principal component analysis? , 2009, JACM.

[5]  Lars Kai Hansen,et al.  Design of robust neural network classifiers , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[6]  Jürgen Schmidhuber,et al.  Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction , 2011, ICANN.

[7]  Mikhail Belkin,et al.  Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering , 2001, NIPS.

[8]  Fei-Fei Li,et al.  What Does Classifying More Than 10, 000 Image Categories Tell Us? , 2010, ECCV.

[9]  Marc'Aurelio Ranzato,et al.  DeViSE: A Deep Visual-Semantic Embedding Model , 2013, NIPS.

[10]  Rob Fergus,et al.  Learning from Noisy Labels with Deep Neural Networks , 2014, ICLR.

[11]  Xinlei Chen,et al.  NEIL: Extracting Visual Knowledge from Web Data , 2013, 2013 IEEE International Conference on Computer Vision.

[12]  Jieping Ye,et al.  Discriminative K-means for Clustering , 2007, NIPS.

[13]  Tao Mei,et al.  Image Tag Refinement With View-Dependent Concept Representations , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[14]  Gang Hua,et al.  Unsupervised One-Class Learning for Automatic Outlier Removal , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Zi Huang,et al.  Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence ℓ2,1-Norm Regularized Discriminative Feature Selection for Unsupervised Learning , 2022 .

[16]  Benjamin Graham,et al.  Spatially-sparse convolutional neural networks , 2014, ArXiv.

[17]  Ali Farhadi,et al.  Learning Everything about Anything: Webly-Supervised Visual Concept Learning , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[18]  Geoffrey E. Hinton,et al.  Learning to Label Aerial Images from Noisy Data , 2012, ICML.

[19]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[20]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Ivan Laptev,et al.  Is object localization for free? - Weakly-supervised learning with convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Sylvain Paris,et al.  Learning photographic global tonal adjustment with a database of input / output image pairs , 2011, CVPR 2011.

[23]  Ivan Laptev,et al.  Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[24]  Clayton D. Scott,et al.  Robust kernel density estimation , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[25]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[26]  Xiaogang Wang,et al.  Hierarchical face parsing via deep learning , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Jiajun Wu,et al.  Deep multiple instance learning for image classification and auto-annotation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).