Joint Intermodal and Intramodal Label Transfers for Extremely Rare or Unseen Classes

In this paper, we present a label transfer model from texts to images for image classification tasks. The problem of image classification is often much more challenging than text classification. On one hand, labeled text data is more widely available than the labeled images for classification tasks. On the other hand, text data tends to have natural semantic interpretability, and they are often more directly related to class labels. On the contrary, the image features are not directly related to concepts inherent in class labels. One of our goals in this paper is to develop a model for revealing the functional relationships between text and image features as to directly transfer intermodal and intramodal labels to annotate the images. This is implemented by learning a transfer function as a bridge to propagate the labels between two multimodal spaces. However, the intermodal label transfers could be undermined by blindly transferring the labels of noisy texts to annotate images. To mitigate this problem, we present an intramodal label transfer process, which complements the intermodal label transfer by transferring the image labels instead when relevant text is absent from the source corpus. In addition, we generalize the inter-modal label transfer to zero-shot learning scenario where there are only text examples available to label unseen classes of images without any positive image examples. We evaluate our algorithm on an image classification task and show the effectiveness with respect to the other compared algorithms.

[1]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[2]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[3]  Heng Ji,et al.  Exploring Context and Content Links in Social Media: A Latent Space Method , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Tommi S. Jaakkola,et al.  Maximum-Margin Matrix Factorization , 2004, NIPS.

[5]  Rajat Raina,et al.  Self-taught learning: transfer learning from unlabeled data , 2007, ICML '07.

[6]  Rajat Raina,et al.  Constructing informative priors using transfer learning , 2006, ICML.

[7]  Nello Cristianini,et al.  An Introduction to Support Vector Machines and Other Kernel-based Learning Methods , 2000 .

[8]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[9]  Thomas G. Dietterich,et al.  Improving SVM accuracy by training on auxiliary data sources , 2004, ICML.

[10]  S. Yun,et al.  An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems , 2009 .

[11]  Christoph H. Lampert,et al.  Learning to detect unseen object classes by between-class attribute transfer , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  Qiang Yang,et al.  Heterogeneous Transfer Learning for Image Clustering via the SocialWeb , 2009, ACL.

[13]  Subhashini Venugopalan,et al.  Translating Videos to Natural Language Using Deep Recurrent Neural Networks , 2014, NAACL.

[14]  Bernt Schiele,et al.  Evaluation of output embeddings for fine-grained image classification , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Cordelia Schmid,et al.  Label-Embedding for Attribute-Based Classification , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[17]  Steve Marschner,et al.  Caliber: Camera Localization and Calibration Using Rigidity Constraints , 2016, International Journal of Computer Vision.

[18]  Tat-Seng Chua,et al.  NUS-WIDE: a real-world web image database from National University of Singapore , 2009, CIVR '09.

[19]  Dimitri P. Bertsekas,et al.  Nonlinear Programming , 1997 .

[20]  Vicente Ordonez,et al.  Im2Text: Describing Images Using 1 Million Captioned Photographs , 2011, NIPS.

[21]  Dong Xu,et al.  Exploiting Privileged Information from Web Data for Image Categorization , 2014, ECCV.

[22]  Haohan Wang,et al.  Multimodal Transfer Deep Learning with Applications in Audio-Visual Recognition , 2014 .

[23]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[24]  Emmanuel J. Candès,et al.  A Singular Value Thresholding Algorithm for Matrix Completion , 2008, SIAM J. Optim..

[25]  Trevor Darrell,et al.  What you saw is not what you get: Domain adaptation using asymmetric kernel transforms , 2011, CVPR 2011.

[26]  Cristian Sminchisescu,et al.  Twin Gaussian Processes for Structured Prediction , 2010, International Journal of Computer Vision.

[27]  Michael I. Jordan,et al.  Multiple kernel learning, conic duality, and the SMO algorithm , 2004, ICML.

[28]  Qiang Yang,et al.  Heterogeneous Transfer Learning for Image Classification , 2011, AAAI.

[29]  Ivor W. Tsang,et al.  Learning With Augmented Features for Supervised and Semi-Supervised Heterogeneous Domain Adaptation , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  Pietro Perona,et al.  Caltech-UCSD Birds 200 , 2010 .

[31]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[32]  Jürgen Schmidhuber,et al.  Multimodal Similarity-Preserving Hashing , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[33]  Peter W. Foltz,et al.  An introduction to latent semantic analysis , 1998 .

[34]  Babak Saleh,et al.  Write a Classifier: Zero-Shot Learning Using Purely Textual Descriptions , 2013, 2013 IEEE International Conference on Computer Vision.

[35]  Mohan S. Kankanhalli,et al.  Audio Matters in Visual Attention , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[36]  Tao Xiang,et al.  Learning Multimodal Latent Attributes , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[37]  Geoffrey E. Hinton,et al.  Zero-shot Learning with Semantic Output Codes , 2009, NIPS.

[38]  Qiang Yang,et al.  Translated Learning: Transfer Learning across Different Feature Spaces , 2008, NIPS.

[39]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[40]  Thomas G. Dietterich,et al.  To transfer or not to transfer , 2005, NIPS 2005.

[41]  S. Yun,et al.  An accelerated proximal gradient algorithm for nuclear norm regularized linear least squares problems , 2009 .

[42]  Haohan Wang,et al.  Multimodal Transfer Deep Learning for Audio Visual Recognition , 2014, ArXiv.

[43]  Charu C. Aggarwal,et al.  Towards cross-category knowledge propagation for learning visual concepts , 2011, CVPR 2011.

[44]  Ivor W. Tsang,et al.  Domain Transfer SVM for video concept detection , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[45]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[46]  Michael Isard,et al.  A Multi-View Embedding Space for Modeling Internet Images, Tags, and Their Semantics , 2012, International Journal of Computer Vision.

[47]  Patrick F. Reidy An Introduction to Latent Semantic Analysis , 2009 .

[48]  Charu C. Aggarwal,et al.  Towards semantic knowledge propagation from text corpus to web images , 2011, WWW.

[49]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[50]  Juhan Nam,et al.  Multimodal Deep Learning , 2011, ICML.

[51]  Xian-Sheng Hua,et al.  Learning semantic distance from community-tagged media collection , 2009, MM '09.

[52]  Shimon Ullman,et al.  Uncovering shared structures in multiclass classification , 2007, ICML '07.