Learning Modality-Invariant Latent Representations for Generalized Zero-shot Learning

Recently, feature generating methods have been successfully applied to zero-shot learning (ZSL). However, most previous approaches generate only visual representations for zero-shot recognition. In fact, typical ZSL is inherently a multi-modal learning problem involving both a visual space and a semantic space. In this paper, we therefore present a new method that simultaneously generates both visual and semantic representations, so that the essential multi-modal information associated with unseen classes can be captured. Specifically, we address the most challenging issue in this paradigm: how to handle the domain shift and thereby guarantee that the learned representations are modality-invariant. To this end, we propose two strategies: 1) maximizing the mutual information between the latent visual representations and the semantic representations; and 2) maximizing the entropy of the joint distribution of the two latent representations. We argue that these two strategies together align the two modalities well. Finally, extensive experiments on five widely used datasets verify that the proposed method significantly outperforms the previous state of the art.
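The abstract does not spell out which estimators realize the two strategies. As a rough illustration only, the first strategy can be instantiated with an InfoNCE-style contrastive lower bound on mutual information (in the spirit of contrastive predictive coding and MINE), and the second with a Gaussian proxy for the joint entropy. The PyTorch sketch below is a hypothetical, minimal rendering under these assumptions; the projection heads, temperature, entropy weight, and latent dimensions are all illustrative choices, not the authors' exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class InfoNCE(nn.Module):
    """InfoNCE lower bound on the mutual information between two latent codes.

    Matching (visual, semantic) pairs within a batch act as positives and
    every other pairing in the batch as a negative.
    """

    def __init__(self, dim_v, dim_s, dim_shared=256, temperature=0.07):
        super().__init__()
        self.proj_v = nn.Linear(dim_v, dim_shared)   # critic head, visual side
        self.proj_s = nn.Linear(dim_s, dim_shared)   # critic head, semantic side
        self.temperature = temperature

    def forward(self, z_v, z_s):
        v = F.normalize(self.proj_v(z_v), dim=-1)
        s = F.normalize(self.proj_s(z_s), dim=-1)
        logits = v @ s.t() / self.temperature        # batch-wise similarity matrix
        labels = torch.arange(v.size(0), device=v.device)
        # Minimizing this cross-entropy maximizes the InfoNCE bound on MI.
        return F.cross_entropy(logits, labels)

def gaussian_joint_entropy(z_v, z_s, eps=1e-4):
    """Gaussian proxy for the entropy of the joint latent distribution.

    For a Gaussian, H = 0.5 * logdet(2 * pi * e * Sigma), so maximizing the
    logdet of the batch covariance encourages a spread-out joint code.
    """
    z = torch.cat([z_v, z_s], dim=-1)
    z = z - z.mean(dim=0, keepdim=True)
    cov = z.t() @ z / (z.size(0) - 1)
    cov = cov + eps * torch.eye(cov.size(0), device=z.device)  # numerical jitter
    return 0.5 * torch.logdet(cov)

# Hypothetical usage with random stand-ins for the two modality encoders.
z_v = torch.randn(64, 512)                 # latent visual codes
z_s = torch.randn(64, 128)                 # latent semantic codes
mi_loss = InfoNCE(512, 128)(z_v, z_s)      # minimize: aligns the two modalities
ent_bonus = gaussian_joint_entropy(z_v, z_s)
total_loss = mi_loss - 0.1 * ent_bonus     # entropy weight 0.1 is arbitrary

In practice, z_v and z_s would come from the visual and semantic encoders of the generative model rather than random tensors, and the entropy weight would be tuned per dataset.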
