HSVA: Hierarchical Semantic-Visual Adaptation for Zero-Shot Learning

Zero-shot learning (ZSL) tackles the unseen-class recognition problem by transferring semantic knowledge from seen classes to unseen ones. Typically, to guarantee desirable knowledge transfer, a common (latent) space is adopted for associating the visual and semantic domains in ZSL. However, existing common-space learning methods align the semantic and visual domains by merely mitigating distribution disagreement through one-step adaptation. This strategy is usually ineffective due to the heterogeneous nature of the feature representations in the two domains, which intrinsically contain both distribution and structure variations. To address this problem and advance ZSL, we propose a novel hierarchical semantic-visual adaptation (HSVA) framework. Specifically, HSVA aligns the semantic and visual domains through a hierarchical two-step adaptation, i.e., structure adaptation and distribution adaptation. In the structure adaptation step, we employ two task-specific encoders to map the source data (visual domain) and the target data (semantic domain) into a structure-aligned common space. To this end, a supervised adversarial discrepancy (SAD) module is proposed to adversarially minimize the discrepancy between the predictions of two task-specific classifiers, thus aligning the visual and semantic feature manifolds more closely. In the distribution adaptation step, we directly minimize the Wasserstein distance between the latent multivariate Gaussian distributions to align the visual and semantic distributions through a common encoder. Finally, structure adaptation and distribution adaptation are derived in a unified framework built on two partially aligned variational autoencoders. Extensive experiments on four benchmark datasets demonstrate that HSVA achieves superior performance on both conventional and generalized ZSL. The code is available at https://github.com/shiming-chen/HSVA .
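
The two adaptation steps described above can be illustrated with standard building blocks. The sketch below is a rough illustration under stated assumptions, not the authors' implementation (see the linked repository for the official code): it shows an MCD-style L1 discrepancy between the softmax predictions of two task-specific classifiers, which a structure-adaptation step can minimize adversarially, and the closed-form 2-Wasserstein distance between two diagonal Gaussian posteriors (as produced by standard VAE encoders), which a distribution-adaptation step can minimize directly. All function names, tensor shapes, and the diagonal-Gaussian assumption are illustrative.

```python
# Minimal PyTorch sketch of the two alignment losses; not the official HSVA code.
import torch


def classifier_discrepancy(logits_1: torch.Tensor, logits_2: torch.Tensor) -> torch.Tensor:
    """L1 discrepancy between the softmax outputs of two task-specific classifiers.
    In MCD-style training, the classifiers are updated to maximize this quantity
    while the shared encoders are updated to minimize it."""
    return (logits_1.softmax(dim=1) - logits_2.softmax(dim=1)).abs().mean()


def gaussian_w2_distance(mu_v: torch.Tensor, logvar_v: torch.Tensor,
                         mu_s: torch.Tensor, logvar_s: torch.Tensor) -> torch.Tensor:
    """Closed-form 2-Wasserstein distance between two diagonal Gaussians
    N(mu_v, diag(sigma_v^2)) and N(mu_s, diag(sigma_s^2)), used here as a
    distribution-alignment loss between visual and semantic latent codes."""
    sigma_v = torch.exp(0.5 * logvar_v)
    sigma_s = torch.exp(0.5 * logvar_s)
    w2_sq = ((mu_v - mu_s) ** 2).sum(dim=1) + ((sigma_v - sigma_s) ** 2).sum(dim=1)
    return torch.sqrt(w2_sq + 1e-8).mean()


if __name__ == "__main__":
    # Toy usage with random tensors standing in for encoder/classifier outputs.
    batch, latent_dim, num_classes = 32, 64, 50
    logits_1 = torch.randn(batch, num_classes)  # classifier C1 predictions
    logits_2 = torch.randn(batch, num_classes)  # classifier C2 predictions
    mu_v, logvar_v = torch.randn(batch, latent_dim), torch.randn(batch, latent_dim)
    mu_s, logvar_s = torch.randn(batch, latent_dim), torch.randn(batch, latent_dim)

    print("classifier discrepancy:", classifier_discrepancy(logits_1, logits_2).item())
    print("Wasserstein distance:  ", gaussian_w2_distance(mu_v, logvar_v, mu_s, logvar_s).item())
```

Note that for diagonal covariances the general Wasserstein formula reduces to the simple sum of squared differences of means and standard deviations used above, which keeps the distribution-adaptation loss cheap to compute per batch.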
