SIGN: Spatial-information Incorporated Generative Network for Generalized Zero-shot Semantic Segmentation

Unlike conventional zero-shot classification, zero-shot semantic segmentation predicts a class label at the pixel level instead of the image level. When solving zero-shot semantic segmentation problems, the need for pixel-level prediction with surrounding context motivates us to incorporate spatial information using positional encoding. We improve standard positional encoding by introducing the concept of Relative Positional Encoding, which integrates spatial information at the feature level and can handle arbitrary image sizes. Furthermore, while self-training is widely used in zero-shot semantic segmentation to generate pseudo-labels, we propose a new knowledge-distillation-inspired self-training strategy, namely Annealed Self-Training, which can automatically assign different importance to pseudo-labels to improve performance. We systematically study the proposed Relative Positional Encoding and Annealed Self-Training in a comprehensive experimental evaluation, and our empirical results confirm the effectiveness of our method on three benchmark datasets.
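
The two ideas named above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract gives no formulas, so the sketch assumes that a size-independent encoding can be obtained by scaling feature-map coordinates to [0, 1] before a sinusoidal encoding, and that pseudo-label importance can be taken as a temperature-scaled softmax confidence. The function names `relative_positional_encoding` and `pseudo_label_weight`, the frequency choice, and the temperature values are all hypothetical.

```python
import math


def relative_positional_encoding(height, width, channels):
    """Sinusoidal encoding of normalized (row, column) coordinates.

    Assumption: positions are scaled to [0, 1], so feature maps of
    different sizes share one coordinate range. Returns a nested list
    of shape height x width x channels.
    """
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    n_freq = channels // 4
    grid = []
    for i in range(height):
        y = i / (height - 1) if height > 1 else 0.0
        row = []
        for j in range(width):
            x = j / (width - 1) if width > 1 else 0.0
            vec = []
            for k in range(n_freq):
                freq = math.pi * (2 ** k)  # hypothetical frequency schedule
                vec += [math.sin(freq * y), math.cos(freq * y),
                        math.sin(freq * x), math.cos(freq * x)]
            row.append(vec)
        grid.append(row)
    return grid


def pseudo_label_weight(logits, temperature):
    """Return (pseudo-label, weight) for one pixel.

    Assumption: the weight is the softmax confidence of the predicted
    class; lowering the temperature sharpens the distribution.
    """
    peak = max(logits)
    exps = [math.exp((v - peak) / temperature) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, probs[label]


# The centre of a 5x5 map and of a 9x9 map receive the same encoding.
small = relative_positional_encoding(5, 5, 8)
large = relative_positional_encoding(9, 9, 8)
same_centre = small[2][2] == large[4][4]

# The same logits yield a larger weight at a lower temperature.
label_hot, weight_hot = pseudo_label_weight([2.0, 1.0, 0.0], 1.0)
label_cold, weight_cold = pseudo_label_weight([2.0, 1.0, 0.0], 0.5)
```

Under these assumptions, a confident pixel contributes more to the self-training loss than an ambiguous one, and a schedule that changes the temperature over training would change how strongly that distinction is applied. The actual schedule and loss are not stated in the abstract.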
