Discriminative Region-based Multi-Label Zero-Shot Learning

Multi-label zero-shot learning (ZSL) is a more realistic counterpart of standard single-label ZSL, since several objects can co-exist in a natural image. However, the occurrence of multiple objects complicates the reasoning and requires region-specific processing of visual features to preserve their contextual cues. We note that the best existing multi-label ZSL method takes a shared approach towards attending to region features, with a common set of attention maps for all the classes. Such shared maps lead to diffused attention, which does not discriminatively focus on relevant locations when the number of classes is large. Moreover, mapping spatially-pooled visual features to the class semantics leads to inter-class feature entanglement, thus hampering the classification. Here, we propose an alternative approach towards region-based discriminability-preserving multi-label zero-shot classification. Our approach maintains the spatial resolution to preserve region-level characteristics and utilizes a bi-level attention module (BiAM) to enrich the features by incorporating both region and scene context information. The enriched region-level features are then mapped to the class semantics, and only their class predictions are spatially pooled to obtain image-level predictions, thereby keeping the multi-class features disentangled. Our approach sets a new state of the art on two large-scale multi-label zero-shot benchmarks: NUS-WIDE and Open Images. On NUS-WIDE, our approach achieves an absolute gain of 6.9% mAP for ZSL, compared to the best published results. Source code is available at https://github.com/akshitac8/BiAM.
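The classify-then-pool ordering described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed shapes: the function name, the dot-product compatibility score, and the top-k spatial pooling are illustrative choices, not the paper's exact implementation, and BiAM's attention-based feature enrichment is omitted entirely.

```python
import numpy as np

def region_level_zsl_scores(region_feats, class_embeds, k=3):
    """Score each spatial region against class semantic embeddings,
    then pool the class *predictions* (not the features) over space.

    region_feats: (H, W, D) enriched region-level visual features
    class_embeds: (C, D) class semantic embeddings (e.g. word vectors)
    Returns a (C,) vector of image-level class scores.
    """
    H, W, D = region_feats.shape
    regions = region_feats.reshape(H * W, D)      # flatten the spatial grid
    # Per-region compatibility with every class embedding: shape (H*W, C).
    region_scores = regions @ class_embeds.T
    # Spatial pooling happens only on class predictions, so per-class
    # evidence from different regions is never mixed at the feature level.
    # Here: average the k highest-responding regions for each class.
    topk = np.sort(region_scores, axis=0)[-k:]    # (k, C)
    return topk.mean(axis=0)                      # (C,)
```

The essential point is the ordering: pooling the feature map first and classifying a single pooled vector would entangle evidence for different co-occurring classes, whereas classifying each region and then pooling the scores keeps them disentangled.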
