ML-Decoder: Scalable and Versatile Classification Head

In this paper, we introduce ML-Decoder, a new attention-based classification head. ML-Decoder predicts the existence of class labels via queries and makes better use of spatial data than global average pooling. By redesigning the decoder architecture and using a novel group-decoding scheme, ML-Decoder is highly efficient and scales well to thousands of classes. Compared to using a larger backbone, ML-Decoder consistently provides a better speed-accuracy trade-off. ML-Decoder is also versatile: it can be used as a drop-in replacement for various classification heads, and it generalizes to unseen classes when operated with word queries. Novel query augmentations further improve its generalization ability. Using ML-Decoder, we achieve state-of-the-art results on several classification tasks: on MS-COCO multi-label, we reach 91.1% mAP; on NUS-WIDE zero-shot, we reach 31.1% ZSL mAP; and on ImageNet single-label, we reach a new top score of 80.7% with a vanilla ResNet50 backbone, without extra data or distillation. Public code will be available.
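To make the query-based idea concrete, the following is a minimal sketch of a head in which a small number of group queries attend over the backbone's spatial tokens and each pooled output is expanded into several class logits. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not spell out the group-decoding details, so the per-group linear expansion, the function names (`cross_attend`, `group_decode`, `demo`), the random weights, and all sizes are hypothetical. Learned projections, multiple heads, normalization, and feed-forward layers are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, tokens):
    """Each query pools the spatial tokens with its own attention weights.

    Global average pooling would instead give every token the same weight.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        pooled.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return pooled


def group_decode(tokens, queries, group_weights, num_classes):
    """Map K pooled query outputs to num_classes logits.

    Assumption: each group owns a small weight matrix that expands its
    pooled vector into several logits, so K can stay far below num_classes.
    """
    pooled = cross_attend(queries, tokens)
    logits = []
    for vec, weights in zip(pooled, group_weights):
        logits.extend(dot(row, vec) for row in weights)
    return logits[:num_classes]  # drop padding when K * group_size > num_classes


def demo(num_classes=10, num_groups=4, dim=8, num_tokens=49, seed=0):
    """Run the sketch on random data; all sizes are illustrative only."""
    rng = random.Random(seed)

    def rand_vec():
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    group_size = math.ceil(num_classes / num_groups)
    tokens = [rand_vec() for _ in range(num_tokens)]   # e.g. a flattened 7x7 feature map
    queries = [rand_vec() for _ in range(num_groups)]  # stand-ins for learned or word queries
    group_weights = [
        [rand_vec() for _ in range(group_size)] for _ in range(num_groups)
    ]
    return group_decode(tokens, queries, group_weights, num_classes)
```

Calling `demo()` returns one logit per class. In this sketch the attention cost depends on the number of group queries rather than the number of classes, which is one way a query-based head could stay tractable with thousands of labels.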
