VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition

Deep learning-based models encounter challenges when processing long-tailed data in the real world. Existing solutions usually employ balancing strategies or transfer learning to deal with the class imbalance problem, based on the image modality alone. In this work, we present a visual-linguistic long-tailed recognition framework, termed VL-LTR, and conduct empirical studies on the benefits of introducing the text modality for long-tailed recognition (LTR). Compared to existing approaches, the proposed VL-LTR has the following merits. (1) Our method learns not only visual representations from images but also the corresponding linguistic representations from noisy class-level text descriptions collected from the Internet; (2) Our method can effectively use the learned visual-linguistic representation to improve visual recognition performance, especially for classes with fewer image samples. We also conduct extensive experiments and set new state-of-the-art performance on widely used LTR benchmarks. Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which outperforms the previous best method by over 17 points and is close to the prevailing performance of models trained on the full ImageNet. Code will be released.
