Multi-level Contrastive Learning for Self-Supervised Vision Transformers

Recent studies have sought to develop contrastive self-supervised learning (CSL) algorithms tailored to the family of Vision Transformers (ViTs), so that they can be trained as reliably as conventional convolutional backbones. Despite promising performance on downstream tasks, these approaches overlook a compelling property of ViTs. As previous studies have demonstrated, vision transformers benefit from the global attention mechanism even at early stages, yielding feature representations that aggregate information from distant patches already in their shallow layers. Motivated by this, we present a simple yet effective framework to facilitate self-supervised feature learning for transformer-based vision architectures, namely Multi-level Contrastive learning for Vision Transformers (MCVT). Specifically, we equip vision transformers with an instance-level (InfoNCE) contrastive loss and a prototype-level (ProtoNCE) contrastive loss at different stages of the architecture, capturing low-level and high-level invariance between views of a sample, respectively. We conduct extensive experiments with two well-known vision transformer backbones on several downstream vision tasks, including linear classification, object detection, and semantic segmentation, to demonstrate the effectiveness of the proposed method.
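To make the two loss levels concrete, the following is a minimal NumPy sketch, assuming the instance-level term takes the standard InfoNCE form (positives are two views of the same sample) and the prototype-level term contrasts deep features against cluster centroids, as in ProtoNCE. The function names, the weighting factor `lam`, and the temperature `tau` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce(z1, z2, tau=0.1):
    """Instance-level InfoNCE: the positive pair is the same sample's two views."""
    z1, z2 = l2_normalize(z1), l2_normalize(z2)
    logits = z1 @ z2.T / tau                          # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                # cross-entropy on the diagonal

def proto_nce(z, prototypes, assignments, tau=0.1):
    """Prototype-level ProtoNCE: the positive is the assigned cluster centroid."""
    z, c = l2_normalize(z), l2_normalize(prototypes)
    logits = z @ c.T / tau                            # (N, K) sample-to-prototype sims
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(log_prob[np.arange(len(z)), assignments])

def mcvt_loss(shallow1, shallow2, deep, prototypes, assignments, lam=1.0):
    """Hypothetical combined multi-level objective: InfoNCE on shallow-stage
    features of two views, ProtoNCE on deep-stage features."""
    return info_nce(shallow1, shallow2) + lam * proto_nce(deep, prototypes, assignments)
```

In this sketch the shallow-stage loss pulls low-level representations of two views together, while the deep-stage loss groups high-level representations around shared prototypes; the relative weight `lam` between the two terms is a free hyperparameter here.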
