Deep ViT Features as Dense Visual Descriptors

We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different, object categories; and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training or data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve results competitive with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at dino-vit-features.github.io.
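As a rough illustration of what "clustering applied directly to the features" can look like, the sketch below jointly clusters per-patch descriptors from two images so that patches sharing a cluster id form a common segment. It is a minimal sketch under stated assumptions, not the authors' pipeline: the descriptors are synthetic stand-ins for DINO-ViT features, the dimension 384 and the grid sizes are arbitrary, and the names `kmeans_cosine` and `cocluster` are hypothetical helpers introduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def kmeans_cosine(desc, k, iters=10):
    """Plain k-means on unit-norm descriptors with farthest-point initialization."""
    desc = l2_normalize(desc)
    centers = [desc[0]]
    for _ in range(k - 1):
        # Pick the descriptor least similar to every center chosen so far.
        sim_to_nearest = (desc @ np.stack(centers).T).max(axis=1)
        centers.append(desc[np.argmin(sim_to_nearest)])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = np.argmax(desc @ centers.T, axis=1)
        for c in range(k):
            members = desc[labels == c]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[c] = members.mean(axis=0)
        centers = l2_normalize(centers)
    return np.argmax(desc @ centers.T, axis=1)

def cocluster(desc_a, desc_b, grid_a, grid_b, k):
    """Cluster the patches of both images together, then split the labels back."""
    labels = kmeans_cosine(np.concatenate([desc_a, desc_b], axis=0), k)
    n_a = desc_a.shape[0]
    return labels[:n_a].reshape(grid_a), labels[n_a:].reshape(grid_b)

# Synthetic stand-in for ViT patch descriptors (assumption: 384 dimensions).
rng = np.random.default_rng(0)
dim = 384
prototypes = l2_normalize(rng.normal(size=(2, dim)))  # two "semantic" directions
grid_a, grid_b = (4, 4), (3, 5)                       # patch grids of the two images
true_a = np.tile([0, 0, 1, 1], 4)                     # ground-truth region per patch
true_b = np.arange(15) % 2
desc_a = prototypes[true_a] + 0.01 * rng.normal(size=(16, dim))
desc_b = prototypes[true_b] + 0.01 * rng.normal(size=(15, dim))

seg_a, seg_b = cocluster(desc_a, desc_b, grid_a, grid_b, k=2)
print(seg_a)
print(seg_b)
```

Clustering the two images in one pool is what makes the cluster ids comparable across images; clustering each image separately would give labels with no shared meaning. With real features, the descriptor arrays would come from a pretrained model rather than from the synthetic prototypes used here.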
