Paper Information - Deep ViT Features as Dense Visual Descriptors

Deep ViT Features as Dense Visual Descriptors

We study the use of deep features extracted from a pretrained Vision Transformer (ViT) as dense visual descriptors. We observe and empirically demonstrate that such features, when extracted from a self-supervised ViT model (DINO-ViT), exhibit several striking properties, including: (i) the features encode powerful, well-localized semantic information at high spatial granularity, such as object parts; (ii) the encoded semantic information is shared across related, yet different object categories; and (iii) positional bias changes gradually throughout the layers. These properties allow us to design simple methods for a variety of applications, including co-segmentation, part co-segmentation and semantic correspondences. To distill the power of ViT features from convoluted design choices, we restrict ourselves to lightweight zero-shot methodologies (e.g., binning and clustering) applied directly to the features. Since our methods require no additional training or data, they are readily applicable across a variety of domains. We show by extensive qualitative and quantitative evaluation that our simple methodologies achieve results competitive with recent state-of-the-art supervised methods, and outperform previous unsupervised methods by a large margin. Code is available at dino-vit-features.github.io.
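To make the idea of clustering applied directly to the features more concrete, the following is a minimal sketch and not the authors' implementation. It assumes per-patch descriptors have already been extracted from each image as arrays of shape (num_patches, dim); random stand-in arrays would serve equally well for experimentation. The helper names `l2_normalize`, `kmeans` and `co_cluster` are hypothetical, and plain k-means is used here only as one simple clustering choice. Descriptors from all images are clustered jointly, so a given cluster label refers to the same group of features in every image.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each descriptor to unit length so distances reflect direction."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means on the rows of x; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct data points (fancy indexing copies).
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers


def co_cluster(descriptor_sets, k):
    """Cluster descriptors from several images jointly.

    descriptor_sets: list of (num_patches_i, dim) arrays, one per image
    (assumed to be precomputed dense descriptors).
    Returns one label array per image, with labels shared across images.
    """
    stacked = l2_normalize(np.concatenate(descriptor_sets, axis=0))
    labels, _ = kmeans(stacked, k)
    # Split the joint label vector back into per-image label arrays.
    boundaries = np.cumsum([len(d) for d in descriptor_sets])[:-1]
    return np.split(labels, boundaries)
```

Calling `co_cluster([desc_a, desc_b], k=4)` returns two label arrays, one per image. Reshaping each to the patch grid of its image would give a coarse label map; deciding which clusters count as foreground is a separate step that this sketch does not attempt.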
