Visual Parser: Representing Part-whole Hierarchies with Transformers

Human vision is able to capture the part-whole hierarchical information of an entire scene. This paper presents the Visual Parser (ViP), which explicitly constructs such a hierarchy with transformers. ViP divides visual representations into two levels: the part level and the whole level. Information at each part represents a combination of several independent vectors within the whole. To model the representations of the two levels, we first encode the information from the whole into part vectors through an attention mechanism, then decode the global information within the part vectors back into the whole representation. By iteratively parsing the two levels with the proposed encoder-decoder interaction, the model can gradually refine the features on both levels. Experimental results demonstrate that ViP achieves very competitive performance on three major tasks, i.e., classification, detection, and instance segmentation. In particular, it surpasses previous state-of-the-art CNN backbones by a large margin on object detection. The tiny model of the ViP family, with 7.2× fewer parameters and 10.9× fewer FLOPs, performs comparably with ResNeXt-101-64×4d, the largest model of the ResNe(X)t family. Visualization results also demonstrate that the learnt parts are highly informative of the predicted class, making ViP more explainable than previous fundamental architectures. Code is available at https://github.com/kevin-ssy/ViP.
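The encode-then-decode interaction between the two levels can be sketched in a few lines of PyTorch. The following is a minimal illustration of one reading of the abstract, not the reference implementation (see the linked repository for that); the class name PartWholeBlock, the use of learnable part tokens, and all dimensions here are hypothetical.

```python
# Minimal sketch of the part-whole encoder-decoder interaction, assuming
# learnable part tokens and standard multi-head cross-attention. This is an
# illustration of the idea, not the authors' implementation.
import torch
import torch.nn as nn

class PartWholeBlock(nn.Module):
    def __init__(self, dim: int, num_parts: int, num_heads: int = 8):
        super().__init__()
        # A small set of learnable part vectors that summarize the whole.
        self.parts = nn.Parameter(torch.randn(1, num_parts, dim) * 0.02)
        # Encoder: parts attend to the whole (flattened patch features).
        self.encode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Decoder: the whole attends back to the refined parts.
        self.decode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_p = nn.LayerNorm(dim)
        self.norm_w = nn.LayerNorm(dim)

    def forward(self, whole: torch.Tensor) -> torch.Tensor:
        # whole: (batch, num_tokens, dim), a flattened feature map.
        b = whole.size(0)
        parts = self.parts.expand(b, -1, -1)
        # Encode: gather global information from the whole into the parts.
        p, _ = self.encode(query=parts, key=whole, value=whole)
        parts = self.norm_p(parts + p)
        # Decode: broadcast part-level information back into the whole.
        w, _ = self.decode(query=whole, key=parts, value=parts)
        return self.norm_w(whole + w)

# Usage: one round of part-whole parsing on a 14x14 feature map.
block = PartWholeBlock(dim=256, num_parts=16)
x = torch.randn(2, 196, 256)
y = block(x)  # same shape as x, refined by the encode-decode step
```

Stacking several such blocks corresponds to the iterative parsing described above: each round lets the parts re-summarize the whole and the whole re-incorporate the parts, so features at both levels are gradually refined.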
