Paper information - S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision


MLP-based vision backbones have recently emerged. MLP-based vision architectures, with less inductive bias, achieve competitive performance in image recognition compared with CNNs and vision Transformers. Among them, the spatial-shift MLP (S2-MLP), which adopts a straightforward spatial-shift operation, outperforms pioneering works including MLP-Mixer and ResMLP. More recently, using smaller patches with a pyramid structure, Vision Permutator (ViP) and Global Filter Network (GFNet) achieve better performance than S2-MLP. In this paper, we improve the S2-MLP vision backbone. We expand the feature map along the channel dimension and split the expanded feature map into several parts. We conduct different spatial-shift operations on the split parts and exploit the split-attention operation to fuse them. Moreover, like these counterparts, we adopt smaller-scale patches and use a pyramid structure to boost image recognition accuracy. We term the improved spatial-shift MLP vision backbone S2-MLPv2. Using 55M parameters, our medium-scale model, S2-MLPv2-Medium, achieves 83.6% top-1 accuracy on the ImageNet-1K benchmark with 224×224 images, without self-attention or external training data.
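The expand, split, shift, and fuse steps described in the abstract can be sketched in NumPy as follows. This is an illustrative reading, not the authors' implementation. The abstract only says "several parts", so the three-way split, the one-pixel shift directions, the border handling, the GELU inside the attention, and all names (`spatial_shift`, `split_attention`, `s2mlpv2_block`, the `params` keys) are assumptions. Biases, normalization, the residual connection, and the pyramid structure are omitted.

```python
import numpy as np

# Assumed shift patterns: (axis, step) for each quarter of the channels.
SHIFT_A = [(0, 1), (0, -1), (1, 1), (1, -1)]
SHIFT_B = [(1, 1), (1, -1), (0, 1), (0, -1)]


def spatial_shift(x, order):
    """Shift four channel groups of an (H, W, C) map by one pixel each.

    Border rows/columns keep their original values (an assumed rule).
    """
    c = x.shape[-1]
    q = c // 4
    out = x.copy()
    for g, (axis, step) in enumerate(order):
        ch = slice(g * q, (g + 1) * q if g < 3 else c)
        dst = [slice(None), slice(None), ch]
        src = [slice(None), slice(None), ch]
        if step > 0:
            dst[axis], src[axis] = slice(1, None), slice(None, -1)
        else:
            dst[axis], src[axis] = slice(None, -1), slice(1, None)
        out[tuple(dst)] = x[tuple(src)]
    return out


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def split_attention(parts, w1, w2):
    """Fuse K same-shaped (H, W, C) parts with per-channel softmax weights."""
    stacked = np.stack(parts)                       # (K, H, W, C)
    k, _, _, c = stacked.shape
    pooled = stacked.sum(axis=(0, 1, 2))            # (C,) global summary
    logits = (gelu(pooled @ w1) @ w2).reshape(k, c)
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)  # softmax over K
    return (stacked * weights[:, None, None, :]).sum(axis=0)


def s2mlpv2_block(x, params):
    c = x.shape[-1]
    y = x @ params["expand"]                        # expand channels: C -> 3C
    x1 = spatial_shift(y[..., :c], SHIFT_A)         # first part, one shift pattern
    x2 = spatial_shift(y[..., c:2 * c], SHIFT_B)    # second part, a different pattern
    x3 = y[..., 2 * c:]                             # third part left unshifted
    fused = split_attention([x1, x2, x3], params["att1"], params["att2"])
    return fused @ params["proj"]                   # project back to C channels


rng = np.random.default_rng(0)
H, W, C = 4, 4, 8
params = {
    "expand": rng.normal(size=(C, 3 * C)) * 0.1,
    "att1": rng.normal(size=(C, C)) * 0.1,
    "att2": rng.normal(size=(C, 3 * C)) * 0.1,
    "proj": rng.normal(size=(C, C)) * 0.1,
}
x = rng.normal(size=(H, W, C))
out = s2mlpv2_block(x, params)
print(out.shape)  # (4, 4, 8)
```

Because the softmax weights sum to one across the parts for every channel, fusing identical parts returns the part unchanged; the attention only matters when the shifted parts differ.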
