S2-MLP: Spatial-Shift MLP Architecture for Vision

Recently, the Vision Transformer (ViT) and its follow-up works abandon convolution and exploit the self-attention operation, attaining comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both convolution and self-attention, proposing an architecture containing only MLP layers. To achieve cross-patch communication, it devises an additional token-mixing MLP besides the channel-mixing MLP. MLP-Mixer achieves promising results when trained on an extremely large-scale dataset, but it cannot match its CNN and ViT counterparts when trained on medium-scale datasets such as ImageNet-1K and ImageNet-21K. This performance drop motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of depthwise convolution with a global receptive field and a spatial-specific configuration. However, the global receptive field and the spatial-specific property make the token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure-MLP architecture, spatial-shift MLP (S2-MLP). Different from MLP-Mixer, our S2-MLP contains only channel-mixing MLPs; communication between patches is handled by a spatial-shift operation, which has a local receptive field, is spatial-agnostic, and is parameter-free and computationally efficient. The proposed S2-MLP attains higher recognition accuracy than MLP-Mixer when trained on the ImageNet-1K dataset. Meanwhile, S2-MLP matches the performance of ViT on ImageNet-1K with a considerably simpler architecture and fewer FLOPs and parameters.
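
To make the spatial-shift operation concrete, the following is a minimal PyTorch sketch, not the paper's reference implementation: the channels of a patch feature map are split into four groups, and each group is shifted by one patch along one of the four spatial directions, so the subsequent channel-mixing MLP can blend information from neighboring patches. The grouping order and the border handling chosen here are illustrative assumptions.

```python
import torch

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """Parameter-free spatial shift over a patch feature map.

    x: (B, H, W, C) tensor of patch embeddings. Channels are split
    into four groups; each group is shifted by one patch in one of
    the four directions. Border patches keep their original values
    where no neighbor exists (an assumed boundary policy).
    """
    _, _, _, C = x.shape
    g = C // 4  # channels per direction group
    out = x.clone()
    out[:, 1:, :, 0*g:1*g] = x[:, :-1, :, 0*g:1*g]  # group 0: shift down
    out[:, :-1, :, 1*g:2*g] = x[:, 1:, :, 1*g:2*g]  # group 1: shift up
    out[:, :, 1:, 2*g:3*g] = x[:, :, :-1, 2*g:3*g]  # group 2: shift right
    out[:, :, :-1, 3*g:4*g] = x[:, :, 1:, 3*g:4*g]  # group 3: shift left
    return out
```

Because the shift is pure indexing, it adds no learnable parameters and negligible computation; each patch exchanges information only with its immediate neighbors, which matches the local, spatial-agnostic behavior described above.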
