Hire-MLP: Vision MLP via Hierarchical Rearrangement

Previous vision MLPs such as MLP-Mixer and ResMLP accept linearly flattened image patches as input, which makes them inflexible to different input sizes and limits their ability to capture spatial information. This design keeps MLPs from matching the performance of their transformer-based counterparts and prevents them from becoming general backbones for computer vision. This paper presents Hire-MLP, a simple yet competitive vision MLP architecture built on hierarchical rearrangement, which operates at two levels. Specifically, the inner-region rearrangement captures local information inside a spatial region, while the cross-region rearrangement enables communication between different regions and captures global context by circularly shifting all tokens along the spatial directions. Extensive experiments demonstrate the effectiveness of Hire-MLP as a versatile backbone for various vision tasks. In particular, Hire-MLP achieves competitive results on image classification, object detection, and semantic segmentation: 83.8% top-1 accuracy on ImageNet, 51.7% box AP and 44.8% mask AP on COCO val2017, and 49.9% mIoU on ADE20K, surpassing previous transformer-based and MLP-based models with a better accuracy-throughput trade-off. Code is available at https://github.com/ggjy/Hire-Wave-MLP.pytorch.
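
As a rough illustration of the two rearrangements, below is a minimal PyTorch sketch of one block operating along the height axis; the module name HireBlockSketch, the region size h, the shift step, and the MLP widths are assumptions for illustration, not the paper's exact implementation (see the official repository linked above for that).

```python
import torch
import torch.nn as nn


class HireBlockSketch(nn.Module):
    """Sketch of Hire-MLP's hierarchical rearrangement along the height axis.

    Hypothetical module: region size `h`, shift step, and MLP widths are
    illustrative assumptions, not the paper's configuration.
    """

    def __init__(self, dim, h=2, shift=1):
        super().__init__()
        self.h = h          # tokens per spatial region
        self.shift = shift  # circular shift step for cross-region mixing
        # Inner-region rearrangement folds the h tokens of a region into the
        # channel dimension and mixes them with a channel MLP.
        self.mlp = nn.Sequential(
            nn.Linear(dim * h, dim * h // 2),
            nn.GELU(),
            nn.Linear(dim * h // 2, dim * h),
        )

    def forward(self, x):  # x: (B, H, W, C), with H divisible by h
        B, H, W, C = x.shape
        # Cross-region rearrangement: circularly shift all tokens along the
        # height direction so that neighboring regions exchange tokens.
        x = torch.roll(x, shifts=self.shift, dims=1)
        # Inner-region rearrangement: group every h rows into one region and
        # concatenate the region's tokens along channels.
        x = x.view(B, H // self.h, self.h, W, C)
        x = x.permute(0, 1, 3, 2, 4).reshape(B, H // self.h, W, self.h * C)
        x = self.mlp(x)  # local mixing inside each region
        # Restore the original (B, H, W, C) token layout.
        x = x.view(B, H // self.h, W, self.h, C)
        x = x.permute(0, 1, 3, 2, 4).reshape(B, H, W, C)
        # Undo the circular shift to realign tokens spatially.
        x = torch.roll(x, shifts=-self.shift, dims=1)
        return x


block = HireBlockSketch(dim=64, h=2, shift=1)
y = block(torch.randn(1, 8, 8, 64))  # -> (1, 8, 8, 64)
```

The full architecture presumably applies an analogous rearrangement along the width axis and alternates shift directions across blocks; this sketch shows only the height branch to keep the mechanism visible.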
