CycleMLP: A MLP-like Architecture for Dense Prediction

This paper presents a simple MLP-like architecture, CycleMLP, a versatile backbone for visual recognition and dense prediction. Unlike modern MLP architectures such as MLP-Mixer [49], ResMLP [50], and gMLP [35], whose designs are tied to a fixed image size and are therefore impractical for object detection and segmentation, CycleMLP has two advantages: (1) it can cope with various image sizes, and (2) it achieves linear computational complexity with respect to image size by using local windows, whereas previous MLPs incur quadratic cost because of their fully spatial connections. We build a family of models that surpass existing MLPs and reach an accuracy (83.2%) on ImageNet-1K classification comparable to state-of-the-art Transformers such as Swin Transformer [36] (83.3%), while using fewer parameters and FLOPs. We expand the applicability of MLP-like models, making them a versatile backbone for dense prediction tasks, and aim to provide a competitive baseline for MLP models on object detection, instance segmentation, and semantic segmentation. In particular, CycleMLP achieves 45.1 mIoU on ADE20K val, comparable to Swin (45.2 mIoU). Code is available at https://github.com/ShoufaChen/CycleMLP.
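To make the complexity claim concrete, below is a minimal sketch of the idea the abstract describes: each channel reads from a spatially shifted copy of the feature map, cycling through offsets inside a small local window, and a plain channel-mixing FC follows, so spatial context is gathered without any weight tied to the image size. The class name CycleFCSketch, the stepsize argument, and the torch.roll-based shifting are illustrative assumptions for this sketch, not the authors' implementation; the official code lives in the repository linked above.

    import torch
    import torch.nn as nn

    class CycleFCSketch(nn.Module):
        # Channel FC over per-channel spatially shifted features.
        # Spatial context comes only from the shifts, so compute and
        # parameters are linear in H*W, unlike a fully spatial FC whose
        # weight matrix is (H*W) x (H*W).
        def __init__(self, dim: int, stepsize: int = 3):
            super().__init__()
            self.stepsize = stepsize
            self.fc = nn.Linear(dim, dim)  # plain channel mixing (1x1 projection)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, H, W, C)
            C = x.shape[-1]
            shifted = torch.empty_like(x)
            for c in range(C):
                # Cycle through offsets -s//2 .. s//2 along the H axis as the
                # channel index grows, forming a "pseudo-kernel" in a local window.
                offset = (c % self.stepsize) - self.stepsize // 2
                shifted[..., c] = torch.roll(x[..., c], shifts=offset, dims=1)
            return self.fc(shifted)

    # Works at any resolution: no weight is tied to a fixed H or W.
    x = torch.randn(2, 14, 14, 64)
    print(CycleFCSketch(64)(x).shape)  # torch.Size([2, 14, 14, 64])

Because spatial mixing enters through the shifts rather than through a dense (HW) x (HW) weight, the layer's cost grows linearly with the number of pixels and the same weights transfer across input resolutions, which is what makes the backbone usable for detection and segmentation.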

[1] Omer Levy et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. BlackboxNLP@EMNLP, 2018.

[2] Li Fei-Fei et al. ImageNet: A large-scale hierarchical image database. CVPR, 2009.

[3] Xiaojun Chang et al. BossNAS: Exploring Hybrid CNN-transformers with Block-wisely Self-supervised Neural Architecture Search. ICCV, 2021.

[4] Christoph Feichtenhofer et al. Multiscale Vision Transformers. ICCV, 2021.

[5] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[6] Kaiming He et al. Focal Loss for Dense Object Detection. ICCV, 2017.

[7] Xilin Chen et al. Object-Contextual Representations for Semantic Segmentation. ECCV, 2020.

[8] Pietro Perona et al. Microsoft COCO: Common Objects in Context. ECCV, 2014.

[9] Yunchao Wei et al. CCNet: Criss-Cross Attention for Semantic Segmentation. ICCV, 2019.

[10] Ilya Sutskever et al. Learning Transferable Visual Models From Natural Language Supervision. ICML, 2021.

[11] Zhuowen Tu et al. Aggregated Residual Transformations for Deep Neural Networks. CVPR, 2017.

[12] Seong Joon Oh et al. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. ICCV, 2019.

[13] Shuicheng Yan et al. Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition. IEEE TPAMI, 2021.

[14] Andrew Zisserman et al. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2015.

[15] Kai Chen et al. MMDetection: Open MMLab Detection Toolbox and Benchmark. arXiv, 2019.

[16] Lawrence D. Jackel et al. Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1989.

[17] Yi Yang et al. Random Erasing Data Augmentation. AAAI, 2020.

[18] Yuning Jiang et al. Unified Perceptual Parsing for Scene Understanding. ECCV, 2018.

[19] Frank Hutter et al. Decoupled Weight Decay Regularization. ICLR, 2019.

[20] Kaiming He et al. Panoptic Feature Pyramid Networks. CVPR, 2019.

[21] Jun Fu et al. Adaptive Context Network for Scene Parsing. ICCV, 2019.

[22] Ross B. Girshick et al. Focal Loss for Dense Object Detection. IEEE TPAMI, 2020.

[23] Shuicheng Yan et al. Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet. arXiv, 2021.

[24] Bolei Zhou et al. Scene Parsing through ADE20K Dataset. CVPR, 2017.

[25] Yiming Yang et al. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. ACL, 2019.

[26] Georg Heigold et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR, 2021.

[27] Sebastian Ramos et al. The Cityscapes Dataset for Semantic Urban Scene Understanding. CVPR, 2016.

[28] J. Tukey et al. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 1965.

[29] Quoc V. Le et al. RandAugment: Practical automated data augmentation with a reduced search space. CVPRW, 2020.

[30] Yunfeng Cai et al. S2-MLP: Spatial-Shift MLP Architecture for Vision. WACV, 2022.

[31] Gedas Bertasius et al. Is Space-Time Attention All You Need for Video Understanding? ICML, 2021.

[32] Kaitao Song et al. PVTv2: Improved Baselines with Pyramid Vision Transformer. arXiv, 2021.

[33] Jiwen Lu et al. Global Filter Networks for Image Classification. NeurIPS, 2021.

[34] Yiming Yang et al. XLNet: Generalized Autoregressive Pretraining for Language Understanding. NeurIPS, 2019.

[35] Jun Fu et al. Dual Attention Network for Scene Segmentation. CVPR, 2019.

[36] Matthieu Cord et al. Training data-efficient image transformers & distillation through attention. ICML, 2021.

[37] Yoshua Bengio et al. Understanding the difficulty of training deep feedforward neural networks. AISTATS, 2010.

[38] Chen Sun et al. Multi-modal Transformer for Video Retrieval. ECCV, 2020.

[39] Vladlen Koltun et al. Multi-Scale Context Aggregation by Dilated Convolutions. ICLR, 2016.

[40] Kevin Gimpel et al. Gaussian Error Linear Units (GELUs). arXiv, 2016.

[41] Hongyi Zhang et al. mixup: Beyond Empirical Risk Minimization. ICLR, 2018.

[42] Quoc V. Le et al. Pay Attention to MLPs. NeurIPS, 2021.

[43] N. Codella et al. CvT: Introducing Convolutions to Vision Transformers. ICCV, 2021.

[44] Bin Li et al. Deformable DETR: Deformable Transformers for End-to-End Object Detection. ICLR, 2021.

[45] Alexander Kolesnikov et al. MLP-Mixer: An all-MLP Architecture for Vision. NeurIPS, 2021.

[46] Xing Sun et al. AS-MLP: An Axial Shifted MLP Architecture for Vision. arXiv, 2021.

[47] Geoffrey E. Hinton et al. Learning representations by back-propagating errors. Nature, 1986.

[48] Ling Shao et al. Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions. arXiv, 2021.

[49] Pieter Abbeel et al. Bottleneck Transformers for Visual Recognition. CVPR, 2021.

[50] Tao Xiang et al. Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers. CVPR, 2021.

[51] Matthieu Cord et al. ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training. IEEE TPAMI, 2021.

[52] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL, 2019.

[53] Anima Anandkumar et al. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. NeurIPS, 2021.

[54] Geoffrey E. Hinton et al. ImageNet classification with deep convolutional neural networks. Commun. ACM, 2017.

[55] Ross B. Girshick et al. Mask R-CNN. ICCV, 2017.

[56] Martin Jaggi et al. On the Relationship between Self-Attention and Convolutional Layers. ICLR, 2020.

[57] Aleksandr Petiushko et al. MDMMT: Multidomain Multimodal Transformer for Video Retrieval. CVPRW, 2021.

[58] Matthieu Cord et al. Going deeper with Image Transformers. ICCV, 2021.

[59] Shi-Min Hu et al. Beyond Self-Attention: External Attention Using Two Linear Layers for Visual Tasks. IEEE TPAMI, 2021.

[60] Thomas G. Dietterich et al. Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. ICLR, 2019.

[61] Dumitru Erhan et al. Going deeper with convolutions. CVPR, 2015.

[62] Chunhua Shen et al. End-to-End Video Instance Segmentation with Transformers. CVPR, 2021.

[63] Quoc V. Le et al. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. ICML, 2019.

[64] Bo Chen et al. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv, 2017.

[65] Natalia Gimelshein et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. NeurIPS, 2019.

[66] Kilian Q. Weinberger et al. Deep Networks with Stochastic Depth. ECCV, 2016.

[67] Zheng Zhang et al. Disentangled Non-Local Neural Networks. ECCV, 2020.

[68] Cordelia Schmid et al. ViViT: A Video Vision Transformer. ICCV, 2021.

[69] Mark Chen et al. Language Models are Few-Shot Learners. NeurIPS, 2020.

[70] Lukasz Kaiser et al. Attention is All you Need. NIPS, 2017.

[71] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.

[72] Kilian Q. Weinberger et al. Densely Connected Convolutional Networks. CVPR, 2017.

[73] Yi Jiang et al. Sparse R-CNN: End-to-End Object Detection with Learnable Proposals. CVPR, 2021.

[74] Stephen Lin et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. ICCV, 2021.

[75] Nicolas Usunier et al. End-to-End Object Detection with Transformers. ECCV, 2020.

[76] Yuan He et al. Rethinking the Design Principles of Robust Vision Transformer. arXiv, 2021.

[77] Sergey Ioffe et al. Rethinking the Inception Architecture for Computer Vision. CVPR, 2016.

[78] Kaiming He et al. Designing Network Design Spaces. CVPR, 2020.