ResMLP: Feedforward Networks for Image Classification With Data-Efficient Training

We present ResMLP, an architecture built entirely upon multi-layer perceptrons for image classification. It is a simple residual network that alternates (i) a linear layer in which image patches interact, independently and identically across channels, and (ii) a two-layer feed-forward network in which channels interact independently per patch. When trained with a modern strategy using heavy data augmentation and, optionally, distillation, it attains surprisingly good accuracy/complexity trade-offs on ImageNet. We will share our code, based on the Timm library, along with pre-trained models.
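To make the block structure concrete, here is a minimal PyTorch-style sketch of one residual block as described above. Module names, dimensions, the GELU non-linearity, and the expansion factor are illustrative assumptions, and the sketch omits the normalization and initialization details of the released implementation.

```python
import torch
import torch.nn as nn

class ResMLPBlock(nn.Module):
    """One residual block as described in the abstract: a cross-patch linear
    layer followed by a per-patch two-layer feed-forward network.
    Names and hyper-parameters are illustrative, not the released code."""

    def __init__(self, num_patches: int, dim: int, hidden_dim: int):
        super().__init__()
        # (i) linear layer mixing patches, applied identically to every channel
        self.cross_patch = nn.Linear(num_patches, num_patches)
        # (ii) two-layer feed-forward network mixing channels, applied per patch
        self.cross_channel = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, num_patches, dim)
        # patch mixing: transpose so the linear layer acts on the patch axis
        x = x + self.cross_patch(x.transpose(1, 2)).transpose(1, 2)
        # channel mixing, applied independently to each patch
        x = x + self.cross_channel(x)
        return x


# Example usage: a 224x224 image split into 16x16 patches gives 196 tokens
block = ResMLPBlock(num_patches=196, dim=384, hidden_dim=4 * 384)
tokens = torch.randn(2, 196, 384)
out = block(tokens)  # shape (2, 196, 384)
```

Stacking such blocks, with a patch-embedding layer in front and average pooling plus a linear classifier on top, yields the overall residual architecture summarized above.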
