MAXIM: Multi-Axis MLP for Image Processing

Recent progress on Transformers and multilayer perceptron (MLP) models provides new network architectural designs for computer vision tasks. Although these models have proved effective in many vision tasks such as image recognition, challenges remain in adapting them to low-level vision. The inflexibility to support high-resolution images and the limitations of local attention are perhaps the main bottlenecks. In this work, we present a multi-axis MLP-based architecture called MAXIM that can serve as an efficient and flexible general-purpose vision backbone for image processing tasks. MAXIM uses a UNet-shaped hierarchical structure and supports long-range interactions enabled by spatially-gated MLPs. Specifically, MAXIM contains two MLP-based building blocks: a multi-axis gated MLP that allows for efficient and scalable spatial mixing of local and global visual cues, and a cross-gating block, an alternative to cross-attention, which accounts for cross-feature conditioning. Both modules are based exclusively on MLPs, yet they are also both global and ‘fully-convolutional’, two properties that are desirable for image processing. Our extensive experimental results show that the proposed MAXIM model achieves state-of-the-art performance on more than ten benchmarks across a range of image processing tasks, including denoising, deblurring, deraining, dehazing, and enhancement, while requiring fewer or comparable numbers of parameters and FLOPs compared with competitive models. The source code and trained models will be available at https://github.com/google-research/maxim.
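To make the idea of multi-axis spatial mixing more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that the channels are split into two halves, that one half is mixed inside non-overlapping b x b windows (local cues), and that the other half is mixed across a g x g grid of strided samples (global cues). The gate used here is a simplified spatial gate (the tokens multiplied by a learned linear mix of themselves along the token axis). All function names, the window and grid sizes, and the gate form are assumptions chosen for illustration.

```python
import numpy as np

def block_partition(x, b):
    """(H, W, C) -> (num_windows, b*b, C): non-overlapping b x b windows."""
    h, w, c = x.shape
    x = x.reshape(h // b, b, w // b, b, c).transpose(0, 2, 1, 3, 4)
    return x.reshape((h // b) * (w // b), b * b, c)

def block_unpartition(t, h, w, b):
    """Inverse of block_partition."""
    c = t.shape[-1]
    x = t.reshape(h // b, w // b, b, b, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)

def grid_partition(x, g):
    """(H, W, C) -> (num_groups, g*g, C): each group holds g*g pixels
    sampled with stride (H//g, W//g), so it spans the whole image."""
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c).transpose(1, 3, 0, 2, 4)
    return x.reshape((h // g) * (w // g), g * g, c)

def grid_unpartition(t, h, w, g):
    """Inverse of grid_partition."""
    c = t.shape[-1]
    x = t.reshape(h // g, w // g, g, g, c).transpose(2, 0, 3, 1, 4)
    return x.reshape(h, w, c)

def spatial_gate(tokens, weight, bias):
    """Simplified spatial gate (an assumption): tokens * (W @ tokens + bias),
    where W mixes along the token axis. tokens: (groups, n, c)."""
    mixed = np.einsum("mn,gnc->gmc", weight, tokens) + bias[None, :, None]
    return tokens * mixed

def multi_axis_mix(x, b, g, local_params, global_params):
    """Mix half of the channels locally (windows) and half globally (grid)."""
    h, w, _ = x.shape
    x_local, x_global = np.split(x, 2, axis=-1)
    local = spatial_gate(block_partition(x_local, b), *local_params)
    glob = spatial_gate(grid_partition(x_global, g), *global_params)
    return np.concatenate(
        [block_unpartition(local, h, w, b), grid_unpartition(glob, h, w, g)],
        axis=-1,
    )

# The mixing weights have shape (b*b, b*b) and (g*g, g*g), independent of
# the image size, so the same parameters apply to different resolutions.
rng = np.random.default_rng(0)
b = g = 4
local_params = (0.02 * rng.normal(size=(b * b, b * b)), np.ones(b * b))
global_params = (0.02 * rng.normal(size=(g * g, g * g)), np.ones(g * g))
for size in (8, 16):
    x = rng.normal(size=(size, size, 6))
    y = multi_axis_mix(x, b, g, local_params, global_params)
    assert y.shape == x.shape
```

Because the parameter shapes depend only on the window and grid sizes, this sketch illustrates how such a layer can be applied to inputs of varying resolution, in the spirit of the ‘fully-convolutional’ property described above. It requires the height and width to be divisible by both b and g and the channel count to be even.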
