Linear Complexity Self-Attention with 3rd Order Polynomials

Self-attention mechanisms and non-local blocks have become crucial building blocks of state-of-the-art neural architectures thanks to their unparalleled ability to capture long-range dependencies in the input. However, their cost is quadratic in the number of spatial positions, which makes them impractical in many real-world applications. In this work, we analyze these methods through a polynomial lens and show that self-attention can be seen as a special case of a 3rd-order polynomial. Within this polynomial framework, we design polynomial operators that access the same data patterns as non-local and self-attention blocks while reducing the complexity from quadratic to linear. As a result, we propose two modules (Poly-NL and Poly-SA) that can be used as "drop-in" replacements for more complex non-local and self-attention layers in state-of-the-art CNN and ViT architectures. Our modules achieve comparable, if not better, performance across a wide range of computer vision tasks while keeping a complexity equivalent to that of a standard linear layer.
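To make the linear-complexity claim concrete, the sketch below shows the underlying observation in code: once the softmax is removed, scaled dot-product attention becomes a 3rd-order polynomial of its input, and re-associating the matrix products avoids ever forming the N×N attention matrix, bringing the cost from O(N²D) to O(ND²). This is only a minimal illustration under our own assumptions (module name, single head, mean normalization); it is not the exact Poly-NL or Poly-SA parametrization.

```python
import torch
import torch.nn as nn


class LinearThirdOrderAttention(nn.Module):
    """Illustrative softmax-free attention block with linear complexity.

    Each output entry is a sum of products of three linear functions of the
    input, i.e. a 3rd-order polynomial in X, yet no N x N matrix is built.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)
        self.w_k = nn.Linear(dim, dim, bias=False)
        self.w_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        n = x.shape[1]
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)

        # Quadratic formulation: (q @ k^T) @ v builds an N x N matrix.
        # Re-associating as q @ (k^T @ v) yields a D x D "context" matrix,
        # so the overall cost is O(N * D^2) instead of O(N^2 * D).
        context = k.transpose(-2, -1) @ v / n   # (batch, dim, dim)
        return q @ context                      # (batch, tokens, dim)


if __name__ == "__main__":
    block = LinearThirdOrderAttention(dim=64)
    tokens = torch.randn(2, 196, 64)  # e.g. 14x14 patch tokens from a ViT
    print(block(tokens).shape)        # torch.Size([2, 196, 64])
```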
