Efficient Representation Learning via Adaptive Context Pooling

Self-attention mechanisms model long-range context by using pairwise attention between all input tokens. In doing so, they assume a fixed attention granularity defined by the individual tokens (e.g., text characters or image pixels), which may not be optimal for modeling complex dependencies at higher levels. In this paper, we propose ContextPool to address this problem by adapting the attention granularity for each token. Inspired by the success of ConvNets combined with pooling to capture long-range dependencies, we learn to pool neighboring features for each token before computing attention in a given attention layer. The pooling weights and support size are adaptively determined, allowing the pooled features to encode meaningful context at varying scales. We show that ContextPool makes attention models more expressive, achieving strong performance often with fewer layers and thus significantly reduced cost. Experiments validate that our ContextPool module, when plugged into transformer models, matches or surpasses state-of-the-art performance using less compute on several language and image benchmarks, outperforms recent works with learned context sizes or sparse attention patterns, and is also applicable to ConvNets for efficient feature learning.
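
The abstract describes pooling neighboring features for each token, with learned pooling weights and an adaptively predicted support size, before standard attention is applied. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the linear heads (`weight_head`, `size_head`), the Gaussian-shaped locality bias, and the `max_support` hyperparameter are illustrative assumptions. The intent is only to show how a wider predicted support broadens each token's pooling neighborhood before attention is computed.

```python
# Minimal sketch of adaptive context pooling applied before self-attention.
# Illustrative only: the head names, Gaussian locality bias, and max_support
# are assumptions, not the paper's reference implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextPool(nn.Module):
    """Pools neighboring token features with learned per-token weights
    and an adaptively predicted support size (hypothetical parameterization)."""

    def __init__(self, dim, max_support=9):
        super().__init__()
        self.max_support = max_support
        # Learned importance of each token when it is pooled by its neighbors.
        self.weight_head = nn.Linear(dim, 1)
        # Per-token support size in (0, 1), scaled to max_support positions.
        self.size_head = nn.Linear(dim, 1)

    def forward(self, x):                                   # x: (B, N, D)
        B, N, D = x.shape
        w = self.weight_head(x).squeeze(-1)                 # (B, N) pooling weights
        s = torch.sigmoid(self.size_head(x)).squeeze(-1)    # (B, N) relative sizes
        support = (s * self.max_support).clamp(min=1e-3)    # support in tokens

        # Relative distance between every pooling token i and neighbor j.
        pos = torch.arange(N, device=x.device, dtype=x.dtype)
        dist = (pos[None, :] - pos[:, None]).abs()          # (N, N)

        # Gaussian-shaped locality bias: larger support -> broader pooling.
        sigma = support.unsqueeze(-1)                        # (B, N, 1)
        locality = -0.5 * (dist.unsqueeze(0) / sigma) ** 2   # (B, N, N)

        # Combine locality bias with learned neighbor weights and normalize.
        pool = F.softmax(w.unsqueeze(1) + locality, dim=-1)  # (B, N, N)
        return pool @ x                                       # pooled features


class ContextPoolAttention(nn.Module):
    """Standard multi-head self-attention over context-pooled features."""

    def __init__(self, dim, num_heads=4, max_support=9):
        super().__init__()
        self.cpool = ContextPool(dim, max_support)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        pooled = self.cpool(x)
        out, _ = self.attn(pooled, pooled, pooled)
        return out


if __name__ == "__main__":
    layer = ContextPoolAttention(dim=64)
    tokens = torch.randn(2, 16, 64)        # (batch, tokens, dim)
    print(layer(tokens).shape)             # torch.Size([2, 16, 64])
```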
