Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding

Attentional mechanisms are order-invariant. Positional encoding is therefore a crucial component that allows attention-based architectures such as the Transformer to model sequences and images where the position of information matters. In this paper, we propose a novel positional encoding method based on learnable Fourier features. Instead of hard-coding each position as a token or a vector, we represent each position, which can be multi-dimensional, as a trainable encoding based on a learnable Fourier feature mapping, modulated by a multi-layer perceptron. This representation is particularly advantageous for spatial multi-dimensional positions, e.g., pixel positions in an image, where L2 distances or more complex positional relationships need to be captured. Our experiments on several public benchmark tasks show that our learnable Fourier feature representation for multi-dimensional positional encoding outperforms existing methods, both improving accuracy and enabling faster convergence.
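
The abstract describes the encoding only at a high level: a learnable Fourier feature mapping of the (possibly multi-dimensional) position, followed by a small MLP. Below is a minimal sketch in PyTorch of one way such a module could look. The dimension names (pos_dim, feat_dim, hidden_dim, out_dim), the GELU activation, the MLP depth, and the 1/sqrt(feat_dim) scaling are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of a learnable-Fourier-feature positional encoding,
# assuming a PyTorch setting; hyperparameters are illustrative only.
import math
import torch
import torch.nn as nn


class LearnableFourierPositionalEncoding(nn.Module):
    def __init__(self, pos_dim: int = 2, feat_dim: int = 256,
                 hidden_dim: int = 128, out_dim: int = 256):
        super().__init__()
        # Learnable Fourier feature mapping: projects positions onto trainable frequencies.
        self.w_r = nn.Linear(pos_dim, feat_dim // 2, bias=False)
        # MLP that modulates the Fourier features into the final positional encoding.
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )
        self.scale = 1.0 / math.sqrt(feat_dim)

    def forward(self, positions: torch.Tensor) -> torch.Tensor:
        # positions: (..., pos_dim), e.g. (batch, num_pixels, 2) for 2-D pixel coordinates.
        projected = self.w_r(positions)                      # (..., feat_dim // 2)
        fourier = torch.cat([torch.cos(projected),
                             torch.sin(projected)], dim=-1)  # (..., feat_dim)
        return self.mlp(self.scale * fourier)                # (..., out_dim)


# Example usage: encode the 2-D pixel positions of a 32x32 grid.
if __name__ == "__main__":
    ys, xs = torch.meshgrid(torch.arange(32.), torch.arange(32.), indexing="ij")
    positions = torch.stack([ys, xs], dim=-1).reshape(1, -1, 2)  # (1, 1024, 2)
    encoder = LearnableFourierPositionalEncoding()
    print(encoder(positions).shape)  # torch.Size([1, 1024, 256])
```

Because the frequencies in w_r are trained end-to-end rather than fixed, the cosine/sine features can adapt to whatever spatial relationships (e.g., L2 distances between pixel positions) the downstream attention layers need.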
