CAPE: Encoding Relative Positions with Continuous Augmented Positional Embeddings

Without positional information, attention-based Transformer neural networks are permutation-invariant. Absolute and relative positional embeddings are the most common ways to provide Transformer models with positional information. Absolute positional embeddings are simple to implement, but generalize poorly when a model is evaluated on sequences longer than those seen at training time. Relative positions are more robust to changes in input length, but are more complex to implement and reduce model throughput due to extra computation and memory costs. In this paper, we propose an augmentation-based approach (CAPE) for absolute positional embeddings that keeps the advantages of both absolute embeddings (simplicity and speed) and relative embeddings (better generalization). In addition, our empirical evaluation of state-of-the-art models in machine translation, image recognition, and speech recognition demonstrates that CAPE leads to better generalization as well as increased stability with respect to training hyper-parameters.
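The abstract does not spell out the exact augmentations, but the core idea of perturbing continuous positions at training time before computing standard sinusoidal embeddings can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names and the specific augmentations (a global shift, per-position jitter, and a global scaling) are assumptions chosen to convey the mechanism.

```python
import numpy as np

def sinusoidal_embedding(positions, dim):
    """Standard sinusoidal embedding evaluated at (possibly non-integer) positions."""
    inv_freq = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * inv_freq[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def cape_positions(seq_len, max_global_shift=5.0, max_local_shift=0.5,
                   max_global_scale=1.03, training=True, rng=None):
    """Illustrative CAPE-style augmentation of continuous positions (a sketch,
    not the paper's exact recipe): a random global shift, per-position jitter,
    and a global scaling, applied only at training time."""
    rng = rng or np.random.default_rng()
    pos = np.arange(seq_len, dtype=np.float64)
    if training:
        pos = pos + rng.uniform(-max_global_shift, max_global_shift)          # global shift
        pos = pos + rng.uniform(-max_local_shift, max_local_shift, seq_len)   # local jitter
        pos = pos * np.exp(rng.uniform(-np.log(max_global_scale),
                                       np.log(max_global_scale)))             # global scale
    return pos

# Usage: positional embeddings for a length-100 sequence with model dimension 256.
emb = sinusoidal_embedding(cape_positions(100), dim=256)
```

Because the augmented positions are continuous rather than fixed integer indices, the model cannot rely on any single absolute offset, which is the intuition behind the improved length generalization claimed above; at inference time the augmentation is disabled and plain positions are used.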
