FNet: Mixing Tokens with Fourier Transforms

We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that “mix” input tokens. These linear transformations, together with the simple nonlinearities in the feed-forward layers, are sufficient to model semantic relationships in several text classification tasks. Perhaps most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92% of the accuracy of BERT on the GLUE benchmark, while pre-training and running up to seven times faster on GPUs and twice as fast on TPUs. The resulting model, which we name FNet, scales very efficiently to long inputs: it matches the accuracy of the most accurate “efficient” Transformers on the Long Range Arena benchmark, while training and running faster across all sequence lengths on GPUs and at relatively shorter sequence lengths on TPUs. Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform their Transformer counterparts.
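To make the token-mixing idea concrete, below is a minimal, illustrative sketch of the unparameterized Fourier mixing sublayer, assuming the 2D DFT formulation used by FNet (a discrete Fourier Transform along the sequence and hidden dimensions, keeping only the real part). The function name `fourier_mix` and the toy shapes are assumptions for illustration, not the reference implementation.

```python
import jax.numpy as jnp

def fourier_mix(x):
    """Unparameterized Fourier token mixing (sketch).

    x: array of shape [batch, seq_len, hidden_dim] (or [seq_len, hidden_dim]).
    Applies a 2D DFT over the last two axes (sequence and hidden dimensions)
    and keeps only the real part, so the output stays real-valued and the
    sublayer has no learnable parameters.
    """
    return jnp.fft.fft2(x).real

# Example: mix a toy batch of token embeddings.
x = jnp.ones((2, 128, 64))   # [batch, seq_len, hidden_dim]
mixed = fourier_mix(x)       # same shape; each token now carries information
print(mixed.shape)           # (2, 128, 64)   from every other token
```

In an FNet encoder block, this mixing output takes the place of the self-attention output and feeds the usual residual connection, layer normalization, and position-wise feed-forward sublayer.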
