Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences

Attention is a commonly used mechanism in sequence processing, but its O(n^2) complexity prevents its application to long sequences. The recently introduced Neural Shuffle-Exchange network offers a computation-efficient alternative, enabling the modelling of long-range dependencies in O(n log n) time. The model, however, is quite complex, involving a sophisticated gating mechanism derived from the Gated Recurrent Unit. In this paper, we present a simple and lightweight variant of the Shuffle-Exchange network, based on a residual network employing GELU and Layer Normalization. The proposed architecture not only scales to longer sequences but also converges faster and provides better accuracy. It surpasses the Shuffle-Exchange network on the LAMBADA language modelling task and achieves state-of-the-art performance on the MusicNet dataset for music transcription while using significantly fewer parameters. We show how to combine the Shuffle-Exchange network with convolutional layers, establishing it as a useful building block in long sequence processing applications.
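
To make the layer structure concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a residual switch unit built from Layer Normalization, a feed-forward block with GELU, and a residual connection, combined with the perfect-shuffle permutation that gives the network its O(n log n) connectivity. The class and function names, the hidden size, and the 4x expansion inside the feed-forward block are illustrative assumptions.

```python
# A minimal sketch, assuming PyTorch; names (ResidualSwitchUnit, perfect_shuffle)
# and sizes are illustrative, not the paper's exact implementation.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualSwitchUnit(nn.Module):
    """Mix each adjacent pair of sequence elements with
    LayerNorm -> Linear -> GELU -> Linear and a residual connection
    (a simplification of the residual switch unit described above)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(2 * dim)
        self.fc1 = nn.Linear(2 * dim, 4 * dim)   # 4x expansion is an assumption
        self.fc2 = nn.Linear(4 * dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                        # length n must be even
        pairs = x.reshape(b, n // 2, 2 * d)      # join adjacent elements
        h = self.fc2(F.gelu(self.fc1(self.norm(pairs))))
        return (pairs + h).reshape(b, n, d)      # residual, then split pairs back


def perfect_shuffle(x: torch.Tensor) -> torch.Tensor:
    """Riffle-shuffle the sequence axis: interleave the first and second
    halves so that log2(n) shuffle steps can connect any pair of positions."""
    b, n, d = x.shape
    return x.reshape(b, 2, n // 2, d).transpose(1, 2).reshape(b, n, d)


# Toy forward pass over a sequence of length 16 (a power of two):
# log2(n) alternating switch-and-shuffle steps give O(n log n) total work.
x = torch.randn(1, 16, 8)
unit = ResidualSwitchUnit(8)
for _ in range(int(math.log2(x.shape[1]))):
    x = perfect_shuffle(unit(x))
```

The sketch reuses one set of switch weights across all steps; it is only meant to show how the residual switch unit and the shuffle permutation compose into a layer whose cost grows as O(n log n) rather than O(n^2).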
