Residual Shuffle-Exchange Networks for Fast Processing of Long Sequences

Attention is a commonly used mechanism in sequence processing, but its O(n^2) complexity prevents its application to long sequences. The recently introduced Neural Shuffle-Exchange network offers a computation-efficient alternative, enabling the modelling of long-range dependencies in O(n log n) time. The model, however, is quite complex, involving a sophisticated gating mechanism derived from the Gated Recurrent Unit. In this paper, we present a simple and lightweight variant of the Shuffle-Exchange network, based on a residual network employing GELU and Layer Normalization. The proposed architecture not only scales to longer sequences but also converges faster and provides better accuracy. It surpasses the Shuffle-Exchange network on the LAMBADA language modelling task and achieves state-of-the-art performance on the MusicNet dataset for music transcription while using significantly fewer parameters. We show how to combine the Shuffle-Exchange network with convolutional layers, establishing it as a useful building block in long sequence processing applications.
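
To make the layer structure concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of a residual switch unit built from Layer Normalization, a feed-forward block with GELU, and a residual connection, combined with the perfect-shuffle permutation that gives the network its O(n log n) connectivity. The class and function names, the hidden size, and the 4x expansion inside the feed-forward block are illustrative assumptions.

```python
# A minimal sketch, assuming PyTorch; names (ResidualSwitchUnit, perfect_shuffle)
# and sizes are illustrative, not the paper's exact implementation.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualSwitchUnit(nn.Module):
    """Mix each adjacent pair of sequence elements with
    LayerNorm -> Linear -> GELU -> Linear and a residual connection
    (a simplification of the residual switch unit described above)."""

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(2 * dim)
        self.fc1 = nn.Linear(2 * dim, 4 * dim)   # 4x expansion is an assumption
        self.fc2 = nn.Linear(4 * dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                        # length n must be even
        pairs = x.reshape(b, n // 2, 2 * d)      # join adjacent elements
        h = self.fc2(F.gelu(self.fc1(self.norm(pairs))))
        return (pairs + h).reshape(b, n, d)      # residual, then split pairs back


def perfect_shuffle(x: torch.Tensor) -> torch.Tensor:
    """Riffle-shuffle the sequence axis: interleave the first and second
    halves so that log2(n) shuffle steps can connect any pair of positions."""
    b, n, d = x.shape
    return x.reshape(b, 2, n // 2, d).transpose(1, 2).reshape(b, n, d)


# Toy forward pass over a sequence of length 16 (a power of two):
# log2(n) alternating switch-and-shuffle steps give O(n log n) total work.
x = torch.randn(1, 16, 8)
unit = ResidualSwitchUnit(8)
for _ in range(int(math.log2(x.shape[1]))):
    x = perfect_shuffle(unit(x))
```

The sketch reuses one set of switch weights across all steps; it is only meant to show how the residual switch unit and the shuffle permutation compose into a layer whose cost grows as O(n log n) rather than O(n^2).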
