Untangling tradeoffs between recurrence and self-attention in neural networks

Attention and self-attention mechanisms, inspired by cognitive processes, are now central to state-of-the-art deep learning on sequential tasks. However, most recent progress hinges on heuristic approaches that offer limited insight into attention's role in model optimization and computation, and that rely on considerable memory and computational resources that scale poorly. In this work, we present a formal analysis of how self-attention affects gradient propagation in recurrent networks, and prove that it mitigates the problem of vanishing gradients when trying to capture long-term dependencies. Building on these results, we propose a relevancy screening mechanism, inspired by the cognitive process of memory consolidation, that allows for a scalable use of sparse self-attention with recurrence. While providing guarantees to avoid vanishing gradients, we use simple numerical experiments to demonstrate the tradeoffs in performance and computational resources achieved by efficiently balancing attention and recurrence. Based on our results, we propose a concrete direction of research to improve the scalability of attentive networks.
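
At a high level, the gradient argument can be illustrated with the usual sum-over-paths decomposition (a sketch of the standard reasoning, not the paper's proof). In a recurrent network whose state $h_T$ also reads from an earlier state $h_t$ through an attention context $c_T$, and ignoring attention paths at intermediate steps, the Jacobian splits into a recurrent product, which can vanish as $T - t$ grows, and a direct shortcut contributed by attention:

$$
\frac{\partial h_T}{\partial h_t} \;=\; \underbrace{\prod_{k=t+1}^{T} \frac{\partial h_k}{\partial h_{k-1}}}_{\text{recurrent path (can vanish)}} \;+\; \underbrace{\frac{\partial h_T}{\partial c_T}\,\frac{\partial c_T}{\partial h_t}}_{\text{attention shortcut}}.
$$

The sketch below illustrates how sparse self-attention over a bounded, screened memory of past hidden states could be combined with recurrence. It is a minimal, hypothetical implementation: the class name, the use of a GRU cell, and the keep-the-most-attended-states screening rule are illustrative assumptions, not the paper's relevancy screening mechanism.

```python
import torch
import torch.nn as nn


class SparseAttentiveRNN(nn.Module):
    """Recurrent cell whose update attends over a small, screened memory
    of past hidden states (illustrative sketch, not the paper's method)."""

    def __init__(self, input_size, hidden_size, memory_size=16):
        super().__init__()
        self.cell = nn.GRUCell(input_size, hidden_size)
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        self.memory_size = memory_size  # hard cap on retained past states

    def forward(self, inputs):
        # inputs: (seq_len, batch, input_size)
        batch = inputs.size(1)
        h = inputs.new_zeros(batch, self.cell.hidden_size)
        memory = []      # retained past hidden states, each (batch, hidden)
        relevance = []   # accumulated attention mass per retained state
        outputs = []
        for x_t in inputs:
            if memory:
                mem = torch.stack(memory, dim=1)            # (batch, m, hidden)
                q = self.query(h).unsqueeze(1)              # (batch, 1, hidden)
                scores = (q * self.key(mem)).sum(-1)        # (batch, m)
                attn = torch.softmax(scores, dim=-1)
                context = (attn.unsqueeze(-1) * self.value(mem)).sum(1)
                # Gradients flow to the retained states through this shortcut,
                # bypassing the long product of recurrent Jacobians.
                h = h + context
                # States that keep receiving attention accumulate relevance.
                for i in range(len(memory)):
                    relevance[i] += attn[:, i].mean().item()
            h = self.cell(x_t, h)
            memory.append(h)
            relevance.append(0.0)
            # Screening: once the memory is full, drop the least-relevant state
            # so attention stays sparse and memory cost stays bounded.
            if len(memory) > self.memory_size:
                drop = min(range(len(relevance)), key=relevance.__getitem__)
                memory.pop(drop)
                relevance.pop(drop)
            outputs.append(h)
        return torch.stack(outputs)
```

In this sketch the per-step cost of attention is bounded by `memory_size` rather than growing with sequence length, which is the kind of tradeoff between computational resources and long-term credit assignment that the abstract refers to.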
