Bird-Eye Transformers for Text Generation Models

Transformers have become an indispensable module for text generation models since their great success in machine translation. Previous works attribute the success of transformers to the query-key-value dot-product attention, which provides a robust inductive bias through fully connected token graphs. However, we find that self-attention has a severe limitation: when predicting the (i+1)-th token, self-attention takes only the i-th token as an information collector, and it tends to give high attention weights to tokens similar to itself. Therefore, most of the historical information that occurred before the i-th token is not taken into consideration. Based on this observation, in this paper we propose a new architecture, called the bird-eye transformer (BET), which goes one step further and improves transformers by reweighting self-attention to encourage it to focus more on important historical information. We conduct experiments on multiple text generation tasks, including machine translation (2 datasets) and language modeling (3 datasets). The results show that our proposed model outperforms the baseline transformer architectures on all datasets. The code is released at: https://sites.google.com/
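To make the abstract's description concrete, the sketch below shows standard causal (autoregressive) scaled dot-product self-attention plus one possible reweighting term that biases each query toward earlier positions in the history. This is a minimal illustrative sketch: the abstract does not specify BET's actual reweighting function, so the function name `reweighted_causal_attention`, the `history_strength` parameter, and the logarithmic distance bias are assumptions for illustration only, not the paper's method.

```python
# Minimal sketch of causal self-attention with a *hypothetical* history-oriented
# reweighting term. The distance-based bias below is an illustrative assumption;
# it is NOT the reweighting scheme proposed in the BET paper.
import math
import torch
import torch.nn.functional as F


def reweighted_causal_attention(q, k, v, history_strength=0.1):
    """q, k, v: tensors of shape (batch, seq_len, d_model)."""
    d = q.size(-1)
    seq_len = q.size(1)

    # Standard scaled dot-product scores with a causal mask,
    # so position i can only attend to positions j <= i.
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)                 # (B, L, L)
    causal_mask = torch.ones(seq_len, seq_len).triu(1).bool()       # True above diagonal
    scores = scores.masked_fill(causal_mask, float("-inf"))

    # Hypothetical "bird-eye" reweighting: add a bias that grows slowly with the
    # distance i - j, nudging each query to spread attention over earlier tokens
    # instead of concentrating on tokens similar to the most recent one.
    pos = torch.arange(seq_len)
    distance = (pos.unsqueeze(1) - pos.unsqueeze(0)).clamp(min=0).float()  # (L, L)
    scores = scores + history_strength * torch.log1p(distance)

    weights = F.softmax(scores, dim=-1)
    return weights @ v


# Toy usage: batch of 2 sequences, 8 tokens, 16-dimensional embeddings.
x = torch.randn(2, 8, 16)
out = reweighted_causal_attention(x, x, x)
print(out.shape)  # torch.Size([2, 8, 16])
```

The only design point the code commits to is the one stated in the abstract: the attention scores are modified before the softmax so that historical positions receive more weight than vanilla dot-product similarity alone would give them.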
