Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View

The Transformer architecture is widely used in natural language processing. Despite its success, the design principle of the Transformer remains elusive. In this paper, we provide a novel perspective towards understanding the architecture: we show that the Transformer can be mathematically interpreted as a numerical Ordinary Differential Equation (ODE) solver for a convection-diffusion equation in a multi-particle dynamic system. In particular, how words in a sentence are abstracted into contexts by passing through the layers of the Transformer can be interpreted as approximating the movement of multiple particles in space using the Lie-Trotter splitting scheme and Euler's method. Given this ODE perspective, the rich literature of numerical analysis can guide us in designing effective structures beyond the Transformer. As an example, we propose to replace the Lie-Trotter splitting scheme with the Strang-Marchuk splitting scheme, which is more commonly used and has a much lower local truncation error. The Strang-Marchuk splitting scheme suggests that the self-attention and position-wise feed-forward network (FFN) sub-layers should not be treated equally: in each layer, two position-wise FFN sub-layers should be used, with the self-attention sub-layer placed in between. This leads to a brand new architecture. Such an FFN-attention-FFN layer is "Macaron-like", and we therefore call the network with this new architecture the Macaron Net. Through extensive experiments, we show that the Macaron Net is superior to the Transformer on both supervised and unsupervised learning tasks. The reproducible codes and pretrained models can be found at this https URL.
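To make the correspondence concrete, the following sketch uses our own notation rather than anything taken verbatim from the paper: the split ODE is written as dx/dt = F(x) + G(x), where F informally plays the role of the self-attention update, G that of the position-wise FFN update, x_l is the state entering layer l, and γ is the step size. Taking one Euler step per sub-problem under the Lie-Trotter splitting reproduces the attention-then-FFN order of a standard Transformer layer, while the Strang-Marchuk splitting yields the FFN-attention-FFN order:

\begin{align*}
  % Lie--Trotter splitting, one Euler step per sub-problem
  % (local truncation error O(\gamma^2)):
  % attention sub-layer followed by FFN sub-layer, as in the Transformer.
  \tilde{x}_l &= x_l + \gamma\, F(x_l), \\
  x_{l+1}     &= \tilde{x}_l + \gamma\, G(\tilde{x}_l). \\[4pt]
  % Strang--Marchuk splitting (local truncation error O(\gamma^3)):
  % a half step of G, a full step of F, then another half step of G,
  % i.e. FFN -> attention -> FFN, the Macaron-like layer.
  \hat{x}_l   &= x_l + \tfrac{\gamma}{2}\, G(x_l), \\
  \check{x}_l &= \hat{x}_l + \gamma\, F(\hat{x}_l), \\
  x_{l+1}     &= \check{x}_l + \tfrac{\gamma}{2}\, G(\check{x}_l).
\end{align*}

As a further illustration, here is a minimal PyTorch-style sketch of one Macaron-like encoder layer, assuming standard residual sub-layers with layer normalization. The class name MacaronLayer, the post-norm placement, the 1/2 scaling of the two FFN residual branches, and all hyper-parameter values are our own assumptions for illustration; they are not the authors' released implementation (which is linked from the paper).

import torch
import torch.nn as nn

class MacaronLayer(nn.Module):
    """FFN -> self-attention -> FFN layer, a sketch of the Macaron-like design.

    Each FFN residual branch is scaled by 1/2, mirroring the two half steps
    of the Strang-Marchuk splitting scheme; layer-norm placement and all
    hyper-parameters are illustrative assumptions only.
    """

    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.ffn1 = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.ffn2 = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        # First half-step FFN sub-layer (residual branch scaled by 1/2).
        x = self.norm1(x + 0.5 * self.dropout(self.ffn1(x)))
        # Full-step self-attention sub-layer.
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm2(x + self.dropout(a))
        # Second half-step FFN sub-layer.
        x = self.norm3(x + 0.5 * self.dropout(self.ffn2(x)))
        return x

# Usage: a batch of 2 sequences, length 10, model width 512.
layer = MacaronLayer()
out = layer(torch.randn(2, 10, 512))
print(out.shape)  # torch.Size([2, 10, 512])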
