We use neural ordinary differential equations to formulate a variant of the Transformer that is depth-adaptive in the sense that the ODE solver takes an input-dependent number of time steps. Our goal in proposing the N-ODE Transformer is to investigate whether this depth-adaptivity can help overcome known theoretical limitations of the standard Transformer in handling nonlocal effects. Specifically, we consider the simple problem of determining the parity of a binary sequence, which the standard Transformer is known to solve only with a sufficiently large number of layers or attention heads. We find, however, that the depth-adaptivity of the N-ODE Transformer does not remedy the inherently nonlocal nature of the parity problem, and we explain why this is so. We then regularize the N-ODE Transformer by penalizing the arclength of its ODE trajectories, but find that this fails to improve the accuracy or efficiency of the N-ODE Transformer on the challenging parity problem. Finally, we suggest avenues for future research on modifications and extensions of the N-ODE Transformer that may lead to improved accuracy and efficiency on sequence modelling tasks such as neural machine translation.
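To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of a continuous-depth Transformer block, assuming PyTorch and the torchdiffeq package. A Transformer-style update is used as the ODE dynamics, an adaptive solver (dopri5) makes the number of function evaluations, and hence the effective depth, input-dependent, and a discrete arclength estimate along the trajectory stands in for the regularizer discussed above; all module names, sizes, and the penalty weight are illustrative assumptions.

```python
# Sketch of an N-ODE Transformer block (hypothetical; assumes torchdiffeq is installed).
import torch
import torch.nn as nn
from torchdiffeq import odeint


class TransformerODEFunc(nn.Module):
    """dh/dt = f(t, h): a self-attention + feed-forward update used as ODE dynamics."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.nfe = 0  # number of function evaluations: a proxy for the adaptive "depth"

    def forward(self, t, h):
        self.nfe += 1
        a, _ = self.attn(self.norm1(h), self.norm1(h), self.norm1(h))
        return a + self.ff(self.norm2(h))


class NODETransformerBlock(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.func = TransformerODEFunc(d_model)
        # Request intermediate states so an arclength penalty can be estimated.
        self.register_buffer("t_grid", torch.linspace(0.0, 1.0, 9))

    def forward(self, h0):
        # Adaptive-step integration: the solver's internal step count depends on h0.
        traj = odeint(self.func, h0, self.t_grid, method="dopri5", rtol=1e-3, atol=1e-4)
        # Discrete estimate of trajectory arclength, used as a regularization term.
        arclength = (traj[1:] - traj[:-1]).flatten(2).norm(dim=-1).sum()
        return traj[-1], arclength


# Toy usage on the parity task: embed a binary sequence, integrate, classify.
x = torch.randint(0, 2, (8, 16))                 # 8 binary sequences of length 16
embed = nn.Embedding(2, 64)
block = NODETransformerBlock(d_model=64)
h, arclength = block(embed(x))
logits = nn.Linear(64, 2)(h.mean(dim=1))         # sequence-level parity prediction
loss = nn.CrossEntropyLoss()(logits, x.sum(dim=1) % 2) + 1e-2 * arclength
print("function evaluations (adaptive depth):", block.func.nfe)
```

The input-dependent cost shows up in `nfe`: easier inputs let the adaptive solver take fewer, larger steps, while harder inputs force more evaluations, which is the mechanism the paper probes on the parity problem.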