We use neural ordinary differential equations to formulate a variant of the Transformer that is depth-adaptive in the sense that the ODE solver takes an input-dependent number of time steps. Our goal in proposing the N-ODE Transformer is to investigate whether this depth-adaptivity can help overcome known theoretical limitations of the standard Transformer in handling nonlocal effects. Specifically, we consider the simple problem of determining the parity of a binary sequence, which the standard Transformer is known to solve only with a sufficiently large number of layers or attention heads. We find, however, that the depth-adaptivity of the N-ODE Transformer does not remedy the inherently nonlocal nature of the parity problem, and we explain why this is so. We then regularize the N-ODE Transformer by penalizing the arclength of its ODE trajectories, but find that this fails to improve the accuracy or efficiency of the N-ODE Transformer on the challenging parity problem. Finally, we suggest avenues for future research on modifications and extensions of the N-ODE Transformer that may lead to improved accuracy and efficiency on sequence modelling tasks such as neural machine translation.
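To make the idea concrete, the following is a minimal sketch (not the authors' implementation) of a continuous-depth Transformer block, assuming PyTorch and the torchdiffeq package. A Transformer-style update is used as the ODE dynamics, an adaptive solver (dopri5) makes the number of function evaluations, and hence the effective depth, input-dependent, and a discrete arclength estimate along the trajectory stands in for the regularizer discussed above; all module names, sizes, and the penalty weight are illustrative assumptions.

```python
# Sketch of an N-ODE Transformer block (hypothetical; assumes torchdiffeq is installed).
import torch
import torch.nn as nn
from torchdiffeq import odeint


class TransformerODEFunc(nn.Module):
    """dh/dt = f(t, h): a self-attention + feed-forward update used as ODE dynamics."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.nfe = 0  # number of function evaluations: a proxy for the adaptive "depth"

    def forward(self, t, h):
        self.nfe += 1
        a, _ = self.attn(self.norm1(h), self.norm1(h), self.norm1(h))
        return a + self.ff(self.norm2(h))


class NODETransformerBlock(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.func = TransformerODEFunc(d_model)
        # Request intermediate states so an arclength penalty can be estimated.
        self.register_buffer("t_grid", torch.linspace(0.0, 1.0, 9))

    def forward(self, h0):
        # Adaptive-step integration: the solver's internal step count depends on h0.
        traj = odeint(self.func, h0, self.t_grid, method="dopri5", rtol=1e-3, atol=1e-4)
        # Discrete estimate of trajectory arclength, used as a regularization term.
        arclength = (traj[1:] - traj[:-1]).flatten(2).norm(dim=-1).sum()
        return traj[-1], arclength


# Toy usage on the parity task: embed a binary sequence, integrate, classify.
x = torch.randint(0, 2, (8, 16))                 # 8 binary sequences of length 16
embed = nn.Embedding(2, 64)
block = NODETransformerBlock(d_model=64)
h, arclength = block(embed(x))
logits = nn.Linear(64, 2)(h.mean(dim=1))         # sequence-level parity prediction
loss = nn.CrossEntropyLoss()(logits, x.sum(dim=1) % 2) + 1e-2 * arclength
print("function evaluations (adaptive depth):", block.func.nfe)
```

The input-dependent cost shows up in `nfe`: easier inputs let the adaptive solver take fewer, larger steps, while harder inputs force more evaluations, which is the mechanism the paper probes on the parity problem.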