论文信息 - GLU Variants Improve Transformer

GLU Variants Improve Transformer

Gated Linear Units (arXiv:1612.08083) consist of the component-wise product of two linear projections, one of which is first passed through a sigmoid function. Variations on GLU are possible, using different nonlinear (or even linear) functions in place of sigmoid. We test these variants in the feed-forward sublayers of the Transformer (arXiv:1706.03762) sequence-to-sequence model, and find that some of them yield quality improvements over the typically-used ReLU or GELU activations.

Noam Shazeer | Noam M. Shazeer

[1] Kevin Gimpel,et al. Bridging Nonlinearities and Stochastic Regularizers with Gaussian Error Linear Units , 2016, ArXiv.

[2] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[3] Omer Levy,et al. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems , 2019, NeurIPS.

[4] Quoc V. Le,et al. Swish: a Self-Gated Activation Function , 2017, 1710.05941.

[5] Yoshua Bengio,et al. Deep Sparse Rectifier Neural Networks , 2011, AISTATS.

[6] Jian Zhang,et al. SQuAD: 100,000+ Questions for Machine Comprehension of Text , 2016, EMNLP.

[7] Colin Raffel,et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , 2019, J. Mach. Learn. Res..

[8] Noam Shazeer,et al. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost , 2018, ICML.

[9] Quoc V. Le,et al. Searching for Activation Functions , 2018, arXiv.

[10] Yann Dauphin,et al. Language Modeling with Gated Convolutional Networks , 2016, ICML.

[11] Omer Levy,et al. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding , 2018, BlackboxNLP@EMNLP.

[12] Geoffrey E. Hinton,et al. Three new graphical models for statistical language modelling , 2007, ICML '07.