Pay Less Attention with Lightweight and Dynamic Convolutions

Self-attention is a useful mechanism to build generative models for language and images. It determines the importance of context elements by comparing each element to the current time step. In this paper, we show that a very lightweight convolution can perform competitively with the best reported self-attention results. Next, we introduce dynamic convolutions, which are simpler and more efficient than self-attention. We predict separate convolution kernels based solely on the current time step in order to determine the importance of context elements. The number of operations required by this approach scales linearly in the input length, whereas self-attention is quadratic. Experiments on large-scale machine translation, language modeling and abstractive summarization show that dynamic convolutions improve over strong self-attention models. On the WMT'14 English-German test set, dynamic convolutions achieve a new state of the art of 29.7 BLEU.
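
To make the kernel-prediction idea concrete, here is a minimal PyTorch sketch of a dynamic convolution layer; it is not the authors' released implementation, and the class and attribute names (DynamicConv1d, kernel_proj) are illustrative assumptions. A per-head kernel of width K is predicted from the current time step alone, softmax-normalized over the kernel width, and applied as a depthwise convolution over a causal window, so the cost grows linearly with the sequence length rather than quadratically.

```python
# Minimal sketch of a dynamic convolution (hypothetical names, not the
# authors' released code). Input x has shape (batch, time, channels); the
# channels are split into `heads` groups that share kernel weights.
import torch
import torch.nn.functional as F


class DynamicConv1d(torch.nn.Module):
    def __init__(self, channels, heads, kernel_size):
        super().__init__()
        assert channels % heads == 0
        self.heads = heads
        self.kernel_size = kernel_size
        # Predict one kernel of width `kernel_size` per head from x_t alone.
        self.kernel_proj = torch.nn.Linear(channels, heads * kernel_size)

    def forward(self, x):
        B, T, C = x.shape
        H, K = self.heads, self.kernel_size
        D = C // H
        # Kernels depend solely on the current time step, not on comparisons
        # against every other position (unlike self-attention).
        weights = self.kernel_proj(x).view(B, T, H, K)
        weights = F.softmax(weights, dim=-1)        # normalize over kernel width
        # Causal left-padding so each position sees the K-1 previous elements.
        x_pad = F.pad(x, (0, 0, K - 1, 0))          # (B, T + K - 1, C)
        windows = x_pad.unfold(1, K, 1)             # (B, T, C, K)
        windows = windows.reshape(B, T, H, D, K)    # split channels into heads
        # Weighted sum over each window; weights are shared across the D
        # channels of a head. Total work is O(T * K * C): linear in T.
        out = torch.einsum('bthk,bthdk->bthd', weights, windows)
        return out.reshape(B, T, C)
```

For example, DynamicConv1d(channels=512, heads=16, kernel_size=7) maps a (batch, time, 512) tensor to an output of the same shape, with each position's kernel shared across the 32 channels of its head.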
