A Mixture of h - 1 Heads is Better than h Heads

Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized: attention heads can be pruned without significant performance loss. In this work, we instead "reallocate" them; the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixtures of experts, we propose the mixture of attentive experts (MAE) model. MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. In particular, on the WMT14 English-to-German translation dataset, MAE improves over "transformer-base" by 0.8 BLEU with a comparable number of parameters. Our analysis shows that MAE learns to specialize different experts to different inputs.
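
The abstract describes the alternating training procedure only at a high level. Below is a minimal sketch of one way such a block coordinate descent scheme can be written, assuming a toy gated mixture of feed-forward experts in PyTorch; the names SimpleMoE and train_step, the gating network, and the responsibility formula are illustrative assumptions rather than the authors' implementation. Step (1) recomputes the experts' responsibilities with the parameters held fixed; step (2) updates the parameters with the responsibilities held fixed.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    # Toy mixture of experts: a gate assigns each input a distribution over
    # experts; each expert is a small feed-forward classifier standing in for
    # an attention-head subset (illustrative only).
    def __init__(self, d_model=32, n_experts=4, n_classes=10):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                          nn.Linear(d_model, n_classes))
            for _ in range(n_experts)
        ])

    def expert_losses(self, x, y):
        # Per-example, per-expert cross-entropy; shape (batch, n_experts).
        losses = [F.cross_entropy(e(x), y, reduction="none") for e in self.experts]
        return torch.stack(losses, dim=-1)

    def forward(self, x):
        # Mixture prediction: gate-weighted average of the experts' outputs.
        weights = F.softmax(self.gate(x), dim=-1)                    # (batch, n_experts)
        outputs = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, n_classes, n_experts)
        return (outputs * weights.unsqueeze(1)).sum(dim=-1)

def train_step(model, optimizer, x, y):
    # Step (1): responsibilities. With parameters frozen, assign each input a
    # soft responsibility over experts (experts with lower loss get more mass).
    with torch.no_grad():
        resp = F.softmax(-model.expert_losses(x, y), dim=-1)         # (batch, n_experts)

    # Step (2): parameters. With responsibilities fixed, update the experts on
    # the responsibility-weighted loss and fit the gate to the responsibilities.
    expert_loss = (resp * model.expert_losses(x, y)).sum(dim=-1).mean()
    gate_loss = F.kl_div(F.log_softmax(model.gate(x), dim=-1), resp,
                         reduction="batchmean")
    optimizer.zero_grad()
    (expert_loss + gate_loss).backward()
    optimizer.step()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = SimpleMoE()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    for _ in range(5):
        train_step(model, optimizer, x, y)
    print("mixture loss:", F.cross_entropy(model(x), y).item())

In this sketch the responsibility step has a closed form (a softmax over negative per-expert losses), so only the parameter step requires gradient updates; the actual MAE objective and gating mechanism may differ.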
