Paper Information - Analyzing and Controlling Inter-Head Diversity in Multi-Head Attention

Analyzing and Controlling Inter-Head Diversity in Multi-Head Attention

Multi-head attention, a powerful strategy for the Transformer, is assumed to utilize information from diverse representation subspaces. However, measuring the diversity between heads' representations, or exploiting that diversity, has rarely been studied. In this paper, we quantitatively analyze the inter-head diversity of multi-head attention by applying two recently developed similarity measures between deep representations: Singular Vector Canonical Correlation Analysis (SVCCA) and Centered Kernel Alignment (CKA). By doing so, we empirically show that multi-head attention does diversify the representation subspaces of its heads as the number of heads increases. Based on our analysis, we hypothesize that there exists an optimal inter-head diversity at which a model achieves better performance. To examine our hypothesis, we closely inspect three techniques for controlling inter-head diversity: (1) a Hilbert-Schmidt Independence Criterion (HSIC) regularizer among representation subspaces, (2) an orthogonality regularizer, and (3) DropHead, which randomly zeroes out each head at every training step. In our experiments on various machine translation and language modeling tasks, we show that controlling inter-head diversity yields the best performance compared with the baselines.
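To make the similarity measurement concrete, the following is a minimal sketch of linear-kernel CKA between the outputs of two attention heads. It assumes each head's representation is a matrix of shape (tokens, head dimension). The function name `linear_cka`, the random stand-in data, and the choice of the linear kernel are illustrative assumptions, not details taken from the abstract, and the paper's actual measurement setup may differ.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two representation matrices (hypothetical helper).

    x: array of shape (n_examples, d1); y: array of shape (n_examples, d2).
    Returns a similarity in [0, 1]; higher means more similar representations.
    """
    # Center each feature column so the measure ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance (unnormalized linear HSIC).
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    # Normalizers: Frobenius norms of each representation's own covariance.
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)


# Stand-in data (assumption): 256 tokens, two heads of dimension 64.
rng = np.random.default_rng(0)
head_a = rng.normal(size=(256, 64))
head_b = rng.normal(size=(256, 64))

# A random orthogonal matrix, used to rotate head_a's feature space.
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))

cka_self = linear_cka(head_a, head_a)         # identical representations
cka_rotated = linear_cka(head_a, head_a @ q)  # same subspace, rotated basis
cka_random = linear_cka(head_a, head_b)       # unrelated representations
```

Under this sketch, a head compared with itself or with a rotated copy of itself scores close to 1, while two unrelated heads score lower. Averaging such pairwise scores over all head pairs in a layer would be one possible way to summarize inter-head diversity, with lower average similarity indicating more diverse heads.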
