Understanding Multi-Head Attention in Abstractive Summarization

Attention mechanisms in deep learning architectures are often treated as a source of transparency, a way to shed light on the inner workings of a model. Recently, there has been growing interest in whether this assumption actually holds. In this paper we investigate the interpretability of multi-head attention in abstractive summarization, a sequence-to-sequence task in which attention does not play the intuitive alignment role it has in machine translation. We first introduce three metrics to gain insight into the focus of attention heads and observe that these heads specialize towards relative positions, specific part-of-speech tags, and named entities. However, we also find that ablating and pruning these heads does not lead to a significant drop in performance, indicating redundancy. By replacing the softmax activation functions with sparsemax activation functions, we find that the attention heads appear more transparent: fewer heads can be ablated, and heads score higher on our interpretability metrics. However, if we apply pruning to the sparsemax model, we find that even more heads can be pruned, raising the question of whether enforced sparsity actually improves transparency. Finally, we find that relative-position heads seem integral to summarization performance and persistently remain after pruning.
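The abstract contrasts softmax with sparsemax attention activations. The sketch below is a minimal NumPy illustration, based on the standard sparsemax formulation of Martins and Astudillo (2016) rather than code from the paper: sparsemax is the Euclidean projection of the score vector onto the probability simplex, so it can assign exactly zero weight to low-scoring positions, which is what makes the resulting attention distributions sparse.

```python
import numpy as np

def sparsemax(z):
    """Sparsemax (Martins & Astudillo, 2016): Euclidean projection of the
    score vector z onto the probability simplex. Unlike softmax, it can
    assign exactly zero probability to low-scoring positions."""
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                # scores in descending order
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    in_support = 1.0 + k * z_sorted > cumsum   # sorted positions kept in the support
    k_z = k[in_support][-1]                    # support size
    tau = (cumsum[k_z - 1] - 1.0) / k_z        # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.5, 0.1, -1.0])
print(sparsemax(scores))  # [0.75, 0.25, 0.0, 0.0] -- exact zeros on weak positions
print(softmax(scores))    # every position keeps non-zero probability mass
```

Swapping this function in place of softmax inside each attention head is the kind of change the abstract refers to: the output still sums to one, but attention mass concentrates on a small support, which is why the sparsemax heads look more interpretable on focus-based metrics.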
