Human Interpretation and Exploitation of Self-attention Patterns in Transformers: A Case Study in Extractive Summarization

The multi-head self-attention mechanism of the transformer has been thoroughly investigated in recent years. On one hand, researchers seek to understand why and how transformers work; on the other, they propose new attention-augmentation methods to make transformers more accurate, efficient, and interpretable. In this paper, we synergize these two lines of research in a human-in-the-loop pipeline that first identifies important task-specific attention patterns. These patterns are then injected not only into the original model but also into smaller models, as a human-guided knowledge-distillation process. We demonstrate the benefits of this pipeline in a case study on extractive summarization. After identifying three meaningful attention patterns in the popular BERTSum model, our experiments show that injecting these patterns improves the performance, and arguably the interpretability, of both the original and the smaller models.
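Since the abstract does not spell out the injection mechanism, the following is only a minimal PyTorch sketch of the general idea: replacing the learned weights of one self-attention head with a fixed, human-interpretable pattern (here, a hypothetical "attend to nearby tokens" band pattern). The function names `local_window_pattern` and `attention_with_injected_head` are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: inject a fixed attention pattern into one head of
# multi-head self-attention. Names and the choice of a local-window
# pattern are assumptions for illustration only.
import torch
import torch.nn.functional as F


def local_window_pattern(seq_len: int, window: int = 2) -> torch.Tensor:
    """Hypothetical 'attend to nearby tokens' pattern: a band matrix
    whose rows are normalized to sum to 1."""
    idx = torch.arange(seq_len)
    band = (idx[None, :] - idx[:, None]).abs() <= window
    pattern = band.float()
    return pattern / pattern.sum(dim=-1, keepdim=True)


def attention_with_injected_head(q, k, v, pattern, injected_head=0):
    """Scaled dot-product attention over all heads, except that the
    attention weights of `injected_head` are replaced by `pattern`.

    q, k, v: (batch, heads, seq_len, d_head)
    pattern: (seq_len, seq_len), rows sum to 1
    """
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    weights = F.softmax(scores, dim=-1)      # (B, H, L, L) learned weights
    weights = weights.clone()
    weights[:, injected_head] = pattern      # overwrite one head's weights
    return weights @ v                       # (B, H, L, d_head)


if __name__ == "__main__":
    B, H, L, D = 2, 8, 16, 64
    q, k, v = (torch.randn(B, H, L, D) for _ in range(3))
    out = attention_with_injected_head(q, k, v, local_window_pattern(L))
    print(out.shape)  # torch.Size([2, 8, 16, 64])
```

In this sketch the injected head is fixed at inference time; in a human-guided distillation setting one could instead use such patterns as supervision targets for a smaller student model's attention heads.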
