Human Interpretation and Exploitation of Self-attention Patterns in Transformers: A Case Study in Extractive Summarization

The multi-head self-attention mechanism of the transformer has been thoroughly investigated in recent years. On one hand, researchers seek to understand why and how transformers work; on the other, they propose new attention-augmentation methods to make transformers more accurate, efficient, and interpretable. In this paper, we synergize these two lines of research in a human-in-the-loop pipeline that first identifies important task-specific attention patterns. These patterns are then injected not only into the original model but also into smaller models, as a human-guided knowledge-distillation process. We demonstrate the benefits of this pipeline in a case study on extractive summarization. After identifying three meaningful attention patterns in the popular BERTSum model, our experiments show that injecting these patterns improves the performance, and arguably the interpretability, of both the original and the smaller models.
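Since the abstract does not spell out the injection mechanism, the following is only a minimal PyTorch sketch of the general idea: replacing the learned weights of one self-attention head with a fixed, human-interpretable pattern (here, a hypothetical "attend to nearby tokens" band pattern). The function names `local_window_pattern` and `attention_with_injected_head` are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: inject a fixed attention pattern into one head of
# multi-head self-attention. Names and the choice of a local-window
# pattern are assumptions for illustration only.
import torch
import torch.nn.functional as F


def local_window_pattern(seq_len: int, window: int = 2) -> torch.Tensor:
    """Hypothetical 'attend to nearby tokens' pattern: a band matrix
    whose rows are normalized to sum to 1."""
    idx = torch.arange(seq_len)
    band = (idx[None, :] - idx[:, None]).abs() <= window
    pattern = band.float()
    return pattern / pattern.sum(dim=-1, keepdim=True)


def attention_with_injected_head(q, k, v, pattern, injected_head=0):
    """Scaled dot-product attention over all heads, except that the
    attention weights of `injected_head` are replaced by `pattern`.

    q, k, v: (batch, heads, seq_len, d_head)
    pattern: (seq_len, seq_len), rows sum to 1
    """
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5
    weights = F.softmax(scores, dim=-1)      # (B, H, L, L) learned weights
    weights = weights.clone()
    weights[:, injected_head] = pattern      # overwrite one head's weights
    return weights @ v                       # (B, H, L, d_head)


if __name__ == "__main__":
    B, H, L, D = 2, 8, 16, 64
    q, k, v = (torch.randn(B, H, L, D) for _ in range(3))
    out = attention_with_injected_head(q, k, v, local_window_pattern(L))
    print(out.shape)  # torch.Size([2, 8, 16, 64])
```

In this sketch the injected head is fixed at inference time; in a human-guided distillation setting one could instead use such patterns as supervision targets for a smaller student model's attention heads.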
