Multi-Head Self-Attention with Role-Guided Masks

The state of the art in learning meaningful semantic representations of words is the Transformer model and its attention mechanisms. Simply put, the attention mechanisms learn to attend to specific parts of the input while dispensing with recurrence and convolutions. While some of the learned attention heads have been found to play linguistically interpretable roles, they can be redundant or prone to errors. We propose a method to guide the attention heads towards roles identified in prior work as important. We do this by defining role-specific masks that constrain each head to attend to specific parts of the input, so that different heads are designed to play different roles. Experiments on text classification and machine translation using seven different datasets show that our method outperforms competitive attention-based, CNN, and RNN baselines.
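To make the idea of role-guided masking concrete, below is a minimal sketch (not the authors' released code) of multi-head self-attention in which each head is constrained by its own role-specific binary mask applied to the attention scores before the softmax. The particular example roles used here (a local-window mask and a forward-looking mask) and the helper `example_role_masks` are illustrative assumptions; the paper defines its own set of roles derived from prior analyses of attention heads.

```python
# Sketch of role-guided multi-head self-attention: each head h receives a binary
# mask M_h over token pairs; blocked positions are set to -inf before softmax.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoleGuidedSelfAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, role_masks: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        # role_masks: (num_heads, seq_len, seq_len); 1 = may attend, 0 = blocked
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, num_heads, seq_len, d_head)
        q, k, v = (t.view(b, n, self.num_heads, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        # Apply each head's role mask: blocked positions get -inf before softmax
        scores = scores.masked_fill(role_masks.unsqueeze(0) == 0, float("-inf"))
        attn = F.softmax(scores, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, n, -1))


def example_role_masks(num_heads: int, seq_len: int, window: int = 2) -> torch.Tensor:
    """Illustrative role masks: even heads attend to a local window around each
    token, odd heads attend only to the current and later tokens."""
    idx = torch.arange(seq_len)
    local = (idx[None, :] - idx[:, None]).abs() <= window   # local-context role
    forward = idx[None, :] >= idx[:, None]                   # forward-looking role
    masks = [local if h % 2 == 0 else forward for h in range(num_heads)]
    return torch.stack(masks).float()
```

For example, `RoleGuidedSelfAttention(d_model=512, num_heads=8)` applied with `example_role_masks(8, seq_len)` yields heads that are structurally prevented from attending outside their assigned role, which is the core mechanism the abstract describes; every mask here keeps the diagonal unblocked so that the softmax is always well defined.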
