Rethinking Self-Attention: An Interpretable Self-Attentive Encoder-Decoder Parser

Attention mechanisms have improved the performance of NLP tasks while allowing models to remain explainable. Self-attention is now widely used, but its many attention distributions make it difficult to interpret. Recent work has shown that model representations can benefit from label-specific information, which also facilitates the interpretation of predictions. We introduce the Label Attention Layer: a new form of self-attention in which attention heads represent labels. We evaluate our layer on constituency and dependency parsing and show that the resulting model achieves new state-of-the-art results for both tasks on both the Penn Treebank (PTB) and the Chinese Treebank. In addition, our model requires fewer self-attention layers, and therefore fewer parameters, than existing work. Finally, we find that the Label Attention heads learn relations between syntactic categories, and we show pathways to analyze errors.
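To make the core idea concrete, below is a minimal sketch of one way attention heads can be tied to labels: each head holds a single learned query vector (one per label) that attends over the token representations, so every head produces a label-specific view of the sentence. This is an illustration only, not the authors' implementation; the projection sizes, the use of one query vector per label, and how the label-specific outputs are later combined per token are assumptions not specified in the abstract.

```python
# Illustrative sketch of a "label attention" layer: one learned query per
# label attends over encoder outputs. Shapes and projections are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelAttention(nn.Module):
    def __init__(self, num_labels: int, d_model: int, d_head: int):
        super().__init__()
        # One learned query vector per label, replacing token-dependent queries.
        self.label_queries = nn.Parameter(torch.randn(num_labels, d_head))
        self.key_proj = nn.Linear(d_model, d_head)
        self.value_proj = nn.Linear(d_model, d_head)
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) token representations from the encoder.
        keys = self.key_proj(x)      # (batch, seq_len, d_head)
        values = self.value_proj(x)  # (batch, seq_len, d_head)
        # Each label query attends over the tokens: (batch, num_labels, seq_len).
        scores = torch.einsum("lh,bsh->bls", self.label_queries, keys) * self.scale
        attn = F.softmax(scores, dim=-1)
        # Label-specific context vectors: (batch, num_labels, d_head).
        return torch.einsum("bls,bsh->blh", attn, values)


if __name__ == "__main__":
    layer = LabelAttention(num_labels=10, d_model=64, d_head=16)
    out = layer(torch.randn(2, 7, 64))
    print(out.shape)  # torch.Size([2, 10, 16])
```

Because each head corresponds to a single label, its attention distribution over the words can be inspected directly, which is the interpretability benefit the abstract highlights.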
