How Self-Attention Improves Rare Class Performance in a Question-Answering Dialogue Agent

Contextualized language modeling using deep Transformer networks has been applied to a variety of natural language processing tasks with remarkable success. However, we find that these models are not a panacea for a question-answering dialogue agent corpus task, which has hundreds of classes in a long-tailed frequency distribution, with only thousands of data points. Instead, we find substantial improvements in recall and accuracy on rare classes from a simple one-layer RNN with multi-headed self-attention and static word embeddings as inputs. While much research has used attention weights to illustrate what input is important for a task, the complexities of our dialogue corpus offer a unique opportunity to examine how the model represents what it attends to, and we offer a detailed analysis of how that contributes to improved performance on rare classes. A particularly interesting phenomenon we observe is that the model picks up implicit meanings by splitting different aspects of the semantics of a single word across multiple attention heads.
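To make the architecture described above concrete, the sketch below shows one plausible reading of it: frozen static word embeddings feeding a one-layer recurrent encoder whose hidden states are pooled by multi-headed self-attention before a softmax over a large label set. This is not the authors' code; the choice of a bidirectional GRU, the Lin et al. (2017)-style attention parameterization, and all sizes (vocabulary, embedding and hidden dimensions, number of heads, number of classes) are illustrative assumptions.

```python
# Minimal sketch (assumed hyperparameters, not the paper's implementation) of a
# one-layer RNN classifier with multi-headed self-attention over static embeddings.
import torch
import torch.nn as nn


class RNNSelfAttentionClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256,
                 num_heads=8, num_classes=300):
        super().__init__()
        # Static (frozen) word embeddings, e.g. pretrained GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.requires_grad = False
        # One-layer bidirectional GRU encoder.
        self.rnn = nn.GRU(embed_dim, hidden_dim, num_layers=1,
                          batch_first=True, bidirectional=True)
        # Structured self-attention: each head produces its own weighting
        # over the RNN hidden states.
        self.attn_scores = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_heads),
        )
        self.classifier = nn.Linear(num_heads * 2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(token_ids)        # (B, T, E)
        hidden, _ = self.rnn(embedded)              # (B, T, 2H)
        scores = self.attn_scores(hidden)           # (B, T, heads)
        weights = torch.softmax(scores, dim=1)      # normalize over time per head
        # Per-head weighted sum of hidden states: (B, heads, 2H)
        pooled = torch.einsum("bth,btd->bhd", weights, hidden)
        return self.classifier(pooled.flatten(1))   # (B, num_classes)


# Example usage with random token ids.
model = RNNSelfAttentionClassifier()
dummy_batch = torch.randint(0, 10000, (4, 20))
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 300])
```

Keeping each head's pooled vector separate before the final linear layer is what allows different heads to capture different aspects of a word's meaning, the behavior the abstract highlights for rare classes.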
