How Self-Attention Improves Rare Class Performance in a Question-Answering Dialogue Agent

Contextualized language modeling using deep Transformer networks has been applied to a variety of natural language processing tasks with remarkable success. However, we find that these models are not a panacea for a question-answering dialogue agent corpus task, which has hundreds of classes in a long-tailed frequency distribution, with only thousands of data points. Instead, we find substantial improvements in recall and accuracy on rare classes from a simple one-layer RNN with multi-headed self-attention and static word embeddings as inputs. While much research has used attention weights to illustrate what input is important for a task, the complexities of our dialogue corpus offer a unique opportunity to examine how the model represents what it attends to, and we offer a detailed analysis of how that contributes to improved performance on rare classes. A particularly interesting phenomenon we observe is that the model picks up implicit meanings by splitting different aspects of the semantics of a single word across multiple attention heads.
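To make the architecture described above concrete, the sketch below shows one plausible reading of it: frozen static word embeddings feeding a one-layer recurrent encoder whose hidden states are pooled by multi-headed self-attention before a softmax over a large label set. This is not the authors' code; the choice of a bidirectional GRU, the Lin et al. (2017)-style attention parameterization, and all sizes (vocabulary, embedding and hidden dimensions, number of heads, number of classes) are illustrative assumptions.

```python
# Minimal sketch (assumed hyperparameters, not the paper's implementation) of a
# one-layer RNN classifier with multi-headed self-attention over static embeddings.
import torch
import torch.nn as nn


class RNNSelfAttentionClassifier(nn.Module):
    def __init__(self, vocab_size=10000, embed_dim=300, hidden_dim=256,
                 num_heads=8, num_classes=300):
        super().__init__()
        # Static (frozen) word embeddings, e.g. pretrained GloVe vectors.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.embedding.weight.requires_grad = False
        # One-layer bidirectional GRU encoder.
        self.rnn = nn.GRU(embed_dim, hidden_dim, num_layers=1,
                          batch_first=True, bidirectional=True)
        # Structured self-attention: each head produces its own weighting
        # over the RNN hidden states.
        self.attn_scores = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_heads),
        )
        self.classifier = nn.Linear(num_heads * 2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embedding(token_ids)        # (B, T, E)
        hidden, _ = self.rnn(embedded)              # (B, T, 2H)
        scores = self.attn_scores(hidden)           # (B, T, heads)
        weights = torch.softmax(scores, dim=1)      # normalize over time per head
        # Per-head weighted sum of hidden states: (B, heads, 2H)
        pooled = torch.einsum("bth,btd->bhd", weights, hidden)
        return self.classifier(pooled.flatten(1))   # (B, num_classes)


# Example usage with random token ids.
model = RNNSelfAttentionClassifier()
dummy_batch = torch.randint(0, 10000, (4, 20))
logits = model(dummy_batch)
print(logits.shape)  # torch.Size([4, 300])
```

Keeping each head's pooled vector separate before the final linear layer is what allows different heads to capture different aspects of a word's meaning, the behavior the abstract highlights for rare classes.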
