How Does BERT Answer Questions?: A Layer-Wise Analysis of Transformer Representations

Bidirectional Encoder Representations from Transformers (BERT) reach state-of-the-art results in a variety of Natural Language Processing tasks. However, our understanding of their internal functioning is still insufficient. In order to better understand BERT and other Transformer-based models, we present a layer-wise analysis of BERT's hidden states. Unlike previous research, which mainly focuses on explaining Transformer models through their attention weights, we argue that hidden states contain equally valuable information. Specifically, our analysis focuses on models fine-tuned on the task of Question Answering (QA) as an example of a complex downstream task. We inspect how QA models transform token vectors in order to find the correct answer. To this end, we apply a set of general and QA-specific probing tasks that reveal the information stored in each representation layer. Our qualitative analysis of hidden state visualizations provides additional insights into BERT's reasoning process. Our results show that the transformations within BERT go through phases that are related to traditional pipeline tasks. The system can therefore implicitly incorporate task-specific information into its token representations. Furthermore, our analysis reveals that fine-tuning has little impact on the models' semantic abilities and that prediction errors can be recognized in the vector representations of even the early layers.
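
The layer-wise probing setup described above can be illustrated with a minimal sketch (not the authors' code): extract the hidden states of every layer from a BERT model fine-tuned on question answering and fit a simple linear probe per layer on a token-labeling task. The Hugging Face checkpoint name, the toy question/context pair, and the probing labels below are illustrative assumptions.

```python
# Minimal sketch of layer-wise probing of a fine-tuned BERT QA model
# (illustrative only; checkpoint, example, and labels are assumptions).
# Requires `torch`, `transformers`, and `scikit-learn`.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

MODEL = "bert-large-uncased-whole-word-masking-finetuned-squad"  # assumed SQuAD checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForQuestionAnswering.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

question = "Who wrote Hamlet?"
context = "Hamlet is a tragedy written by William Shakespeare around 1600."
inputs = tokenizer(question, context, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Tuple containing the embedding output plus one tensor per Transformer layer,
# each of shape (batch_size, sequence_length, hidden_size).
hidden_states = outputs.hidden_states

# Toy probing labels derived from the model's own predicted answer span:
# 1 for tokens inside the span, 0 otherwise. Real probing tasks use annotated
# corpora (e.g., NER, coreference, SQuAD gold spans) with held-out test data.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
start = outputs.start_logits.argmax().item()
end = outputs.end_logits.argmax().item()
labels = [1 if start <= i <= end else 0 for i in range(len(tokens))]

# Fit one linear probe per layer on the token representations of this example.
for layer_idx, layer in enumerate(hidden_states):
    X = layer[0].numpy()  # (sequence_length, hidden_size)
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    print(f"layer {layer_idx:2d}: training accuracy {probe.score(X, labels):.2f}")
```

Fitting and scoring a probe on a single example only shows the mechanics; the probing tasks in the paper are trained and evaluated per layer on full annotated datasets.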
