Visualizing and Measuring the Geometry of BERT

Transformer architectures show significant promise for natural language processing. Given that a single pretrained model can be fine-tuned to perform well on many different tasks, these networks appear to extract generally useful linguistic features. A natural question is how such networks represent this information internally. This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.
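To make the word-sense analysis concrete, here is a minimal sketch (not the authors' code; the model name, layer index, target word, and example sentences are assumptions) of how contextual embeddings for a polysemous word can be pulled out of BERT and projected to 2-D with UMAP, where occurrences of different senses tend to fall into separate clusters.

```python
# A hedged sketch of the word-sense visualization idea: embed several
# occurrences of "bank", one per sentence, and project them with UMAP.
import numpy as np
import torch
import umap  # pip install umap-learn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = [
    "He sat on the river bank and watched the water.",
    "She deposited the check at the bank on Monday.",
    "The bank approved the loan application.",
    "Wild flowers grew along the bank of the stream.",
]

embeddings = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Which layer separates senses best is an empirical question; layer 8 is
    # an arbitrary middle-layer choice here.
    hidden = outputs.hidden_states[8][0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")            # position of the target word
    embeddings.append(hidden[idx].numpy())

# Project the contextual embeddings to 2-D; nearby points suggest shared sense.
proj = umap.UMAP(n_neighbors=3, min_dist=0.1).fit_transform(np.array(embeddings))
print(proj)
```

With more sentences per sense, the 2-D projection can be scatter-plotted and colored by annotated sense labels to inspect how cleanly the senses separate.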
