Visualizing and Measuring the Geometry of BERT

Transformer architectures show significant promise for natural language processing. Given that a single pretrained model can be fine-tuned to perform well on many different tasks, these networks appear to extract generally useful linguistic features. A natural question is how such networks represent this information internally. This paper describes qualitative and quantitative investigations of one particularly effective model, BERT. At a high level, linguistic features seem to be represented in separate semantic and syntactic subspaces. We find evidence of a fine-grained geometric representation of word senses. We also present empirical descriptions of syntactic representations in both attention matrices and individual word embeddings, as well as a mathematical argument to explain the geometry of these representations.
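To make the word-sense analysis concrete, here is a minimal sketch (not the authors' code; the model name, layer index, target word, and example sentences are assumptions) of how contextual embeddings for a polysemous word can be pulled out of BERT and projected to 2-D with UMAP, where occurrences of different senses tend to fall into separate clusters.

```python
# A hedged sketch of the word-sense visualization idea: embed several
# occurrences of "bank", one per sentence, and project them with UMAP.
import numpy as np
import torch
import umap  # pip install umap-learn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = [
    "He sat on the river bank and watched the water.",
    "She deposited the check at the bank on Monday.",
    "The bank approved the loan application.",
    "Wild flowers grew along the bank of the stream.",
]

embeddings = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Which layer separates senses best is an empirical question; layer 8 is
    # an arbitrary middle-layer choice here.
    hidden = outputs.hidden_states[8][0]  # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    idx = tokens.index("bank")            # position of the target word
    embeddings.append(hidden[idx].numpy())

# Project the contextual embeddings to 2-D; nearby points suggest shared sense.
proj = umap.UMAP(n_neighbors=3, min_dist=0.1).fit_transform(np.array(embeddings))
print(proj)
```

With more sentences per sense, the 2-D projection can be scatter-plotted and colored by annotated sense labels to inspect how cleanly the senses separate.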
