How does BERT capture semantics? A closer look at polysemous words

The recent paradigm shift to contextual word embeddings has seen tremendous success across a wide range of downstream tasks. However, little is known about how the emergent relation between context and semantics manifests geometrically. We investigate polysemous words as one particularly prominent instance of semantic organization. Our rigorous quantitative analysis of linear separability and cluster organization in embedding vectors produced by BERT shows that semantics do not surface as isolated clusters but form seamless structures, tightly coupled with sentiment and syntax.
