IsoScore: Measuring the Uniformity of Embedding Space Utilization

The recent success of distributed word representations has led to increased interest in analyzing the spatial properties of these embeddings. Several studies have suggested that contextualized word embedding models do not isotropically project tokens into vector space. However, current methods designed to measure isotropy, such as average random cosine similarity and the partition score, have not been thoroughly analyzed and are not appropriate measures of isotropy. We propose IsoScore: a novel tool that quantifies the degree to which a point cloud uniformly utilizes the ambient vector space. Using rigorously designed tests, we demonstrate that IsoScore is the only tool available in the literature that accurately measures how uniformly variance is distributed across dimensions in vector space. Additionally, we use IsoScore to challenge a number of recent conclusions in the NLP literature that were derived using brittle metrics of isotropy. We caution future studies against using existing tools to measure isotropy in contextualized embedding spaces, as the resulting conclusions will be misleading or altogether inaccurate.
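The abstract names one of the metrics under scrutiny (average random cosine similarity) and the core idea behind IsoScore: score how evenly variance spreads across the dimensions of the space rather than how similar random pairs of vectors look. The following is a minimal Python sketch contrasting the two approaches; the function names and the entropy-based uniformity formula are illustrative assumptions, not the paper's published IsoScore algorithm.

```python
import numpy as np

def avg_random_cosine_similarity(points: np.ndarray, n_pairs: int = 10_000,
                                 seed: int = 0) -> float:
    """Average cosine similarity over randomly sampled pairs of points.

    Values near 0 are commonly read as "isotropic"; the paper argues
    this reading is unreliable.
    """
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(points), size=n_pairs)
    j = rng.integers(0, len(points), size=n_pairs)
    a, b = points[i], points[j]
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return float(np.mean(cos))

def variance_uniformity(points: np.ndarray) -> float:
    """Hypothetical variance-based isotropy proxy, in (0, 1].

    1.0 means every principal axis carries equal variance (isotropy);
    values near 0 mean variance collapses onto a few directions.
    """
    centered = points - points.mean(axis=0)
    # Eigenvalues of the covariance matrix = variances along principal axes.
    cov = np.cov(centered, rowvar=False)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eigvals / eigvals.sum()          # normalized variance profile
    # Normalized entropy of the variance profile: 1 iff uniform.
    entropy = -np.sum(p * np.log(p + 1e-12))
    return float(entropy / np.log(len(p)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    isotropic = rng.normal(size=(5000, 64))
    # Anisotropic: variance concentrated in the first four dimensions.
    anisotropic = isotropic * np.array([10.0] * 4 + [0.1] * 60)
    for name, X in [("isotropic", isotropic), ("anisotropic", anisotropic)]:
        print(name,
              round(avg_random_cosine_similarity(X), 3),
              round(variance_uniformity(X), 3))
```

On this toy data, the average random cosine similarity stays near zero for both clouds, since a zero-mean point cloud yields near-zero cosine similarities regardless of how skewed its variance profile is, while the variance-based score separates the two cleanly. This is one failure mode consistent with the abstract's claim that cosine-based proxies are not appropriate measures of isotropy.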
