Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions

Neural language models (LMs) such as GPT-2 estimate the probability distribution over the next word with a softmax over the vocabulary. The softmax layer produces this distribution from the dot products between a single hidden state and the embeddings of all words in the vocabulary. However, we discover that this single hidden state cannot produce all probability distributions, regardless of the LM size or training data size, because a single hidden state embedding cannot be close to the embeddings of all possible next words simultaneously when other interfering word embeddings lie between them. In this work, we demonstrate the importance of this limitation both theoretically and practically. Our work not only deepens our understanding of the softmax bottleneck and mixture of softmax (MoS) but also inspires us to propose multi-facet softmax (MFS) to address the limitations of MoS. Extensive empirical analyses confirm our findings and show that, compared with MoS, the proposed MFS achieves two-fold improvements in the perplexity of GPT-2 and BERT.
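For intuition, below is a minimal PyTorch sketch of the geometric limitation described in the abstract and of how a mixture-of-softmax-style head avoids it. The toy vocabulary, embedding size, and facet construction are illustrative assumptions for exposition, not the paper's actual setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 8                                      # illustrative embedding size
w1, w2 = torch.randn(d), torch.randn(d)    # two equally plausible next words
w3 = (w1 + w2) / 2                         # an "interfering" word between them
E = torch.stack([w1, w2, w3])              # tiny 3-word output vocabulary

def single_softmax(h):
    """Standard LM head: softmax over dot products of ONE hidden state."""
    return F.softmax(E @ h, dim=-1)

def mixture_of_softmax(facets, weights):
    """MoS-style head: each facet (its own hidden state) gets a softmax;
    the final distribution is their weighted average."""
    per_facet = F.softmax(facets @ E.T, dim=-1)       # (n_facets, 3)
    return F.softmax(weights, dim=-1) @ per_facet     # (3,)

# Since h.w3 = (h.w1 + h.w2) / 2 for every h, w3's logit is never below both
# w1's and w2's logits, so the bimodal target {w1: 0.5, w2: 0.5, w3: ~0}
# is unreachable with one softmax ...
for _ in range(100):
    p = single_softmax(torch.randn(d))
    assert p[2] >= torch.minimum(p[0], p[1])

# ... but easily reachable with two facets pointing at w1 and w2.
facets = torch.stack([10 * w1, 10 * w2])
print(mixture_of_softmax(facets, torch.zeros(2)))  # approx. [0.5, 0.5, 0.0]
```

The midpoint argument is the key step: because softmax is monotone in the logits, the interfering word can never be ranked below both intended words, no matter how large the model or its training data.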
