Softmax Bottleneck Makes Language Models Unable to Represent Multi-mode Word Distributions