Truncation Sampling as Language Model Desmoothing

Long samples of text from neural language models can be of poor quality. Truncation sampling algorithms, like top-$p$ or top-$k$, address this by setting some words' probabilities to zero at each step. This work provides a framing for the aim of truncation and an improved algorithm for that aim. We propose thinking of a neural language model as a mixture of a true distribution and a smoothing distribution that avoids infinite perplexity. In this light, truncation algorithms aim to perform desmoothing, estimating a subset of the support of the true distribution. Finding a good subset is crucial: we show that top-$p$ unnecessarily truncates high-probability words, for example causing it to truncate all words but "Trump" for a document that starts with "Donald". We introduce $\eta$-sampling, which truncates words below an entropy-dependent probability threshold. Compared to previous algorithms, $\eta$-sampling generates more plausible long English documents according to human raters, is better at breaking out of repetition, and behaves more reasonably on a battery of test distributions.
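To make the idea of an entropy-dependent truncation threshold concrete, here is a minimal sketch of one decoding step. The abstract only states that words below an entropy-dependent probability threshold are truncated; the specific threshold form $\min(\epsilon, \sqrt{\epsilon}\,e^{-H})$, the function name, and the default value of `epsilon` are assumptions for illustration, not necessarily the exact rule from the paper.

```python
# A minimal sketch of entropy-dependent truncation in the spirit of eta-sampling.
# Assumed threshold: eta = min(epsilon, sqrt(epsilon) * exp(-entropy)).
import torch


def eta_sample_step(logits: torch.Tensor, epsilon: float = 3e-4) -> int:
    """Sample one token id, truncating words below an entropy-dependent threshold."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum()
    # Entropy-dependent cutoff: stricter when the model is confident (low entropy),
    # more permissive when the distribution is flat (high entropy).
    eta = min(epsilon, epsilon ** 0.5 * torch.exp(-entropy).item())
    allowed = probs >= eta
    if not allowed.any():
        # Always keep at least the most probable word so the step is well-defined.
        allowed[probs.argmax()] = True
    truncated = torch.where(allowed, probs, torch.zeros_like(probs))
    truncated = truncated / truncated.sum()  # renormalize over the kept support
    return torch.multinomial(truncated, num_samples=1).item()
```

Under the desmoothing view, zeroing out words whose probability falls below this threshold is an attempt to strip away the smoothing mass and sample only from an estimate of the true distribution's support; `epsilon` here is an illustrative hyperparameter controlling how aggressive that estimate is.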
