Residual Energy-Based Models for Text

Current large-scale auto-regressive language models (Radford et al., 2019; Liu et al., 2018; Graves, 2013) display impressive fluency and can generate convincing text. In this work we start by asking the question: Can the generations of these models be reliably distinguished from real text by statistical discriminators? We find experimentally that the answer is affirmative when we have access to the training data for the model, and guardedly affirmative even if we do not. This suggests that the auto-regressive models can be improved by incorporating the (globally normalized) discriminators into the generative process. We give a formalism for this using the Energy-Based Model framework, and show that it indeed improves the results of the generative models, measured both in terms of perplexity and in terms of human evaluation.
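
To make the formalism concrete, here is one standard way to write such a residual energy-based model (a sketch consistent with the abstract, not quoted from the paper; the notation P_LM for the pretrained auto-regressive model, E_θ for the learned energy/discriminator, and x for a full text sequence is assumed here). The locally normalized language model is reweighted by a globally normalized energy term, and the energy can be trained with a noise-contrastive, real-vs-generated objective that uses the language model itself as the noise distribution:

    P_\theta(x) \propto P_{LM}(x)\,\exp\bigl(-E_\theta(x)\bigr)

    \max_\theta \;\; \mathbb{E}_{x^+ \sim P_{\text{data}}}\bigl[\log \sigma\bigl(-E_\theta(x^+)\bigr)\bigr] \;+\; \mathbb{E}_{x^- \sim P_{LM}}\bigl[\log \sigma\bigl(E_\theta(x^-)\bigr)\bigr]

where σ is the logistic sigmoid. Under this view, sequences the discriminator flags as machine-generated receive high energy and are down-weighted relative to the base model's probability, which is how the globally normalized discriminator enters the generative process.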

[1] Song-Chun Zhu, et al. Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Ali Farhadi, et al. Defending Against Neural Fake News, 2019, NeurIPS.

[3] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[4] Geoffrey E. Hinton, et al. Modeling Natural Images Using Gated MRFs, 2013, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[5] Geoffrey E. Hinton. Training Products of Experts by Minimizing Contrastive Divergence, 2002, Neural Computation.

[6] Paul A. Viola, et al. Robust Real-time Object Detection, 2001.

[7] Richard Socher, et al. Pointer Sentinel Mixture Models, 2016, ICLR.

[8] Omer Levy, et al. RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019, ArXiv.

[9] Percy Liang, et al. Unifying Human and Statistical Evaluation for Natural Language Generation, 2019, NAACL.

[10] Yoshua Bengio, et al. Generative Adversarial Nets, 2014, NIPS.

[11] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[12] Lantao Yu, et al. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient, 2016, AAAI.

[13] Alex Graves, et al. Generating Sequences With Recurrent Neural Networks, 2013, ArXiv.

[14] Sanja Fidler, et al. Skip-Thought Vectors, 2015, NIPS.

[15] Ronald Rosenfeld, et al. Whole-sentence exponential language models: a vehicle for linguistic-statistical integration, 2001, Comput. Speech Lang.

[16] Marc'Aurelio Ranzato, et al. A Unified Energy-Based Framework for Unsupervised Learning, 2007, AISTATS.

[17] Anoop Sarkar, et al. Discriminative Reranking for Machine Translation, 2004, NAACL.

[18] Mohit Iyyer, et al. Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models, 2020, ACL.

[19] Lav R. Varshney, et al. CTRL: A Conditional Transformer Language Model for Controllable Generation, 2019, ArXiv.

[20] Aapo Hyvärinen, et al. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models, 2010, AISTATS.

[21] Marc'Aurelio Ranzato, et al. Sequence Level Training with Recurrent Neural Networks, 2015, ICLR.

[22] Shakir Mohamed, et al. Training language GANs from Scratch, 2019, NeurIPS.

[23] Jürgen Schmidhuber, et al. Framewise phoneme classification with bidirectional LSTM and other neural network architectures, 2005, Neural Networks.

[24] D. Horvitz, et al. A Generalization of Sampling Without Replacement from a Finite Universe, 1952.

[25] Joelle Pineau, et al. Language GANs Falling Short, 2018, ICLR.

[26] Brian Roark, et al. Discriminative n-gram language modeling, 2007, Comput. Speech Lang.

[27] Frank Hutter, et al. SGDR: Stochastic Gradient Descent with Warm Restarts, 2016, ICLR.

[28] Alexei Baevski, et al. Adaptive Input Representations for Neural Language Modeling, 2018, ICLR.

[29] Luca Antiga, et al. Automatic differentiation in PyTorch, 2017.

[30] Chris Callison-Burch, et al. Human and Automatic Detection of Generated Text, 2019, ArXiv.

[31] Rico Sennrich, et al. Neural Machine Translation of Rare Words with Subword Units, 2015, ACL.

[32] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks, 2014, NIPS.

[33] Song-Chun Zhu, et al. Learning Descriptor Networks for 3D Shape Synthesis and Analysis, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34] Sylvain Lamprier, et al. ColdGANs: Taming Language GANs with Cautious Sampling Strategies, 2020, NeurIPS.

[35] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[36] Jean-Marc Andreoli, et al. Global Autoregressive Models for Data-Efficient Sequence Learning, 2019, CoNLL.

[37] Yejin Choi, et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.

[38] Ilya Sutskever, et al. Language Models are Unsupervised Multitask Learners, 2019.

[39] Yang Lu, et al. A Theory of Generative ConvNet, 2016, ICML.

[40] Myle Ott, et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling, 2019, NAACL.

[41] Jason Weston, et al. Neural Text Generation with Unlikelihood Training, 2019, ICLR.

[42] Bin Wang, et al. Trans-dimensional Random Fields for Language Modeling, 2015, ACL.

[43] Sanja Fidler, et al. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books, 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[44] Kuldip K. Paliwal, et al. Bidirectional recurrent neural networks, 1997, IEEE Trans. Signal Process.

[45] Samy Bengio, et al. Generating Sentences from a Continuous Space, 2015, CoNLL.

[46] Razvan Pascanu, et al. On the difficulty of training recurrent neural networks, 2012, ICML.

[47] Marc'Aurelio Ranzato, et al. Classical Structured Prediction Losses for Sequence to Sequence Learning, 2017, NAACL.

[48] Mark Chen, et al. Language Models are Few-Shot Learners, 2020, NeurIPS.

[49] Sebastian Nowozin, et al. Debiasing Evidence Approximations: On Importance-weighted Autoencoders and Jackknife Variational Inference, 2018, ICLR.

[50] Fu Jie Huang, et al. A Tutorial on Energy-Based Learning, 2006.

[51] J. J. Hopfield, et al. Neural networks and physical systems with emergent collective computational abilities, 1982, Proceedings of the National Academy of Sciences of the United States of America.

[52] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[53] Igor Mordatch, et al. Implicit Generation and Generalization with Energy Based Models, 2018.

[54] Yang Lu, et al. Learning Generative ConvNets via Multi-grid Modeling and Sampling, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[55] Alexander M. Rush, et al. GLTR: Statistical Detection and Visualization of Generated Text, 2019, ACL.

[56] Zhuang Ma, et al. Noise Contrastive Estimation and Negative Sampling for Conditional Models: Consistency and Statistical Efficiency, 2018, EMNLP.

[57] Trevor Darrell, et al. Discriminator Rejection Sampling, 2018, ICLR.

[58] Bin Wang, et al. Learning Neural Trans-Dimensional Random Field Language Models with Noise-Contrastive Estimation, 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[59] Daphne Ippolito, et al. Trading Off Diversity and Quality in Natural Language Generation, 2020, HUMEVAL.

[60] Alexander M. Rush, et al. Adversarially Regularized Autoencoders, 2017, ICML.

[61] Eric Horvitz, et al. Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting, 2019, DGS@ICLR.

[62] Yoav Freund, et al. A decision-theoretic generalization of on-line learning and an application to boosting, 1995, EuroCOLT.

[63] Bin Wang, et al. Learning Trans-Dimensional Random Fields with Applications to Language Modeling, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64] Yee Whye Teh, et al. Energy-Based Models for Sparse Overcomplete Representations, 2003, J. Mach. Learn. Res.

[65] Lukasz Kaiser, et al. Generating Wikipedia by Summarizing Long Sequences, 2018, ICLR.

[66] Song-Chun Zhu, et al. Learning Energy-Based Spatial-Temporal Generative ConvNets for Dynamic Patterns, 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[67] Yann Dauphin, et al. Language Modeling with Gated Convolutional Networks, 2016, ICML.

[68] Miguel Á. Carreira-Perpiñán, et al. On Contrastive Divergence Learning, 2005, AISTATS.

[69] Bin Wang, et al. Language modeling with neural trans-dimensional random fields, 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[70] Graham Neubig, et al. Lagging Inference Networks and Posterior Collapse in Variational Autoencoders, 2019, ICLR.

[71] Bin Wang, et al. Improved Training Of Neural Trans-Dimensional Random Field Language Models with Dynamic Noise-Contrastive Estimation, 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[72] Yann Dauphin, et al. Hierarchical Neural Story Generation, 2018, ACL.

[73] Erik Nijkamp, et al. Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model, 2019, NeurIPS.