Generalization in Generation: A closer look at Exposure Bias

Exposure bias refers to the train-test discrepancy that seemingly arises when an autoregressive generative model uses only ground-truth contexts at training time but generated ones at test time. We separate the contributions of the learning framework and the model architecture to clarify the debate about its consequences, and we review proposed countermeasures. In this light, we argue that generalization is the underlying property to address and propose unconditional generation as its fundamental benchmark. Finally, we combine latent variable modeling with a recent formulation of exploration in reinforcement learning to obtain a rigorous handling of true and generated contexts. Results on language modeling and variational sentence auto-encoding confirm the model's generalization capability.
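
To make the train-test discrepancy concrete, the sketch below contrasts the two conditioning regimes for an autoregressive decoder: teacher forcing, where every context is a ground-truth prefix, and free-running generation, where every context is the model's own previous output. This is a minimal illustration under assumed names and dimensions (ToyDecoder, vocab_size, hidden), not the model proposed in the paper.

```python
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    """A tiny autoregressive decoder used only to illustrate the two regimes."""
    def __init__(self, vocab_size=100, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def step(self, token, h):
        h = self.rnn(self.embed(token), h)
        return self.out(h), h

def teacher_forced_loss(model, tokens):
    """Training-time regime: condition each prediction on the gold prefix."""
    h = torch.zeros(tokens.size(0), model.hidden)
    loss = 0.0
    for t in range(tokens.size(1) - 1):
        logits, h = model.step(tokens[:, t], h)          # ground-truth context
        loss = loss + nn.functional.cross_entropy(logits, tokens[:, t + 1])
    return loss / (tokens.size(1) - 1)

def free_running_sample(model, bos, steps=10):
    """Test-time regime: condition each prediction on the model's own output."""
    h = torch.zeros(bos.size(0), model.hidden)
    token, generated = bos, []
    for _ in range(steps):
        logits, h = model.step(token, h)                 # generated context
        token = torch.distributions.Categorical(logits=logits).sample()
        generated.append(token)
    return torch.stack(generated, dim=1)

model = ToyDecoder()
gold = torch.randint(0, 100, (4, 12))                    # a batch of gold sequences
print(teacher_forced_loss(model, gold).item())           # loss under gold contexts
print(free_running_sample(model, gold[:, 0]))            # samples under own contexts
```

The gap between these two conditioning distributions is what exposure bias names; whether it harms generation in practice is, as the abstract notes, a matter of how well the model generalizes to contexts it produced itself.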
