Parallel Scheduled Sampling

Auto-regressive models are widely used in sequence generation problems. The output sequence is typically generated in a predetermined order, one discrete unit (pixel, word, or character) at a time. The models are usually trained by teacher forcing, where the ground-truth history is fed to the model as input; at test time, this history is replaced by the model's own predictions. Scheduled Sampling aims to mitigate this discrepancy between training and test time by randomly replacing some discrete units in the history with the model's predictions. While teacher-forced training works well with ML accelerators because the computation can be parallelized across time, Scheduled Sampling involves undesirable sequential processing. In this paper, we introduce a simple technique to parallelize Scheduled Sampling across time. Experimentally, we find the proposed technique leads to equivalent or better performance than teacher-forced training on image generation, summarization, dialog generation, and translation. On the dialog response generation task, Parallel Scheduled Sampling achieves a 1.6 BLEU score (11.5%) improvement over teacher forcing, while in image generation it achieves 20% and 13.8% improvements in Fréchet Inception Distance (FID) and Inception Score (IS), respectively. Further, we discuss the effects of different hyper-parameters associated with Scheduled Sampling on model performance.
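
The sketch below illustrates the core idea described in the abstract, not the authors' exact procedure: run the decoder once in a parallel, teacher-forced pass to obtain predictions for every position, mix those predictions with the gold tokens position-by-position, and then compute the training loss on the mixed history in a second parallel pass. The function name `parallel_scheduled_sampling_step`, the `mix_prob` parameter, and the single-pass mixing schedule are illustrative assumptions.

```python
# Minimal sketch of parallel scheduled sampling (assumptions noted above).
import numpy as np

def parallel_scheduled_sampling_step(model, gold, mix_prob, rng):
    """Build a mixed decoder history in parallel for one training step.

    model    -- callable mapping a full input sequence (np.ndarray of token
                ids) to predicted token ids for every position, computed in
                one parallel (teacher-forced) pass.
    gold     -- ground-truth token ids, shape (seq_len,).
    mix_prob -- probability of replacing a gold token with a model prediction.
    rng      -- np.random.Generator used for the per-position coin flips.
    """
    # Pass 1: teacher-forced pass on the gold history; predictions for all
    # positions are produced at once, so there is no sequential loop over time.
    preds = model(gold)

    # Mix: each position independently keeps the gold token or takes the
    # model's prediction for that position.
    use_pred = rng.random(gold.shape) < mix_prob
    mixed = np.where(use_pred, preds, gold)

    # Pass 2 (not shown): feed `mixed` as the decoder input and compute the
    # standard teacher-forced cross-entropy against the gold targets.
    return mixed

# Toy usage with a stand-in "model" that just shifts its input by one step.
rng = np.random.default_rng(0)
gold = np.array([3, 7, 1, 4, 9])
dummy_model = lambda x: np.roll(x, 1)
print(parallel_scheduled_sampling_step(dummy_model, gold, mix_prob=0.5, rng=rng))
```

Unlike the original Scheduled Sampling, which must decode token by token because each prediction depends on previously sampled tokens, both passes here are full-sequence passes of the kind accelerators already run for teacher forcing, which is what makes the procedure parallelizable across time.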
