Sentence-wise Smooth Regularization for Sequence to Sequence Learning

Maximum-likelihood estimation (MLE) is widely used to train models for sequence-to-sequence tasks. It uniformly treats the generation/prediction of each target token as multiclass classification and yields non-smooth prediction probabilities: within a target sequence, some tokens are predicted with small probabilities while others are predicted with large probabilities. Our empirical study shows that this non-smoothness of the probabilities leads to low-quality generated sequences. In this paper, we propose a sentence-wise regularization method that encourages smooth prediction probabilities across all the tokens in the target sequence. The method automatically adjusts the weight and gradient of each token in a sentence so that the predictions across the sequence are uniformly good. Experiments on three neural machine translation tasks and one text summarization task show that our method outperforms the conventional MLE loss on all of them and achieves promising BLEU scores on the WMT14 English-German and WMT17 Chinese-English translation tasks.
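
The abstract does not spell out the exact form of the regularizer, so the following is only a minimal PyTorch sketch of the idea, assuming the smoothness term penalizes the within-sentence variance of per-token log-probabilities; the names `sentence_smooth_loss`, `alpha`, and `pad_id` are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def sentence_smooth_loss(logits, targets, pad_id, alpha=1.0):
    """MLE loss plus a sentence-wise smoothness penalty (illustrative sketch).

    logits:  (batch, seq_len, vocab) decoder outputs
    targets: (batch, seq_len) gold token ids
    alpha:   hypothetical weight of the smoothness term
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability assigned to the gold token at each position.
    tok_logp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    mask = (targets != pad_id).float()
    n_tok = mask.sum(dim=1).clamp(min=1.0)

    # Standard MLE term: mean negative log-likelihood per sentence.
    nll = -(tok_logp * mask).sum(dim=1) / n_tok

    # Assumed smoothness term: variance of per-token log-probabilities
    # within each sentence; a large variance means some tokens are
    # predicted far more confidently than others.
    mean_logp = (tok_logp * mask).sum(dim=1, keepdim=True) / n_tok.unsqueeze(1)
    var = (((tok_logp - mean_logp) ** 2) * mask).sum(dim=1) / n_tok

    return (nll + alpha * var).mean()
```

Under this assumption, `alpha` trades off likelihood against smoothness: `alpha=0` recovers plain MLE, while larger values push the model toward uniformly confident predictions across the sentence.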
