Sequence-to-Sequence Learning as Beam-Search Optimization

Sequence-to-sequence (seq2seq) modeling has rapidly become an important general-purpose NLP tool that has proven effective for many text-generation and sequence-labeling tasks. Seq2seq builds on deep neural language modeling and inherits its remarkable accuracy in estimating local, next-word distributions. In this work, we introduce a model and beam-search training scheme, based on the work of Daumé III and Marcu (2005), that extends seq2seq to learn global sequence scores. This structured approach avoids classical biases associated with local training and unifies the training loss with the test-time usage, while preserving the proven model architecture of seq2seq and its efficient training approach. We show that our system outperforms a highly-optimized attention-based seq2seq system and other baselines on three different sequence-to-sequence tasks: word ordering, parsing, and machine translation.
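To make the training scheme concrete, below is a minimal, hypothetical sketch of beam-search training with a margin loss, in the spirit of the approach the abstract describes. The scorer `score_fn`, the toy bilinear model `toy_score`, the vocabulary size, and the beam size are all illustrative assumptions rather than the paper's actual architecture; the violation-and-reset step loosely follows the LaSO-style search optimization of Daumé III and Marcu (2005) that the paper builds on.

```python
# Hypothetical sketch of beam-search optimization (BSO)-style training:
# run beam search during training and accumulate a hinge penalty whenever
# the gold prefix fails to outscore the beam's last (k-th) hypothesis by a
# margin. Everything here is a simplified stand-in, not the paper's model.
import numpy as np

def bso_margin_loss(score_fn, gold, vocab_size, beam_size=2, margin=1.0):
    beam = [((), 0.0)]  # list of (prefix, cumulative score) hypotheses
    loss = 0.0
    for t in range(len(gold)):
        # Expand every beam hypothesis with every vocabulary item.
        cands = [(prefix + (w,), s + score_fn(prefix, w))
                 for prefix, s in beam for w in range(vocab_size)]
        cands.sort(key=lambda c: -c[1])
        beam = cands[:beam_size]

        # Cumulative score of the gold prefix y_{1:t+1}.
        gold_prefix = tuple(gold[: t + 1])
        gold_score = sum(score_fn(tuple(gold[:i]), gold[i])
                         for i in range(t + 1))

        kth_score = beam[-1][1]
        if gold_score < kth_score + margin:
            # Margin violation: accumulate a hinge loss, then restart the
            # search from the gold prefix (a LaSO-style reset).
            loss += margin + kth_score - gold_score
            beam = [(gold_prefix, gold_score)]
    return loss

# Toy usage with a random bilinear scorer over a 5-word vocabulary.
rng = np.random.default_rng(0)
E = rng.normal(size=(5, 4))  # hypothetical "embedding" table

def toy_score(prefix, w):
    ctx = E[prefix[-1]] if prefix else np.zeros(4)
    return float(ctx @ E[w])

print(bso_margin_loss(toy_score, gold=[1, 3, 2, 4], vocab_size=5))
```

In the actual system, `score_fn` would be the attention-based seq2seq decoder scoring the next word given the encoded source and the prefix, and the accumulated margin loss would be backpropagated through the network, so that training directly optimizes the scores beam search uses at test time.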

[1] Alexander M. Rush, et al. Word Ordering Without Syntax. EMNLP, 2016.

[2] Slav Petrov, et al. Globally Normalized Transition-Based Neural Networks. ACL, 2016.

[3] Jürgen Schmidhuber, et al. Long Short-Term Memory. Neural Computation, 1997.

[4] Geoffrey E. Hinton, et al. Grammar as a Foreign Language. NIPS, 2014.

[5] Yue Zhang, et al. Transition-Based Syntactic Linearization. NAACL, 2015.

[6] Stephen Clark, et al. Syntax-Based Grammaticality Improvement using CCG and Guided Search. EMNLP, 2011.

[7] Yoshua Bengio, et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML, 2015.

[8] John Langford, et al. Efficient programmable learning to search. arXiv, 2014.

[9] Andrew McCallum, et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML, 2001.

[10] Yue Zhang, et al. A Neural Probabilistic Structured-Prediction Model for Transition-Based Dependency Parsing. ACL, 2015.

[11] Christopher D. Manning, et al. Effective Approaches to Attention-based Neural Machine Translation. EMNLP, 2015.

[12] Wojciech Zaremba, et al. Recurrent Neural Network Regularization. arXiv, 2014.

[13] Joelle Pineau, et al. Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models. AAAI, 2015.

[14] Samy Bengio, et al. Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks. NIPS, 2015.

[15] Lukasz Kaiser, et al. Sentence Compression by Deletion with LSTMs. EMNLP, 2015.

[16] John Langford, et al. Search-based structured prediction. Machine Learning, 2009.

[17] Daniel Marcu, et al. Learning as search optimization: approximate large margin methods for structured prediction. ICML, 2005.

[18] Colin Cherry, et al. A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU. WMT@ACL, 2014.

[19] Stephen Clark, et al. Discriminative Syntax-Based Word Ordering for Text Generation. Computational Linguistics, 2015.

[20] James Henderson, et al. Incremental Recurrent Neural Network Dependency Parser with Search-based Discriminative Training. CoNLL, 2015.

[21] Jeffrey Dean, et al. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 2013.

[22] Yang Liu, et al. Minimum Risk Training for Neural Machine Translation. ACL, 2015.

[23] Yang Guo, et al. Structured Perceptron with Inexact Search. NAACL, 2012.

[24] Yoshua Bengio, et al. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR, 2014.

[25] Danqi Chen, et al. A Fast and Accurate Dependency Parser using Neural Networks. EMNLP, 2014.

[26] Taro Watanabe, et al. Transition-based Neural Constituent Parsing. ACL, 2015.

[27] Georg Heigold, et al. Sequence discriminative distributed training of long short-term memory recurrent neural networks. INTERSPEECH, 2014.

[28] Geoffrey J. Gordon, et al. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. AISTATS, 2010.

[29] Marc'Aurelio Ranzato, et al. Sequence Level Training with Recurrent Neural Networks. ICLR, 2015.

[30] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

[31] Joelle Pineau, et al. An Actor-Critic Algorithm for Sequence Prediction. ICLR, 2016.

[32] Marcello Federico, et al. Report on the 10th IWSLT evaluation campaign. IWSLT, 2013.

[33] Hermann Ney, et al. Sequence-discriminative training of recurrent neural networks. ICASSP, 2015.

[34] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation. ACL, 2002.

[35] Geoffrey E. Hinton, et al. Generating Text with Recurrent Neural Networks. ICML, 2011.

[36] Christopher Kermorvant, et al. Dropout Improves Recurrent Neural Networks for Handwriting Recognition. ICFHR, 2014.

[37] Quoc V. Le, et al. Sequence to Sequence Learning with Neural Networks. NIPS, 2014.

[38] Yoshua Bengio, et al. On the Properties of Neural Machine Translation: Encoder–Decoder Approaches. SSST@EMNLP, 2014.

[39] Brian Kingsbury, et al. Lattice-based optimization of sequence classification criteria for neural-network acoustic modeling. ICASSP, 2009.

[40] Jonas Kuhn, et al. Learning Structured Perceptrons for Coreference Resolution with Latent Antecedents and Non-local Features. ACL, 2014.

[41] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 2011.

[42] Brian Roark, et al. Incremental Parsing with the Perceptron Algorithm. ACL, 2004.

[43] Trevor Darrell, et al. Sequence to Sequence – Video to Text. ICCV, 2015.