Decoder Integration and Expected BLEU Training for Recurrent Neural Network Language Models

Neural network language models are often trained by optimizing likelihood, but we would prefer to optimize for a task-specific metric, such as BLEU in machine translation. We show how a recurrent neural network language model can be optimized towards an expected BLEU loss instead of the usual cross-entropy criterion. Furthermore, we tackle the issue of directly integrating a recurrent network into first-pass decoding under an efficient approximation. Our best results improve a phrase-based statistical machine translation system trained on WMT 2012 French-English data by up to 2.0 BLEU, and the expected BLEU objective improves over a cross-entropy trained model by up to 0.6 BLEU in a single-reference setup.
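
For illustration, a minimal sketch of an expected BLEU objective of this kind (the notation below is illustrative and not taken verbatim from the paper): given a source sentence $f$ with reference $e^{*}$ and an n-best list $\mathcal{E}(f)$ of candidate translations, one can maximize the model-expected sentence-level BLEU

\[
\mathrm{xBLEU}(f) \;=\; \sum_{e \in \mathcal{E}(f)} \mathrm{sBLEU}(e, e^{*})\, p(e \mid f),
\qquad
p(e \mid f) \;=\; \frac{\exp\!\big(\gamma\, s(e, f)\big)}{\sum_{e' \in \mathcal{E}(f)} \exp\!\big(\gamma\, s(e', f)\big)},
\]

where $s(e, f)$ is the translation model score including the recurrent language model, $\gamma$ is an assumed scaling constant controlling the sharpness of the distribution, and $\mathrm{sBLEU}$ is a smoothed sentence-level BLEU approximation. Training then maximizes $\sum_f \mathrm{xBLEU}(f)$ rather than minimizing the per-word cross-entropy $-\sum_i \log p(w_i \mid w_{<i})$.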
