Max-Violation Perceptron and Forced Decoding for Scalable MT Training

While large-scale discriminative training has triumphed in many NLP problems, definitive success on machine translation has remained largely elusive. Most recent efforts along this line are not scalable (they train on the small dev set, with features drawn from only the 100 most frequent words) and are overly complicated. We instead present a very simple yet theoretically motivated approach that extends the recent framework of “violation-fixing perceptron” and uses forced decoding to compute the target derivations. Extensive phrase-based translation experiments on both Chinese-to-English and Spanish-to-English tasks show substantial BLEU gains of up to +2.3/+2.0 on dev/test over MERT, thanks to 20M+ sparse features. This is the first successful effort at large-scale online discriminative training for MT.
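To make the idea of a max-violation update concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that, for every step of beam search, we already have two sparse feature vectors: one for the best-scoring reference-reachable prefix (as would be obtained from forced decoding) and one for the best prefix in the beam. The function names, the dict-based feature representation, and the learning rate parameter are all hypothetical choices made for illustration.

```python
def score(weights, feats):
    """Dot product between a sparse weight dict and a sparse feature dict."""
    return sum(weights.get(f, 0.0) * v for f, v in feats.items())


def max_violation_update(weights, gold_prefixes, beam_prefixes, lr=1.0):
    """Apply one perceptron update at the step with the largest violation.

    gold_prefixes[i]: features of the reference-reachable prefix at step i
                      (assumed to come from forced decoding).
    beam_prefixes[i]: features of the highest-scoring beam prefix at step i.
    Returns the chosen step index, or None if no step is in violation.
    """
    best_step, best_violation = None, None
    for i, (gold, beam) in enumerate(zip(gold_prefixes, beam_prefixes)):
        if gold == beam:
            continue  # identical prefixes cannot be in violation
        # A violation means the beam prefix scores at least as high as gold.
        violation = score(weights, beam) - score(weights, gold)
        if violation >= 0 and (best_violation is None or violation > best_violation):
            best_step, best_violation = i, violation
    if best_step is None:
        return None
    # Standard perceptron step: reward gold features, penalize beam features.
    for f, v in gold_prefixes[best_step].items():
        weights[f] = weights.get(f, 0.0) + lr * v
    for f, v in beam_prefixes[best_step].items():
        weights[f] = weights.get(f, 0.0) - lr * v
    return best_step
```

Updating at the step where the model's preference for the wrong prefix is largest, rather than on the full sequence, is what lets this family of updates stay valid under inexact (beam) search. In a real system the two prefix sequences would be produced by the decoder itself; here they are simply passed in.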
