Optimization for Statistical Machine Translation: A Survey

In statistical machine translation (SMT), the optimization of the system parameters to maximize translation accuracy is now a fundamental part of virtually all modern systems. In this article, we survey 12 years of research on optimization for SMT, from the seminal work on discriminative models (Och and Ney 2002) and minimum error rate training (Och 2003), to the most recent advances. Starting with a brief introduction to the fundamentals of SMT systems, we follow by covering a wide variety of optimization algorithms for use in both batch and online optimization. Specifically, we discuss losses based on direct error minimization, maximum likelihood, maximum margin, risk minimization, ranking, and more, along with the appropriate methods for minimizing these losses. We also cover recent topics, including large-scale optimization, nonlinear models, domain-dependent optimization, and the effect of MT evaluation measures or search on optimization. Finally, we discuss the current state of affairs in MT optimization, and point out some unresolved problems that will likely be the target of further research in optimization for MT.

[1]  Chris Dyer Two monolingual parses are better than one (synchronous parse) , 2010, HLT-NAACL.

[2]  Richard M. Schwartz,et al.  BBN System Description for WMT10 System Combination Task , 2010, WMT@ACL.

[3]  Alon Lavie,et al.  Learning from Post-Editing: Online Model Adaptation for Statistical Machine Translation , 2014, EACL.

[4]  Mo Yu,et al.  Locally Training the Log-Linear Model for SMT , 2012, EMNLP.

[5]  Expected Error Minimization with Ultraconservative Update for SMT , 2012, COLING.

[6]  Jason Eisner,et al.  Parameter Estimation for Probabilistic Finite-State Transducers , 2002, ACL.

[7]  Stephen J. Wright,et al.  Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent , 2011, NIPS.

[8]  Noah A. Smith,et al.  Structured Ramp Loss Minimization for Machine Translation , 2012, HLT-NAACL.

[9]  Andrew McCallum,et al.  Machine Translation Using Overlapping Alignments and SampleRank , 2009 .

[10]  Christopher D. Manning,et al.  A Simple and Effective Hierarchical Phrase Reordering Model , 2008, EMNLP.

[11]  John Langford,et al.  Slow Learners are Fast , 2009, NIPS.

[12]  Kevin Knight,et al.  11,001 New Features for Statistical Machine Translation , 2009, NAACL.

[13]  Lemao Liu,et al.  Additive Neural Networks for Statistical Machine Translation , 2013, ACL.

[14]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[15]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[16]  Alon Lavie,et al.  One System, Many Domains: Open-Domain Statistical Machine Translation via Feature Augmentation , 2012, AMTA.

[17]  Kristina Toutanova,et al.  Regularized Minimum Error Rate Training , 2013, EMNLP.

[18]  Daniel Jurafsky,et al.  Regularization and Search for Minimum Error Rate Training , 2008, WMT@ACL.

[19]  S. Sathiya Keerthi,et al.  Deterministic annealing for semi-supervised kernel machines , 2006, ICML.

[20]  Yinggong Zhao,et al.  Adaptive Development Data Selection for Log-linear Model in Statistical Machine Translation , 2010, COLING.

[21]  Lemao Liu,et al.  Search-Aware Tuning for Machine Translation , 2014, EMNLP.

[22]  Brian Roark,et al.  Minimum Imputed-Risk: Unsupervised Discriminative Training for Machine Translation , 2011, EMNLP.

[23]  Christopher D. Manning,et al.  Fast and Adaptive Online Training of Feature-Rich Translation Models , 2013, ACL.

[24]  Chris Callison-Burch,et al.  Feasibility of Human-in-the-loop Minimum Error Rate Training , 2009, EMNLP.

[25]  Taro Watanabe,et al.  Online Large-Margin Training for Statistical Machine Translation , 2007, EMNLP.

[26]  Josef van Genabith,et al.  Simple and Effective Parameter Tuning for Domain Adaptation of Statistical Machine Translation , 2012, COLING.

[27]  Ralph Weischedel,et al.  A Study of Translation Error Rate with Targeted Human Annotation , 2005 .

[28]  Chris Quirk,et al.  Optimal Search for Minimum Error Rate Training , 2011, EMNLP.

[29]  Hermann Ney,et al.  Forced Derivations for Hierarchical Machine Translation , 2012, COLING.

[30]  Christopher D. Manning,et al.  An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation , 2014, WMT@ACL.

[31]  Dan Klein,et al.  Parsing and Hypergraphs , 2001, IWPT.

[32]  Jean-Cédric Chappelier,et al.  A Generalized CYK Algorithm for Parsing Stochastic CFG , 1998, TAPD.

[33]  John Shawe-Taylor,et al.  Kernel Regression Based Machine Translation , 2007, NAACL.

[34]  François Yvon,et al.  Computing Lattice BLEU Oracle Scores for Machine Translation , 2012, EACL.

[35]  Chris Quirk,et al.  Random Restarts in Minimum Error Rate Training for Statistical Machine Translation , 2008, COLING.

[36]  Noah A. Smith,et al.  Feature-Rich Translation by Quasi-Synchronous Lattice Parsing , 2009, EMNLP.

[37]  Hermann Ney,et al.  Complexity of Finding the BLEU-optimal Hypothesis in a Confusion Network , 2008, EMNLP.

[38]  Peter L. Bartlett,et al.  Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks , 2008, J. Mach. Learn. Res..

[39]  Taro Watanabe,et al.  Structural support vector machines for log-linear approach in statistical machine translation , 2009, IWSLT.

[40]  Patrick Nguyen,et al.  Training Non-Parametric Features for Statistical Machine Translation , 2007, WMT@ACL.

[41]  Mauro Cettolo,et al.  Online Learning Approaches in Computer Assisted Translation , 2013, WMT@ACL.

[42]  Timothy Baldwin,et al.  Is Machine Translation Getting Better over Time? , 2014, EACL.

[43]  Dekai Wu,et al.  Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora , 1997, CL.

[44]  Stanley F. Chen,et al.  A Gaussian Prior for Smoothing Maximum Entropy Models , 1999 .

[45]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[46]  Hwee Tou Ng,et al.  Better Evaluation Metrics Lead to Better Machine Translation , 2011, EMNLP.

[47]  Philip Resnik,et al.  A formal model of ambiguity and its applications in machine translation , 2010 .

[48]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization , 1999, Advances in Kernel Methods.

[49]  Koby Crammer,et al.  Online Large-Margin Training of Dependency Parsers , 2005, ACL.

[50]  Kevin Knight,et al.  A Syntax-based Statistical Translation Model , 2001, ACL.

[51]  Gregory N. Hullender,et al.  Learning to rank using gradient descent , 2005, ICML.

[52]  Christopher D. Manning,et al.  Stanford University's Submissions to the WMT 2014 Translation Task , 2014, WMT@ACL.

[53]  Michel Galley,et al.  Direct Error Rate Minimization for Statistical Machine Translation , 2012, WMT@NAACL-HLT.

[54]  Kristina Toutanova,et al.  Learning Non-linear Features for Machine Translation Using Gradient Boosting Machines , 2013, ACL.

[55]  Dekai Wu,et al.  Improving machine translation by training against an automatic semantic frame based evaluation metric , 2013, ACL.

[56]  Anoop Sarkar,et al.  Stacking for Statistical Machine Translation , 2013, ACL.

[57]  Thomas Ottmann,et al.  Algorithms for Reporting and Counting Geometric Intersections , 1979, IEEE Transactions on Computers.

[58]  Philipp Koehn,et al.  Findings of the 2011 Workshop on Statistical Machine Translation , 2011, WMT@EMNLP.

[59]  Stephen J. Wright,et al.  Conjugate Gradient Methods , 1999 .

[60]  Bowen Zhou,et al.  Discriminative Training of 150 Million Translation Parameters and Its Application to Pruning , 2013, HLT-NAACL.

[61]  James Henderson,et al.  Heuristic Search for Non-Bottom-Up Tree Structure Prediction , 2011, EMNLP.

[62]  R. Tibshirani,et al.  Regression shrinkage and selection via the lasso: a retrospective , 2011 .

[63]  Bing Zhao,et al.  A Simplex Armijo Downhill Algorithm for Optimizing Statistical Machine Translation Decoding Parameters , 2009, NAACL.

[64]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[65]  Chin-Yew Lin,et al.  ORANGE: a Method for Evaluating Automatic Evaluation Metrics for Machine Translation , 2004, COLING.

[66]  Alexander H. Waibel,et al.  Training and Evaluating Error Minimization Decision Rules for Statistical Machine Translation , 2005, ParallelText@ACL.

[67]  Markus Dreyer,et al.  APRO: All-Pairs Ranking Optimization for MT Tuning , 2015, NAACL.

[68]  Tie-Yan Liu,et al.  Learning to rank: from pairwise approach to listwise approach , 2007, ICML.

[69]  Koby Crammer,et al.  Adaptive regularization of weight vectors , 2009, Machine Learning.

[70]  John DeNero,et al.  Fast Consensus Decoding over Translation Forests , 2009, ACL.

[71]  Hermann Ney,et al.  Are Very Large N-Best Lists Useful for SMT? , 2007, HLT-NAACL.

[72]  Phil Blunsom,et al.  Probabilistic Inference for Machine Translation , 2008, EMNLP.

[73]  Chih-Jen Lin,et al.  A dual coordinate descent method for large-scale linear SVM , 2008, ICML.

[74]  Zhifei Li,et al.  First- and Second-Order Expectation Semirings with Applications to Minimum-Risk Training on Translation Forests , 2009, EMNLP.

[75]  Vladimir Eidelman,et al.  The University of Maryland Statistical Machine Translation System for the Fourth Workshop on Machine Translation , 2009, WMT@ACL.

[76]  Barry Haddow,et al.  Applying Pairwise Ranked Optimisation to Improve the Interpolation of Translation Models , 2013, NAACL.

[77]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[78]  Klaus Obermayer,et al.  Support vector learning for ordinal regression , 1999 .

[79]  Philipp Koehn,et al.  Statistical Machine Translation , 2010, EAMT.

[80]  Phil Blunsom,et al.  A Discriminative Latent Variable Model for Statistical Machine Translation , 2008, ACL.

[81]  Yoram Singer,et al.  An Efficient Boosting Algorithm for Combining Preferences , 2013 .

[82]  Shankar Kumar,et al.  Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation , 2008, EMNLP.

[83]  Tong Zhang,et al.  A Discriminative Global Training Algorithm for Statistical MT , 2006, ACL.

[84]  Jorge Nocedal,et al.  On the limited memory BFGS method for large scale optimization , 1989, Math. Program..

[85]  Ben Taskar,et al.  An End-to-End Discriminative Approach to Machine Translation , 2006, ACL.

[86]  Chris Dyer,et al.  Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT , 2012, ACL.

[87]  Kevin Duh,et al.  Multi-Metric Optimization Using Ensemble Tuning , 2013, NAACL.

[88]  Yoram Singer,et al.  Efficient Online and Batch Learning Using Forward Backward Splitting , 2009, J. Mach. Learn. Res..

[89]  Gideon S. Mann,et al.  Distributed Training Strategies for the Structured Perceptron , 2010, NAACL.

[90]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[91]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[92]  Koby Crammer,et al.  Online Passive-Aggressive Algorithms , 2003, J. Mach. Learn. Res..

[93]  David A. Smith,et al.  Minimum Risk Annealing for Training Log-Linear Models , 2006, ACL.

[94]  Yang Liu,et al.  Fast Generation of Translation Forest for Large-Scale SMT Discriminative Training , 2011, EMNLP.

[95]  Philipp Koehn,et al.  SampleRank Training for Phrase-Based Machine Translation , 2011, WMT@EMNLP.

[96]  Kevin Duh,et al.  Beyond Log-Linear Models: Boosted Minimum Error Rate Training for N-best Re-ranking , 2008, ACL.

[97]  Xinyan Xiao,et al.  Max-Margin Synchronous Grammar Induction for Machine Translation , 2013, EMNLP.

[98]  Holger Schwenk,et al.  Optimising Multiple Metrics with MERT , 2011, Prague Bull. Math. Linguistics.

[99]  George F. Foster,et al.  Batch Tuning Strategies for Statistical Machine Translation , 2012, NAACL.

[100]  François Yvon,et al.  Minimum Error Rate Training Semiring , 2011, EAMT.

[101]  William H. Press,et al.  Numerical Recipes 3rd Edition: The Art of Scientific Computing , 2007 .

[102]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[103]  Shankar Kumar,et al.  Efficient Minimum Error Rate Training and Minimum Bayes-Risk Decoding for Translation Hypergraphs and Lattices , 2009, ACL/IJCNLP.

[104]  Daniel Jurafsky,et al.  The Best Lexical Metric for Phrase-Based Statistical MT System Optimization , 2010, NAACL.

[105]  David Chiang,et al.  Hierarchical Phrase-Based Translation , 2007, CL.

[106]  Yang Guo,et al.  Structured Perceptron with Inexact Search , 2012, NAACL.

[107]  Kevin Duh,et al.  Distributed Minimum Error Rate Training of SMT using Particle Swarm Optimization , 2011, IJCNLP.

[108]  Philipp Koehn,et al.  Experiments in Domain Adaptation for Statistical Machine Translation , 2007, WMT@ACL.

[109]  Kishore Papineni Discriminative training via linear programming , 1999, ICASSP.

[110]  Hermann Ney,et al.  Generation of Word Graphs in Statistical Machine Translation , 2002, EMNLP.

[111]  Ming Zhou,et al.  Multi-Domain Adaptation for SMT Using Multi-Task Learning , 2013, EMNLP.

[112]  Liang Huang,et al.  A Syntax-Directed Translator with Extended Domain of Locality , 2006 .

[113]  Gholamreza Haffari,et al.  Transductive learning for statistical machine translation , 2007, ACL.

[114]  I. Dan Melamed,et al.  Scalable Discriminative Learning for Natural Language Parsing and Translation , 2006, NIPS.

[115]  Lin Xiao,et al.  Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization , 2009, J. Mach. Learn. Res..

[116]  Michael Collins,et al.  A Discriminative Model for Tree-to-Tree Translation , 2006, EMNLP.

[117]  Hermann Ney,et al.  Training Phrase Translation Models with Leaving-One-Out , 2010, ACL.

[118]  Jaime G. Carbonell,et al.  Large-Scale Discriminative Training for Statistical Machine Translation Using Held-Out Line Search , 2013, HLT-NAACL.

[119]  Roland Kuhn,et al.  Stabilizing Minimum Error Rate Training , 2009, WMT@EACL.

[120]  Wolfgang Macherey,et al.  Lattice-based Minimum Error Rate Training for Statistical Machine Translation , 2008, EMNLP.

[121]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[122]  Liang Zhou,et al.  Re-evaluating Machine Translation Results with Paraphrase Support , 2006, EMNLP.

[123]  Liang Huang,et al.  Statistical Syntax-Directed Translation with Extended Domain of Locality , 2006, AMTA.

[124]  Daniel Marcu,et al.  HyTER: Meaning-Equivalent Semantics for Translation Evaluation , 2012, NAACL.

[125]  Philipp Koehn,et al.  Findings of the 2014 Workshop on Statistical Machine Translation , 2014, WMT@ACL.

[126]  Guodong Zhou,et al.  Transductive Minimum Error Rate Training for Statistical Machine Translation , 2011, IJCNLP.

[127]  Bing Xiang,et al.  Discriminative Feature-Tied Mixture Modeling for Statistical Machine Translation , 2011, ACL.

[128]  Preslav Nakov,et al.  Optimizing for Sentence-Level BLEU+1 Yields Short Translations , 2012, COLING.

[129]  John DeNero,et al.  Consensus Training for Consensus Decoding in Machine Translation , 2009, EMNLP.

[130]  Pierre Priouret,et al.  Adaptive Algorithms and Stochastic Approximations , 1990, Applications of Mathematics.

[131]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[132]  Anoop Sarkar,et al.  Discriminative Reranking for Machine Translation , 2004, NAACL.

[133]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[134]  Haitao Mi,et al.  Max-Violation Perceptron and Forced Decoding for Scalable MT Training , 2013, EMNLP.

[135]  David Chiang,et al.  Hope and Fear for Discriminative Training of Statistical Translation Models , 2012, J. Mach. Learn. Res..

[136]  Jianfeng Gao,et al.  Training MRF-Based Phrase Translation Models using Gradient Ascent , 2013, NAACL.

[137]  Francisco Casacuberta,et al.  Applying boosting to statistical machine translation , 2008, EAMT.

[138]  Bowen Zhou,et al.  A Corpus Level MIRA Tuning Strategy for Machine Translation , 2013, EMNLP.

[139]  Hermann Ney,et al.  Discriminative Training and Maximum Entropy Models for Statistical Machine Translation , 2002, ACL.

[140]  Christoph Tillmann,et al.  A Unigram Orientation Model for Statistical Machine Translation , 2004, NAACL.

[141]  Mark Hopkins,et al.  Tuning as Ranking , 2011, EMNLP.

[142]  Yoram Singer,et al.  Adaptive Subgradient Methods for Online Learning and Stochastic Optimization , 2011, J. Mach. Learn. Res..

[143]  Richard M. Schwartz,et al.  Expected BLEU Training for Graphs: BBN System Description for WMT11 System Combination Task , 2011, WMT@EMNLP.

[144]  Philip Resnik,et al.  Soft Syntactic Constraints for Hierarchical Phrased-Based Translation , 2008, ACL.

[145]  Vladimir Eidelman,et al.  Online Relative Margin Maximization for Statistical Machine Translation , 2013, ACL.

[146]  Alon Lavie,et al.  Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability , 2011, ACL.

[147]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[148]  Sanjeev Khudanpur,et al.  Forest Reranking for Machine Translation with the Perceptron Algorithm , 2009 .

[149]  Matthew G. Snover,et al.  A Study of Translation Edit Rate with Targeted Human Annotation , 2006, AMTA.

[150]  David Chiang,et al.  Forest Rescoring: Faster Decoding with Integrated Language Models , 2007, ACL.

[151]  Yifan He,et al.  Improving the Objective Function in Minimum Error Rate Training , 2009, MTSUMMIT.

[152]  Preslav Nakov,et al.  A Tale about PRO and Monsters , 2013, ACL.

[153]  Gregory Shakhnarovich,et al.  A Systematic Exploration of Diversity in Machine Translation , 2013, EMNLP.

[154]  Philip Resnik,et al.  Online Large-Margin Training of Syntactic and Structural Translation Features , 2008, EMNLP.

[155]  Tiejun Zhao,et al.  Forced Decoding for Minimum Error Rate Training in Statistical Machine Translation , 2012 .

[156]  Li Deng,et al.  Maximum Expected BLEU Training of Phrase and Lexicon Translation Models , 2012, ACL.

[157]  Jianfeng Gao,et al.  Scalable training of L1-regularized log-linear models , 2007, ICML.

[158]  Nan Duan,et al.  The Feature Subspace Method for SMT System Combination , 2009, EMNLP.

[159]  Koby Crammer,et al.  Ultraconservative Online Algorithms for Multiclass Problems , 2001, J. Mach. Learn. Res..

[160]  Vladimir Eidelman,et al.  Optimization Strategies for Online Large-Margin Learning in Machine Translation , 2012, WMT@NAACL-HLT.

[161]  Stephan Vogel,et al.  Considerations in maximum mutual information and minimum classification error training for statistical machine translation , 2005, EAMT.

[162]  Daniel Gildea,et al.  Tuning as Linear Regression , 2012, HLT-NAACL.

[163]  Jeffrey Heer,et al.  Human Effort and Machine Learnability in Computer Aided Translation , 2014, EMNLP.

[164]  Hermann Ney,et al.  The RWTH Aachen Machine Translation System for WMT 2010 , 2010, IWSLT.

[165]  Hermann Ney,et al.  A Systematic Comparison of Training Criteria for Statistical Machine Translation , 2007, EMNLP-CoNLL.

[166]  Taro Watanabe,et al.  Optimized Online Rank Learning for Machine Translation , 2012, NAACL.

[167]  François Yvon,et al.  Non-linear n-best List Reranking with Few Features , 2012, AMTA.

[168]  Germán Sanchis-Trilles,et al.  Log-linear weight optimisation via Bayesian Adaptation in Statistical Machine Translation , 2010, COLING.

[169]  Brian Roark,et al.  Incremental Parsing with the Perceptron Algorithm , 2004, ACL.

[170]  Kevin Duh,et al.  Learning to Translate with Multiple Objectives , 2012, ACL.

[171]  Germán Sanchis-Trilles,et al.  Bayesian Adaptation for Statistical Machine Translation , 2010, SSPR/SPR.

[172]  Nicola Cancedda,et al.  Minimum Error Rate Training by Sampling the Translation Lattice , 2010, EMNLP.

[173]  Lemao Liu,et al.  Tuning SMT with a Large Number of Features via Online Feature Grouping , 2013, IJCNLP.

[174]  Alon Lavie,et al.  Locally Non-Linear Learning for Statistical Machine Translation via Discretization and Structured Regularization , 2014, Transactions of the Association for Computational Linguistics.

[175]  Ben Taskar,et al.  Learning structured prediction models: a large margin approach , 2005, ICML.

[176]  Avneesh Singh Saluja Machine Translation with Binary Feedback: a Large-Margin Approach , 2012, AMTA.