Optimal Search for Minimum Error Rate Training

Minimum error rate training (MERT) is a crucial component of many state-of-the-art NLP applications, such as machine translation and speech recognition. However, common evaluation functions such as BLEU or word error rate are generally highly non-convex and thus prone to search errors. In this paper, we present LP-MERT, an exact search algorithm for minimum error rate training that reaches the global optimum using a series of reductions to linear programming. Given a set of N-best lists produced from S input sentences, the algorithm finds a linear model that is globally optimal with respect to this set. Its running time is polynomial in N and in the size of the model, but exponential in S. We present extensions of this work that scale to reasonably large tuning sets (e.g., one thousand sentences), either by searching only promising regions of the parameter space or by using a variant of LP-MERT that relies on a beam-search approximation. Experimental results show improvements over the standard algorithm of Och (2003).
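The core feasibility question behind such an LP reduction can be illustrated in a toy setting: given a hypothesis from an N-best list, is there any weight vector under which it outscores every rival? The hypothesis can be made top-scoring iff some w satisfies w . d > 0 for every feature-difference vector d between it and a rival. The sketch below checks this for a two-feature model, where the condition reduces to the difference vectors fitting inside an open half-plane; this is a hedged illustration (the function names separable and can_be_top are ours), not the paper's general-dimension LP construction.

```python
import math

def separable(diffs, eps=1e-12):
    """Return True if some weight vector w satisfies w . d > 0 for every
    2-D difference vector d in diffs.

    Geometrically, the d's must all fit inside an open half-plane, which
    holds iff the largest angular gap between consecutive vectors
    (sorted by angle, with wrap-around) exceeds pi.
    """
    if any(abs(x) < eps and abs(y) < eps for x, y in diffs):
        return False  # a zero difference can never satisfy w . d > 0
    angles = sorted(math.atan2(y, x) for x, y in diffs)
    gaps = [b - a for a, b in zip(angles, angles[1:])]
    gaps.append(2 * math.pi - (angles[-1] - angles[0]))  # wrap-around gap
    return max(gaps) > math.pi + eps

def can_be_top(hyp, rivals):
    """Can the hypothesis with feature vector hyp outscore every rival
    hypothesis under some linear model w?  (2-feature toy case.)"""
    hx, hy = hyp
    return separable([(hx - rx, hy - ry) for rx, ry in rivals])

# A hypothesis that some model prefers over both rivals:
print(can_be_top((2, 1), [(1, 2), (0, 0)]))    # feasible
# Rivals on opposite sides make the constraint set infeasible:
print(can_be_top((0, 0), [(1, 1), (-1, -1)]))  # infeasible
```

In higher dimensions the same question is a linear-programming feasibility problem over the constraints w . d > 0, which is the form LP-MERT hands to an LP solver; the angular-gap trick is a 2-D shortcut only.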
