First- and Second-Order Expectation Semirings with Applications to Minimum-Risk Training on Translation Forests

Many statistical translation models can be regarded as weighted logical deduction. Under this paradigm, we use weights from the expectation semiring (Eisner, 2002), to compute first-order statistics (e.g., the expected hypothesis length or feature counts) over packed forests of translations (lattices or hypergraphs). We then introduce a novel second-order expectation semiring, which computes second-order statistics (e.g., the variance of the hypothesis length or the gradient of entropy). This second-order semiring is essential for many interesting training paradigms such as minimum risk, deterministic annealing, active learning, and semi-supervised learning, where gradient descent optimization requires computing the gradient of entropy or risk. We use these semirings in an open-source machine translation toolkit, Joshua, enabling minimum-risk training for a benefit of up to 1.0 bleu point.

[1]  Shankar Kumar,et al.  Lattice Minimum Bayes-Risk Decoding for Statistical Machine Translation , 2008, EMNLP.

[2]  Yang Liu,et al.  Tree-to-String Alignment Template for Statistical Machine Translation , 2006, ACL.

[3]  Kevin Knight,et al.  11,001 New Features for Statistical Machine Translation , 2009, NAACL.

[4]  K. Rose Deterministic annealing for clustering, compression, classification, regression, and related optimization problems , 1998, Proc. IEEE.

[5]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[6]  Clifford,et al.  Preliminary Sketch of Biquaternions , 1871 .

[7]  J. Baker Trainable grammars for speech recognition , 1979 .

[8]  Sanjeev Khudanpur,et al.  Forest Reranking for Machine Translation with the Perceptron Algorithm , 2009 .

[9]  Dale Schuurmans,et al.  Semi-Supervised Conditional Random Fields for Improved Sequence Segmentation and Labeling , 2006, ACL.

[10]  Giorgio Gallo,et al.  Directed Hypergraphs and Applications , 1993, Discret. Appl. Math..

[11]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[12]  Chris Callison-Burch,et al.  Demonstration of Joshua: An Open Source Toolkit for Parsing-based Machine Translation , 2009, ACL.

[13]  Noah A. Smith,et al.  Compiling Comp Ling: Weighted Dynamic Programming and the Dyna Language , 2005, HLT.

[14]  David A. Smith,et al.  Minimum Risk Annealing for Training Log-Linear Models , 2006, ACL.

[15]  Liang Huang,et al.  Forest Reranking: Discriminative Parsing with Non-Local Features , 2008, ACL.

[16]  David H. D. Warren,et al.  Parsing as Deduction , 1983, ACL.

[17]  Daniel Marcu,et al.  Scalable Inference and Training of Context-Rich Syntactic Translation Models , 2006, ACL.

[18]  Jason Eisner,et al.  Parameter Estimation for Probabilistic Finite-State Transducers , 2002, ACL.

[19]  David Chiang,et al.  Better k-best Parsing , 2005, IWPT.

[20]  Barak A. Pearlmutter,et al.  Lazy multivariate higher-order forward-mode AD , 2007, POPL '07.

[21]  Dan Klein,et al.  Parsing and Hypergraphs , 2001, IWPT.

[22]  Sanjeev Khudanpur,et al.  Variational Decoding for Statistical Machine Translation , 2009, ACL.

[23]  S. Khudanpur,et al.  Large-scale Discriminative n-gram Language Models for Statistical Machine Translation , 2008, AMTA.

[24]  Stuart M. Shieber,et al.  Principles and Implementation of Deductive Parsing , 1994, J. Log. Program..

[25]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[26]  Ronald Rosenfeld,et al.  Adaptive Language Modeling Using the Maximum Entropy Principle , 1993, HLT.

[27]  John DeNero,et al.  Consensus Training for Consensus Decoding in Machine Translation , 2009, EMNLP.

[28]  Yoshua Bengio,et al.  Semi-supervised Learning by Entropy Minimization , 2004, CAP.

[29]  David Chiang,et al.  Hierarchical Phrase-Based Translation , 2007, CL.

[30]  Jason Eisner,et al.  Learning Non-Isomorphic Tree Mappings for Machine Translation , 2003, ACL.

[31]  Joshua Goodman,et al.  Semiring Parsing , 1999, CL.

[32]  Chris Quirk,et al.  Dependency Treelet Translation: Syntactically Informed Phrasal SMT , 2005, ACL.

[33]  Noah A. Smith,et al.  Cube Summing, Approximate Inference with Non-Local Features, and Dynamic Programming without Semirings , 2009, EACL.

[34]  Adam Lopez,et al.  Translation as Weighted Deduction , 2009, EACL.

[35]  Chiori Hori,et al.  Overview of the IWSLT 2005 Evaluation Campaign , 2005, IWSLT.