Multi-Task Learning for Improved Discriminative Training in SMT

Multi-task learning has been shown to be effective in various applications, including discriminative SMT. We present an experimental evaluation of whether multi-task learning depends on a “natural” division of data into tasks that balance shared and individual knowledge, or whether its inherent regularization makes it a broadly applicable remedy against overfitting. To investigate this question, we compare “natural” tasks, defined as sections of the International Patent Classification, with “random” tasks, defined as random shards, in the context of patent SMT. We find that both versions of multi-task learning improve equally well over independent and pooled baselines, and gain nearly 2 BLEU points over standard MERT tuning.
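The “inherent regularization” of multi-task learning is commonly realized as a group-norm penalty that couples each feature's weights across tasks, so that features useless in every task are driven jointly to zero. The following is a minimal illustrative sketch of such an ℓ1/ℓ2 penalty and its proximal (shrinkage) operator, not the paper's actual implementation; the matrix layout (rows = tasks, columns = features) and function names are assumptions for illustration.

```python
import numpy as np

def l1_l2_group_norm(W):
    """l1/l2 group norm of a task-by-feature weight matrix W:
    the sum, over feature columns, of each column's l2 norm across tasks."""
    return float(np.sum(np.linalg.norm(W, axis=0)))

def prox_group(W, tau):
    """Proximal operator of tau * l1/l2 group norm: shrink each feature
    column jointly across tasks; columns whose cross-task norm is below
    tau are zeroed for all tasks (joint feature selection)."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)          # per-feature norms
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return W * scale

# Two tasks, two features: feature 0 is strong in both tasks,
# feature 1 is weak in both and gets pruned jointly.
W = np.array([[3.0, 0.1],
              [4.0, 0.1]])
W_shrunk = prox_group(W, tau=0.5)
```

In a proximal-gradient setting, each stochastic gradient step on the per-task losses would be followed by one `prox_group` application, which is what ties the otherwise independent task learners together.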
