Multi-domain Adaptation for Statistical Machine Translation Based on Feature Augmentation

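Domain adaptation is one of the major challenges in putting machine translation to practical use. This paper proposes an adaptation method for statistical machine translation that assumes multiple domains. The method jointly uses a corpus-concatenated model, which has wide coverage (few unknown words), and single-domain models, whose feature functions are accurate. These models are combined following the idea of feature augmentation, a technique used for domain adaptation in machine learning. Whereas previous applications of feature augmentation to machine translation used a single model, the proposed method uses multiple models and thereby exploits the advantages of both. In experiments, translation quality improved over, or remained on a par with, that of the single-domain models. The proposed method is highly effective when the training corpus of the target domain is small. Even when applied to a domain with a large corpus on the scale of one million sentences, it does not degrade translation quality and, depending on the domain, improves it. We show that even with a basic log-linear model, state-of-the-art adaptation can be achieved by selecting and configuring the models appropriately.

Keywords: domain adaptation, phrase-based statistical machine translation, feature augmentation, corpus-concatenated model, empty value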
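The combination described in the abstract can be illustrated with a minimal sketch of feature augmentation for a log-linear model. It is an illustration under stated assumptions, not the paper's implementation: the function names `augment` and `loglinear_score`, the domain labels, and the `empty_value` placeholder (used here when a phrase pair has no single-domain features) are all hypothetical, and the abstract does not say how the paper actually defines or tunes its empty value. The sketch places the corpus-concatenated model's features in a block shared by all domains and the single-domain model's features in a block that is active only for the matching domain.

```python
from typing import List, Optional, Sequence


def augment(
    common_feats: Sequence[float],
    domain_feats: Optional[Sequence[float]],
    domain: str,
    domains: Sequence[str],
    empty_value: float = 0.0,  # hypothetical placeholder, an assumption of this sketch
) -> List[float]:
    """Build an augmented feature vector: one common block plus one block per domain.

    common_feats: features from the corpus-concatenated model (shared block).
    domain_feats: features from the single-domain model, or None if the
                  phrase pair is missing from that model.
    """
    if domain not in domains:
        raise ValueError(f"unknown domain: {domain}")
    n = len(common_feats)
    if domain_feats is None:
        # Assumption: fill the missing in-domain features with a constant.
        domain_feats = [empty_value] * n
    if len(domain_feats) != n:
        raise ValueError("common and domain feature vectors must have equal length")

    augmented = list(common_feats)  # common block, used by every domain
    for d in domains:
        # Only the block of the current domain carries values; the rest are zero.
        augmented.extend(domain_feats if d == domain else [0.0] * n)
    return augmented


def loglinear_score(weights: Sequence[float], feats: Sequence[float]) -> float:
    """Log-linear model score: dot product of weights and feature values."""
    if len(weights) != len(feats):
        raise ValueError("weights and features must have equal length")
    return sum(w * f for w, f in zip(weights, feats))


if __name__ == "__main__":
    domains = ["news", "patent"]  # hypothetical domain labels
    vec = augment([0.5, -1.0], [0.7, -0.2], "patent", domains)
    print(vec)
```

Because each domain has its own block of weights, tuning can weight the in-domain features differently per domain while the common block still supplies scores for phrase pairs that only the concatenated corpus covers.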
