Data Selection for Compact Adapted SMT Models

Data selection is a common technique for adapting statistical machine translation models to a specific domain, and it has been shown both to improve translation quality and to reduce model size. Selection relies on a sample of in-domain data drawn from the same domain as the texts expected to be translated. Selecting, from a pool of parallel texts, the sentence pairs that are most similar to the in-domain data has proven effective; yet this approach risks limited coverage, since necessary n-grams that do appear in the pool may not be selected because they are less similar to the in-domain data available in advance. Some methods select additional data based on the actual text that needs to be translated; while useful, this is not always a practical scenario. In this work we describe an extensive exploration of data selection techniques over Arabic-to-French datasets, and propose methods that address both similarity and coverage considerations while maintaining a limited model size.
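The abstract does not commit to a particular similarity score, but a widely used instance of this kind of selection is cross-entropy difference scoring: rank each pool sentence by how much better an in-domain language model predicts it than a general-domain model does, and keep the top-ranked pairs. The sketch below is a minimal, assumed illustration using add-one-smoothed unigram models; the function names and the unigram simplification are my own, not from the paper.

```python
import math
from collections import Counter

def train_unigram(sentences, smoothing=1.0):
    """Train an add-one-smoothed unigram LM; returns a token log-probability function."""
    counts = Counter(tok for s in sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 reserves mass for unseen tokens
    def logprob(token):
        return math.log((counts[token] + smoothing) / (total + smoothing * vocab))
    return logprob

def cross_entropy(sentence, logprob):
    """Per-token negative log-likelihood of a sentence under the model."""
    toks = sentence.split()
    return -sum(logprob(t) for t in toks) / max(len(toks), 1)

def select_similar(pool, in_domain, general, top_k):
    """Rank pool sentences by cross-entropy difference; lower = more in-domain-like."""
    lp_in = train_unigram(in_domain)
    lp_gen = train_unigram(general)
    ranked = sorted(pool,
                    key=lambda s: cross_entropy(s, lp_in) - cross_entropy(s, lp_gen))
    return ranked[:top_k]
```

Scoring the source side of each sentence pair this way and keeping only the top-scoring fraction yields the compact adapted model the abstract describes; the coverage risk it points out arises exactly here, because low-scoring pool sentences are discarded even when they contain n-grams the test text will need.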
