Dealing with Data Sparseness in SMT with Factored Models and Morphological Expansion: a Case Study on Croatian

This paper describes our experience using available linguistic resources for Croatian to address data sparseness when building an English-to-Croatian general-domain phrase-based statistical machine translation system. We report the results obtained with factored translation models and morphological expansion, highlight the impact of the algorithm used to tag the corpora, and show that the improvement brought by these methods is compatible with applying data selection to out-of-domain parallel corpora.
