Morphologically rich languages pose problems for machine translation (MT) systems, including word-alignment errors, data sparsity, and multiple affixes. Current word-level alignment models do not distinguish between words and morphemes, which yields low-quality alignments and in turn degrades end translation quality. Models that use morpheme-level alignment can reduce the vocabulary size of morphologically rich languages and overcome data sparsity. Alignment data based on the smallest units reveals subtle language features and enhances translation quality. Recent research shows such morpheme-level alignment (MA) data to be a valuable linguistic resource for statistical machine translation (SMT), particularly for languages with rich morphology. In support of this research trend, the Linguistic Data Consortium (LDC) created Uzbek-English and Turkish-English alignment data that are manually aligned at the morpheme level. This paper describes the creation of the MA corpora, including the alignment and tagging processes and approaches, and highlights annotation challenges and specific features of languages with rich morphology. The light tagging annotation on the alignment layer adds extra value to the MA data, allowing users to tailor the data flexibly for training various MT models.