Training Part-of-Speech Taggers to Build Machine Translation Systems for Less-Resourced Language Pairs

In this paper we review an unsupervised method that can be used to train the hidden-Markov-model-based part-of-speech taggers used within the open-source shallow-transfer machine translation (MT) engine Apertium. This method uses the remaining modules of the MT engine and a target-language model to obtain part-of-speech taggers that are then used within the Apertium MT engine to produce translations. The experimental results on the Occitan-Catalan language pair (a case study of a less-resourced language pair) show that the amount of corpora needed by this training method is small compared with the usual corpus sizes needed by the standard (unsupervised) Baum-Welch algorithm. This makes the method appropriate for training part-of-speech taggers to be used in MT for less-resourced language pairs. Moreover, the translation performance of the MT system embedding the resulting part-of-speech tagger is comparatively better.
