Statistical models for text normalization and machine translation

Irish and Scottish Gaelic are closely related languages that, together with Manx Gaelic, make up the Goidelic branch of the Celtic family. We present a statistical model for translation from Scottish Gaelic to Irish that we hope will facilitate communication between the two language communities, especially in social media. An important aspect of this work is overcoming the orthographic differences between the languages, many of which were introduced in a major spelling reform of Irish in the 1940s and 1950s. Before that reform, the orthographies of the two languages were quite similar, thanks in part to a shared literary tradition. As a consequence, machine translation from Scottish Gaelic to Irish has a great deal in common with the problem of normalizing pre-standard Irish texts, a problem with applications to lexicography and information retrieval. We show how a single statistical model can be used effectively in both contexts.
