Machine Translation of Low-Resource Spoken Dialects: Strategies for Normalizing Swiss German

The goal of this work is to design a machine translation (MT) system for Swiss German, a low-resource family of dialects that are widely spoken in Switzerland but seldom written. As a starting point, we collected parallel written resources totaling about 60k words, and we identified several other promising data sources for Swiss German. To address the regional diversity of the dialects, we then designed and compared three strategies for normalizing Swiss German input. We found that character-based neural MT was the best solution for text normalization. In combination with phrase-based statistical MT, our solution reached a BLEU score of 36% when translating from the Bernese dialect. This score, however, decreases as the test data becomes more remote from the training data, both geographically and topically. These resources and normalization techniques are a first step towards full MT of Swiss German dialects.
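
For readers unfamiliar with the BLEU metric quoted above, the following is a minimal sketch of how such a score is computed: the geometric mean of clipped n-gram precisions (n = 1 to 4) multiplied by a brevity penalty. It is a simplified, sentence-level, single-reference variant without smoothing, and it is not the evaluation script used in this work; the function names `ngrams` and `bleu` and the toy sentences are illustrative assumptions. Reported scores are normally computed over a whole test corpus and multiplied by 100.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU in [0, 1] (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            # Without smoothing, a zero precision makes the whole score zero.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Toy example (not data from this work).
    score = bleu("the cat sat on the mat", "the cat sat on a mat")
    print(f"BLEU = {100 * score:.1f}%")
```

Because the unsmoothed score drops to zero whenever any n-gram order has no match, this sentence-level form is only suitable for illustration; corpus-level aggregation of the n-gram counts avoids that problem.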
