Normalising orthographic and dialectal variants for the automatic processing of Swiss German

Unlike most dialects of well-standardised languages, Swiss dialects of German are widely used in everyday communication. Nevertheless, they lack tools and resources for natural language processing, mainly because the dialects are mostly spoken and the available written resources are small and highly inconsistent. This paper addresses the great variability in writing, which poses a problem for automatic processing. We propose an automatic approach to normalising the variants to a single representation intended for processing tools’ internal use (not shown to human users). We manually create a sample of transcribed and normalised texts, which we use to train and test three methods based on machine translation: word-by-word mappings, character-based machine translation, and language modelling. We show that an optimal combination of the three approaches gives better results than any one of them alone.
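As a rough illustration of the first of the three methods, word-by-word mapping, the following minimal sketch learns the most frequent normalisation for each written variant from aligned pairs and leaves unseen words to a fallback. The toy pairs, the function names and the identity fallback are assumptions made for this example only; they are not taken from the paper's data or implementation, and in a fuller system the fallback slot is where a character-based translation model could be plugged in.

```python
from collections import Counter, defaultdict

# Toy (variant, normalised) pairs, invented for illustration only.
TOY_PAIRS = [
    ("isch", "ist"),
    ("nöd", "nicht"),
    ("nid", "nicht"),
    ("ned", "nicht"),
    ("gsi", "gewesen"),
    ("gsy", "gewesen"),
]


def train_word_map(pairs):
    """Map each lower-cased variant to its most frequent normalisation."""
    counts = defaultdict(Counter)
    for variant, normalised in pairs:
        counts[variant.lower()][normalised] += 1
    return {v: c.most_common(1)[0][0] for v, c in counts.items()}


def normalise(tokens, word_map, fallback=lambda word: word):
    """Normalise tokens word by word; unseen words go to `fallback`.

    The default fallback returns the word unchanged (an assumption of
    this sketch, standing in for a character-level model).
    """
    result = []
    for token in tokens:
        key = token.lower()
        if key in word_map:
            result.append(word_map[key])
        else:
            result.append(fallback(token))
    return result


word_map = train_word_map(TOY_PAIRS)
normalised = normalise(["es", "isch", "nid", "gsy"], word_map)
```

Here the three variants "nöd", "nid" and "ned" all collapse to the single internal form "nicht", while the unseen word "es" passes through the fallback untouched.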
