Paper information - LSDC - A comprehensive dataset for Low Saxon Dialect Classification

We present a new comprehensive dataset for the unstandardised West-Germanic language Low Saxon, covering the last two centuries, the majority of modern dialects, and various genres. The dataset will be made openly available in connection with the final version of this paper. Because no comparably comprehensive dataset of contemporary Low Saxon exists so far, it is a substantial contribution to NLP research on this language. We also test the use of this dataset for dialect classification by training a few baseline models and comparing statistical and neural approaches. The performance of these models shows that, despite an imbalance in the amount of data per dialect, enough features can be learned to reach a relatively high classification accuracy.
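To make the idea of a statistical baseline for dialect classification concrete, the following is a minimal sketch of a character n-gram Naive Bayes classifier. It is an illustration only: the abstract does not say which models were trained, so the choice of technique, the names `char_ngrams` and `NgramNaiveBayes`, the labels `dialect_a` and `dialect_b`, and the toy strings are all assumptions and are not taken from the LSDC dataset.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return all character n-grams of length n_min..n_max from padded, lowercased text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)       # documents per label (class prior)
        self.ngram_counts = defaultdict(Counter)  # n-gram counts per label
        self.vocab = set()                        # all n-grams seen in training
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.ngram_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            counts = self.ngram_counts[label]
            denom = sum(counts.values()) + vocab_size
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(doc_count / total_docs)
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy placeholder data (hypothetical, NOT real Low Saxon and NOT from LSDC).
train_texts = ["aaa aab aba", "aab aaa", "zzx zxz", "xzz zzx zzz"]
train_labels = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]
model = NgramNaiveBayes().fit(train_texts, train_labels)
```

With real data, the class prior in `predict` is where an imbalance in the amount of data per dialect would enter the decision, so per-dialect metrics may be more informative than overall accuracy alone.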
