Discriminating between Mandarin Chinese and Swiss-German varieties using adaptive language models

This paper describes the language identification systems used by the SUKI team in two shared tasks held as part of the third VarDial Evaluation Campaign: Discriminating between the Mainland and Taiwan variation of Mandarin Chinese (DMT) and German Dialect Identification (GDI). The DMT shared task had two separate tracks, one for the simplified Chinese script and one for the traditional Chinese script. We submitted three runs on each DMT track and three runs on the GDI task. We won the traditional Chinese track using Naive Bayes with language model adaptation, came second on GDI with an adaptive version of the HeLI 2.0 method, and came third on the simplified Chinese track, again using the adaptive Naive Bayes.
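
To illustrate what "Naive Bayes with language model adaptation" can look like, the following is a minimal sketch and not the SUKI team's implementation. It assumes character bigram features, add-alpha smoothing, and one simple adaptation scheme: the test text with the largest score margin between the two best labels is labeled first, its n-grams are added to that label's model, and the remaining texts are rescored. The names `char_ngrams`, `NaiveBayesLM`, and `classify_adaptive` are hypothetical, as are the default parameter values.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the character n-grams of a text padded with one space on each side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NaiveBayesLM:
    """Character n-gram Naive Bayes with add-alpha smoothing (illustrative only)."""

    def __init__(self, n=2, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.counts = defaultdict(Counter)  # label -> n-gram counts

    def add(self, text, label):
        """Add the n-grams of a text to the model of the given label."""
        self.counts[label].update(char_ngrams(text, self.n))

    def scores(self, text):
        """Return the average log-probability per n-gram for every label."""
        vocab = set()
        for label_counts in self.counts.values():
            vocab.update(label_counts)
        vocab_size = len(vocab) + 1  # one extra slot for unseen n-grams
        grams = char_ngrams(text, self.n)
        result = {}
        for label, label_counts in self.counts.items():
            total = sum(label_counts.values())
            denom = total + self.alpha * vocab_size
            log_prob = sum(
                math.log((label_counts[g] + self.alpha) / denom) for g in grams
            )
            # Normalize by length so long texts do not dominate the margin.
            result[label] = log_prob / max(len(grams), 1)
        return result


def classify_adaptive(model, test_texts):
    """Label test texts one at a time, most confident first, adapting the model.

    Confidence is assumed here to be the score margin between the best and
    second-best label; other confidence measures are possible.
    """
    remaining = dict(enumerate(test_texts))
    predictions = {}
    while remaining:
        best = None
        for idx, text in remaining.items():
            ranked = sorted(
                model.scores(text).items(), key=lambda kv: kv[1], reverse=True
            )
            if len(ranked) > 1:
                margin = ranked[0][1] - ranked[1][1]
            else:
                margin = float("inf")
            if best is None or margin > best[0]:
                best = (margin, idx, ranked[0][0])
        _, idx, label = best
        predictions[idx] = label
        # Adaptation step: the newly labeled text becomes training data.
        model.add(remaining.pop(idx), label)
    return [predictions[i] for i in range(len(test_texts))]
```

The sketch rescans all remaining texts after every single adaptation step, which is quadratic in the number of test texts; a practical system would more likely adapt in batches. It would be used by calling `add` on each labeled training text and then passing the unlabeled test texts to `classify_adaptive`.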
