When Simple n-gram Models Outperform Syntactic Approaches: Discriminating between Dutch and Flemish

In this paper we present the results of our participation in the Discriminating between Dutch and Flemish in Subtitles shared task at VarDial 2018. We apply techniques proven to work well for discriminating between language varieties and explore the potential of syntactic features, i.e. hierarchical syntactic subtrees, experimenting with different combinations of features. Discriminating between the two turned out to be a very hard task, and not only for a machine: human performance is only around 0.51 F1 score. Our best system is a simple Naive Bayes model with word unigrams and bigrams. It achieved a macro F1 score of 0.62, which ranked us 4th in the shared task.
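To make the best-performing setup concrete, the following is a minimal sketch of a multinomial Naive Bayes classifier over word unigrams and bigrams, together with a macro-averaged F1 function. It is not the system described in the paper: the class and function names, the add-one smoothing, the whitespace tokenisation, the labels `NL` and `BE`, and the four toy training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n_values=(1, 2)):
    """Return lowercased word unigrams and bigrams (whitespace tokenisation, an assumption)."""
    tokens = text.lower().split()
    feats = []
    for n in n_values:
        feats.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with additive smoothing (alpha is an illustrative default)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.feature_counts[label].update(ngrams(text))
        self.vocab = {f for counts in self.feature_counts.values() for f in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_counts):
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # Log prior plus smoothed log likelihood of each known n-gram.
            score = math.log(self.class_counts[label] / total_docs)
            for feat in ngrams(text):
                if feat in self.vocab:  # n-grams unseen in training are skipped
                    score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Invented toy sentences, not taken from the shared-task data.
train_texts = ["wat doe jij nou", "dat is hartstikke leuk",
               "wat doet ge nu", "dat is heel plezant"]
train_labels = ["NL", "NL", "BE", "BE"]
clf = NaiveBayes().fit(train_texts, train_labels)

test_texts = ["jij bent leuk", "ge zijt plezant"]
predictions = [clf.predict(t) for t in test_texts]
print(predictions, macro_f1(["NL", "BE"], predictions))
```

Macro averaging weights both classes equally regardless of their frequency, which is why it is a natural metric for a two-class task of this kind.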
