HeLI-based Experiments in Discriminating Between Dutch and Flemish Subtitles

This paper presents the experiments and results obtained by the SUKI team in the Discriminating between Dutch and Flemish in Subtitles shared task of the VarDial 2018 Evaluation Campaign. Our best submission was ranked 8th, obtaining a macro F1-score of 0.61. Our best results were produced by a language identifier implementing the HeLI method without any modifications. In addition to describing this best-performing method, we report on some of our experiments with unsupervised clustering.
