Evaluating HeLI with Non-Linear Mappings

In this paper, we describe the non-linear mappings we used with the Helsinki language identification method, HeLI, in the fourth edition of the Discriminating between Similar Languages (DSL) shared task, which was organized as part of the VarDial 2017 workshop. Our SUKI team participated in the closed track together with 10 other teams, and our system reached 7th position in the track. We describe the HeLI method and the non-linear mappings in mathematical notation. The HeLI method uses a probabilistic model with character n-grams and word-based backoff. We also describe our trials using the non-linear mappings instead of relative frequencies, and we present statistics about the backoff function of the HeLI method.
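To make the idea of word-based scoring with backoff to character n-grams concrete, the following is a minimal Python sketch. It is not the SUKI team's implementation: the function names, the padding scheme, the penalty value of 7.0, the maximum n-gram length of 3, and the example power mapping are all assumptions made for illustration. The `mapping` parameter only marks the place where a non-linear mapping could replace the raw relative frequency; the mappings actually evaluated are defined in the paper itself.

```python
import math
from collections import Counter


def relative_freqs(counts):
    """Turn raw counts into relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def char_ngrams(word, n):
    """Character n-grams of a word padded with one space on each side (assumed scheme)."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(texts_by_lang, max_n=3):
    """Build one word model and max_n character n-gram models per language."""
    models = {}
    for lang, text in texts_by_lang.items():
        words = text.lower().split()
        model = {"word": relative_freqs(Counter(words))}
        for n in range(1, max_n + 1):
            grams = Counter(g for w in words for g in char_ngrams(w, n))
            model[n] = relative_freqs(grams)
        models[lang] = model
    return models


def identify(text, models, max_n=3, penalty=7.0, mapping=lambda p: p):
    """Return the language with the lowest total cost over the words of the text.

    Each word is scored with the word models if any language knows it;
    otherwise the scorer backs off to character n-grams, longest first.
    `penalty` (hypothetical value) is the cost of a feature unseen in a language.
    """
    totals = {lang: 0.0 for lang in models}
    for word in text.lower().split():
        levels = [("word", [word])]
        levels += [(n, char_ngrams(word, n)) for n in range(max_n, 0, -1)]
        for level, feats in levels:
            known = any(f in models[lang][level] for lang in models for f in feats)
            if feats and known:
                for lang, model in models.items():
                    table = model[level]
                    costs = [
                        -math.log10(mapping(table[f])) if f in table else penalty
                        for f in feats
                    ]
                    totals[lang] += sum(costs) / len(costs)
                break
        else:
            # No model at any level knows this word: every language pays the penalty.
            for lang in totals:
                totals[lang] += penalty
    return min(totals, key=totals.get)
```

A call such as `identify(text, models, mapping=lambda p: p ** 0.5)` would swap the relative frequency for a simple power mapping; that particular mapping is only a stand-in chosen for this sketch.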
