Iterative Language Model Adaptation for Indo-Aryan Language Identification

This paper presents the experiments and results of the SUKI team in the Indo-Aryan Language Identification shared task of the VarDial 2018 Evaluation Campaign. Although the shared task was open, we used only the corpora distributed by the organizers. A total of eight teams submitted results. Our submission, a language identifier based on the HeLI method with iterative language model adaptation, obtained the best results in the shared task, with a macro F1-score of 0.958.
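For readers unfamiliar with the reported metric, the following is a minimal sketch of how a macro F1-score can be computed: the F1-score is calculated separately for each language and the per-language scores are then averaged without weighting. The function name `macro_f1` and the toy labels are illustrative assumptions, not part of the shared task's official scoring script, which may differ in detail.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 scores (illustrative sketch)."""
    labels = sorted(set(gold) | set(predicted))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical language labels
gold = ["lang_a", "lang_a", "lang_b", "lang_b", "lang_c"]
pred = ["lang_a", "lang_b", "lang_b", "lang_b", "lang_c"]
print(round(macro_f1(gold, pred), 3))
```

Because every language contributes equally to the average, a system cannot reach a high macro F1-score by performing well only on the most frequent languages.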
