Experiments in Cuneiform Language Identification

This paper presents methods to discriminate between languages and dialects written in cuneiform script, one of the first writing systems in the world. We report the results obtained by the PZ team in the Cuneiform Language Identification (CLI) shared task, organized as part of the VarDial Evaluation Campaign 2019. The task covered two languages, Sumerian and Akkadian, with Akkadian divided into six dialects: Old Babylonian, Middle Babylonian peripheral, Standard Babylonian, Neo-Babylonian, Late Babylonian, and Neo-Assyrian. We approach the task with a meta-classifier trained on the outputs of several SVM models, and we evaluate the effectiveness of this system on the task. Our submission achieved an F1 score of 0.738 in discriminating between the seven languages and dialects and ranked fourth among the eight participating teams.
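To illustrate the general idea of a meta-classifier over SVM models, the following is a minimal sketch and not the submitted system. It assumes scikit-learn, character n-gram TF-IDF features, one linear SVM per n-gram range, and a logistic regression meta-learner; none of these choices are specified in the abstract. The function name `build_meta_classifier`, the n-gram ranges, and the toy strings and labels are hypothetical placeholders, not CLI data.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_meta_classifier(ngram_ranges=((1, 1), (1, 2), (1, 3)), cv=2):
    """Stack several character n-gram SVMs under one meta-learner.

    Assumption: each base model is a linear SVM over TF-IDF weighted
    character n-grams; the abstract does not state the actual features.
    """
    base_models = []
    for low, high in ngram_ranges:
        pipeline = make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(low, high)),
            LinearSVC(),
        )
        base_models.append((f"svm_char_{low}_{high}", pipeline))

    # The meta-learner is trained on cross-validated decision scores
    # of the base SVMs (assumed here to be a logistic regression).
    return StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=cv,
    )


# Synthetic placeholder data: two made-up classes, not real cuneiform text.
TOY_TEXTS = [
    "ab ba ab aa", "aa ab ba ba", "ba ba ab aa", "ab aa aa ba",
    "xy yx xy xx", "xx xy yx yx", "yx yx xy xx", "xy xx xx yx",
]
TOY_LABELS = ["A", "A", "A", "A", "B", "B", "B", "B"]

if __name__ == "__main__":
    model = build_meta_classifier()
    model.fit(TOY_TEXTS, TOY_LABELS)
    print(model.predict(["ab ba aa", "xy yx xx"]))
```

In this sketch the base SVMs differ only in their n-gram range, so the meta-learner can weight short and long character patterns differently. For the seven-way CLI setting the same construction would apply with seven labels, provided each cross-validation fold contains every class.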