Tübingen system in VarDial 2017 shared task: experiments with language identification and cross-lingual parsing

This paper describes our systems and results for the VarDial 2017 shared tasks. Besides three language/dialect discrimination tasks, we also participated in the cross-lingual dependency parsing (CLP) task, using a simple methodology that we briefly describe in this paper. For all the discrimination tasks, we used linear SVMs with character and word features. The system achieves competitive results compared with the other systems in the shared task. We also report additional experiments with neural network models. The performance of the neural network models was close to, but always below, that of the corresponding SVM classifiers in the discrimination tasks. For the cross-lingual parsing task, we experimented with an approach based on automatically translating the source treebank into the target language and training a parser on the translated treebank. We used off-the-shelf tools for both translation and parsing. Despite achieving better-than-baseline results, our scores in the CLP tasks were substantially lower than those of the other participants.
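
To make "character and word features" concrete, the following is a minimal sketch of how overlapping character and word n-grams might be counted for a single text before being passed to a linear classifier. It is not the system described above: the function names, the n-gram orders (characters 1 to 3, words 1 to 2), the space padding, and the `c:`/`w:` prefixes are all illustrative assumptions, and no classifier or feature weighting is included.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield overlapping character n-grams of orders n_min..n_max.

    The text is padded with one space on each side so that n-grams at
    the edges are marked (an assumption, not taken from the paper).
    """
    padded = f" {text} "
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            yield "c:" + padded[i:i + n]


def word_ngrams(text, n_min=1, n_max=2):
    """Yield overlapping n-grams over whitespace-separated tokens."""
    tokens = text.split()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield "w:" + " ".join(tokens[i:i + n])


def extract_features(text):
    """Return a sparse bag of character and word n-gram counts."""
    features = Counter(char_ngrams(text))
    features.update(word_ngrams(text))
    return features


if __name__ == "__main__":
    # Print a few of the most frequent features for a toy input.
    for name, count in extract_features("ab cd").most_common(5):
        print(repr(name), count)
```

The prefixes keep character and word n-grams in separate feature spaces, so a one-letter word and the same single character do not collapse into one feature. The resulting counts could then be vectorized and given to any linear SVM implementation.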
