Discriminating between standard Romanian and Moldavian tweets using filtered character ngrams

We applied word unigram models, character ngram models, and CNNs to the task of distinguishing tweets written in two related dialects of Romanian (standard Romanian and Moldavian) for the VarDial 2020 RDI shared task (Găman et al., 2020). The main challenge of the task was cross-genre text classification: the models had to be trained on text from news articles and then used to classify tweets. Our best model was a Naïve Bayes model trained on character ngrams, with the most common ngrams filtered out. We also applied SVMs and CNNs, but although they yielded the best performance on an evaluation dataset of news articles, their accuracy dropped significantly when they were used to classify tweets. Our best model reached an F1 score of 0.715 on the evaluation dataset of tweets, and 0.667 on the held-out test dataset. The model placed third in the shared task.
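
The filtered character ngram approach described above can be illustrated with a minimal sketch. This is not the system submitted to the shared task: the ngram lengths (2 to 4), the number of filtered ngrams (`drop_top`), the add-one smoothing (`alpha`), and the class name `FilteredNgramNB` are all assumptions chosen for illustration, and the training strings are synthetic placeholders rather than real Romanian or Moldavian text.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character ngrams of length n_min..n_max (assumed range)."""
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class FilteredNgramNB:
    """Multinomial Naive Bayes over character ngrams.

    The `drop_top` most frequent ngrams in the training data are removed
    from the vocabulary before the class models are estimated.
    All hyperparameter defaults here are illustrative assumptions.
    """

    def __init__(self, n_min=2, n_max=4, drop_top=2, alpha=1.0):
        self.n_min, self.n_max = n_min, n_max
        self.drop_top, self.alpha = drop_top, alpha

    def fit(self, texts, labels):
        total, per_class = Counter(), {}
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n_min, self.n_max)
            total.update(grams)
            per_class.setdefault(label, Counter()).update(grams)
        # Filter out the most common ngrams across the whole training set.
        self.dropped = {g for g, _ in total.most_common(self.drop_top)}
        self.vocab = set(total) - self.dropped
        self.counts = {
            c: Counter({g: k for g, k in cnt.items() if g in self.vocab})
            for c, cnt in per_class.items()
        }
        self.class_totals = {c: sum(cnt.values())
                             for c, cnt in self.counts.items()}
        label_counts = Counter(labels)
        self.log_prior = {c: math.log(k / len(labels))
                          for c, k in label_counts.items()}
        return self

    def _log_score(self, grams, c):
        # Log prior plus smoothed log likelihood of each retained ngram.
        denom = self.class_totals[c] + self.alpha * len(self.vocab)
        return self.log_prior[c] + sum(
            math.log((self.counts[c][g] + self.alpha) / denom) for g in grams)

    def predict(self, text):
        grams = [g for g in char_ngrams(text, self.n_min, self.n_max)
                 if g in self.vocab]
        return max(self.log_prior, key=lambda c: self._log_score(grams, c))


if __name__ == "__main__":
    # Synthetic placeholder strings standing in for the two dialect classes.
    train_texts = ["aaaa bbbb", "aaab bbba", "cccc dddd", "cccd dddc"]
    train_labels = ["ro", "ro", "md", "md"]
    model = FilteredNgramNB().fit(train_texts, train_labels)
    print(model.predict("aaaa"), model.predict("cccc"))
```

One plausible motivation for the filtering step, in a cross-genre setting, is that the most frequent ngrams may reflect the training genre more than the dialect; the sketch only shows the mechanics and makes no claim about which ngrams were actually removed.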

[1]  Xiang Zhang, et al.  Character-level Convolutional Networks for Text Classification, 2015, NIPS.

[2]  Dirk Hovy, et al.  A Report on the VarDial Evaluation Campaign 2020, 2020, VARDIAL.

[3]  Radu Tudor Ionescu, et al.  The Unreasonable Effectiveness of Machine Learning in Moldavian versus Romanian Dialect Identification, 2020, International Journal of Intelligent Systems.

[4]  Mykola Pechenizkiy, et al.  Graph-Based N-gram Language Identification on Short Texts, 2011.

[5]  D. Tudoreanu.  DTeam @ VarDial 2019: Ensemble based on skip-gram and triplet loss neural networks for Moldavian vs. Romanian cross-dialect topic identification, 2019, Proceedings of the Sixth Workshop on.

[6]  Sanjeev Khudanpur, et al.  A time delay neural network architecture for efficient modeling of long temporal contexts, 2015, INTERSPEECH.

[7]  Joachim Wagner, et al.  Code Mixing: A Challenge for Language Identification in the Language of Social Media, 2014, CodeSwitch@EMNLP.

[8]  Mari Ostendorf, et al.  A Neural Model for Language Identification in Code-Switched Tweets, 2016, CodeSwitch@EMNLP.

[9]  Liang Zou, et al.  Ensemble Methods to Distinguish Mainland and Taiwan Chinese, 2019, Proceedings of the Sixth Workshop on.

[10]  Francis M. Tyers, et al.  A Report on the Third VarDial Evaluation Campaign, 2019, Proceedings of the Sixth Workshop on.

[11]  Krister Lindén, et al.  HeLI, a Word-Based Backoff Method for Language Identification, 2016, VarDial@COLING.

[12]  Theresa Wilson, et al.  Language Identification for Creating Language-Specific Twitter Collections, 2012.

[13]  Timothy Baldwin, et al.  Accurate Language Identification of Twitter Messages, 2014.

[14]  Radu Tudor Ionescu, et al.  MOROCO: The Moldavian and Romanian Dialectal Corpus, 2019, ACL.

[15]  Chu-Ren Huang, et al.  Contrastive Approach towards Text Source Classification based on Top-Bag-of-Word Similarity, 2008, PACLIC.