Paper information - Advances in Ngram-based Discrimination of Similar Languages

Advances in Ngram-based Discrimination of Similar Languages

We describe the systems entered by the National Research Council in the 2016 shared task on discriminating similar languages. As in previous years, we relied on character ngram features and a mixture of discriminative and generative statistical classifiers. In the open task, we mainly investigated how the amount of training data influences performance, and we compared the two-stage approach (predicting the language or group first, then the variant) to a flat approach. Results suggest that ngrams are still state-of-the-art for language and variant identification, and that additional data has a small but decisive impact.
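The abstract names two ingredients, character ngram features and a two-stage (group, then variant) classifier, without giving implementation details. The sketch below is only an illustration of those ideas and is not the authors' system: it assumes a multinomial naive Bayes model with add-one smoothing as a stand-in generative classifier, and the names `char_ngrams`, `NgramNaiveBayes`, `train_two_stage`, `predict_two_stage`, and the `group_of` mapping are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Count all character ngrams of length n_min..n_max (space-padded, lowercased)."""
    padded = f" {text.lower()} "
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


class NgramNaiveBayes:
    """Assumed stand-in classifier: multinomial naive Bayes over character ngrams."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> ngram counts
        self.docs = Counter(labels)          # label -> number of training texts
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def predict(self, text):
        grams = char_ngrams(text)
        n_docs = sum(self.docs.values())

        def score(label):
            # Log prior plus add-one-smoothed log likelihood of each ngram.
            s = math.log(self.docs[label] / n_docs)
            denom = self.totals[label] + len(self.vocab)
            for gram, k in grams.items():
                s += k * math.log((self.counts[label][gram] + 1) / denom)
            return s

        return max(self.docs, key=score)


def train_two_stage(texts, variants, group_of):
    """Train one group-level classifier, then one variant classifier per group."""
    groups = [group_of[v] for v in variants]
    group_clf = NgramNaiveBayes().fit(texts, groups)
    variant_clfs = {}
    for g in set(groups):
        idx = [i for i, gg in enumerate(groups) if gg == g]
        variant_clfs[g] = NgramNaiveBayes().fit(
            [texts[i] for i in idx], [variants[i] for i in idx]
        )
    return group_clf, variant_clfs


def predict_two_stage(model, text):
    """Predict the group first, then the variant within the predicted group."""
    group_clf, variant_clfs = model
    return variant_clfs[group_clf.predict(text)].predict(text)
```

A flat approach, by contrast, would correspond to calling `NgramNaiveBayes().fit(texts, variants)` directly on the variant labels, so the two setups can be compared on the same features.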

Cyril Goutte | Serge Léger
