Paper Information - Open-Set Language Identification

Open-Set Language Identification

We present the first open-set language identification experiments using one-class classification. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine using only a monolingual corpus for each language. Each model is evaluated against a test set of data from all 10 languages and we achieve an average F-score of 0.99, highlighting the effectiveness of this approach for open-set language identification.
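The hashing-based vectorization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the hash function, the n-gram range, or the vector size, so the choices below (MD5, character 1- to 3-grams, 1024 dimensions, signed updates in the style of Weinberger et al.) and the function names `char_ngrams` and `hash_vectorize` are assumptions made for illustration, not the paper's configuration.

```python
import hashlib
import math


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max (assumed range)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hash_vectorize(text, n_features=2 ** 10, n_min=1, n_max=3):
    """Map text to a fixed-length, L2-normalised vector via the hashing trick.

    No vocabulary is stored: each n-gram is hashed straight to a column index,
    so text in a script never seen before still yields a valid vector.
    """
    vec = [0.0] * n_features
    for gram in char_ngrams(text, n_min, n_max):
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        # The first four bytes of the digest choose the column.
        index = int.from_bytes(digest[:4], "little") % n_features
        # One further bit chooses the sign, which reduces collision bias.
        sign = 1.0 if digest[4] & 1 else -1.0
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # An empty input has no n-grams and is returned as the zero vector.
    return [v / norm for v in vec] if norm > 0 else vec
```

Because the output width is fixed in advance, a model trained on one monolingual corpus can score text from any other language without re-fitting a vocabulary, which is the property an open-set setting needs. Vectors of this kind could then be passed to a one-class classifier such as scikit-learn's `OneClassSVM`, again as an assumption about tooling rather than a description of the authors' setup.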

Shervin Malmasi

[1] Shervin Malmasi, et al. Subdialectal Differences in Sorani Kurdish, 2016, VarDial@COLING.

[2] Preslav Nakov, et al. Discriminating between Similar Languages and Arabic Dialect Identification: A Report on the Third DSL Shared Task, 2016, VarDial@COLING.

[3] Ralf D. Brown, et al. Non-linear Mapping for Improved Identification of 1300+ Languages, 2014, EMNLP.

[4] Brendan T. O'Connor, et al. Demographic Dialectal Variation in Social Media: A Case Study of African-American English, 2016, EMNLP.

[5] Shervin Malmasi, et al. Language Identification using Classifier Ensembles, 2015.

[6] Preslav Nakov, et al. Findings of the VarDial Evaluation Campaign 2017, 2017, VarDial.

[7] Ted E. Dunning, et al. Statistical Identification of Language, 1994.

[8] Kilian Q. Weinberger, et al. Feature hashing for large scale multitask learning, 2009, ICML '09.

[9] Shervin Malmasi, et al. Automatic Language Identification for Persian and Dari texts, 2015.

[10] Bernhard Schölkopf, et al. Estimating the Support of a High-Dimensional Distribution, 2001, Neural Computation.

[11] Preslav Nakov, et al. Overview of the DSL Shared Task 2015, 2015.