Open-Set Language Identification

We present the first open-set language identification experiments using one-class classification. We first highlight the shortcomings of traditional feature extraction methods and propose a hashing-based feature vectorization approach as a solution. Using a dataset of 10 languages from different writing systems, we train a One-Class Support Vector Machine for each language, using only a monolingual corpus of that language. Each model is evaluated against a test set containing data from all 10 languages, and we achieve an average F-score of 0.99, demonstrating the effectiveness of this approach for open-set language identification.
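To illustrate the hashing-based vectorization idea, the following is a minimal sketch of feature hashing over character n-grams, not the authors' implementation. The n-gram range (1 to 3), the bucket count (2**18), the use of MD5 as the hash function, and the function names `char_ngrams` and `hashed_features` are all assumptions chosen for illustration; the abstract does not specify them. The point it shows is that the output dimensionality is fixed in advance, so text in a language or script never seen during training still maps into the same feature space.

```python
import hashlib
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max (assumed range)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def hashed_features(text, n_buckets=2**18, n_min=1, n_max=3):
    """Map text to a sparse {bucket_index: signed_count} vector.

    No vocabulary is stored: each n-gram is hashed straight to a bucket,
    so unseen n-grams need no special handling.
    """
    vec = Counter()
    for gram in char_ngrams(text, n_min, n_max):
        # MD5 is used only as a stable hash (Python's built-in hash() is
        # salted per process); this choice is an assumption of the sketch.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        # A separate byte picks the sign, which reduces collision bias.
        sign = 1 if digest[8] & 1 == 0 else -1
        vec[index] += sign
    # Drop buckets whose signed counts cancelled out to zero.
    return {i: c for i, c in vec.items() if c != 0}


if __name__ == "__main__":
    for sample in ["open-set language identification", "سلام دنیا"]:
        features = hashed_features(sample)
        print(len(features), "non-zero buckets for:", sample)
```

Sparse vectors of this kind could then be passed to a one-class classifier trained on a single language, for example a One-Class SVM as described above; library implementations of both steps exist (such as scikit-learn's `HashingVectorizer` and `OneClassSVM`), though whether the authors used them is not stated here.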
