Sentence-Level Dialects Identification in the Greater China Region

Identifying different varieties of the same language is more challenging than identifying unrelated languages. In this paper, we propose an approach to discriminating among the varieties, or dialects, of Mandarin Chinese used in Mainland China, Hong Kong, Taiwan, Macao, Malaysia, and Singapore, collectively referred to as the Greater China Region (GCR). For dialect identification in the GCR, we find that the commonly used character-level and word-level unigram features are not very effective, because words in these dialects are often ambiguous and context-dependent. To overcome these challenges, we use not only general features such as character-level n-grams but also several new word-level features, including PMI-based and word-alignment-based features. A series of evaluations on both a news dataset and an open-domain dataset from Wikipedia shows the effectiveness of the proposed approach.
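
As a rough illustration of two of the feature families named above, the following Python sketch extracts character-level n-grams and computes pointwise mutual information (PMI) between a word and a dialect label. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the token-level counting scheme, the label codes "CN" and "TW", and the toy sentences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(sentence, n):
    """Return all contiguous character n-grams of a sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def pmi_table(labeled_sentences):
    """Compute PMI(word, label) = log2(p(word, label) / (p(word) * p(label))).

    labeled_sentences is a list of (tokens, label) pairs. Probabilities are
    estimated from token-level counts, which is an assumption of this sketch.
    """
    word_counts = Counter()
    label_counts = Counter()
    joint_counts = Counter()
    for tokens, label in labeled_sentences:
        for tok in tokens:
            word_counts[tok] += 1
            label_counts[label] += 1
            joint_counts[(tok, label)] += 1
    total = sum(word_counts.values())
    pmi = {}
    for (tok, label), joint in joint_counts.items():
        # p(w, l) / (p(w) * p(l)) simplifies to joint * total / (c(w) * c(l)).
        ratio = joint * total / (word_counts[tok] * label_counts[label])
        pmi[(tok, label)] = math.log2(ratio)
    return pmi


# Hypothetical toy corpus: two pre-segmented sentences with dialect labels.
toy = [
    (["软件", "很", "好"], "CN"),
    (["軟體", "很", "好"], "TW"),
]
pmi = pmi_table(toy)
print(char_ngrams("软件很好", 2))
print(pmi[("软件", "CN")], pmi[("很", "CN")])
```

In this toy corpus, a word that occurs under only one label receives a positive PMI with that label, while a word shared equally by both labels receives a PMI of zero, which is the intuition behind using PMI to pick out dialect-specific vocabulary.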
