Congeneric minority languages such as Uyghur and Kazakh are widely used in Central Asia and in the Xinjiang region of China. They share the same characters and even many words, and cannot be distinguished by character encoding, which makes processing these languages difficult. Spoken-style texts such as short messages and online chat messages are short, sparse, noisy, and ungrammatical, and are regarded as the bottleneck of similar-language identification. Here we propose a practical method to automatically distinguish Uyghur from Kazakh in short spoken texts. External online resources are used for data acquisition. By combining a small amount of knowledge with heuristic features, the method works effectively without elaborate, expert-written rules. Using a maximum entropy classifier, our system achieves 98.7% recall and 94.6% precision on Uyghur and 96.5% precision on Kazakh, with each text containing fewer than 14 words. As our system is the first to identify similar languages in short spoken texts, it can serve as a guide for dealing with analogous problems.
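To make the classification step concrete, the following is a minimal sketch of a two-class maximum entropy classifier, which for two labels is equivalent to logistic regression. It is not the system described above: the character n-gram features, the batch gradient-ascent training loop, the hyperparameters, and all function names are assumptions chosen for illustration. The training strings are made-up tokens, not real Uyghur or Kazakh text.

```python
import math
from collections import Counter

def char_ngrams(text, n_max=3):
    """Count character n-grams (n = 1..n_max) of a space-padded string."""
    padded = f" {text} "
    feats = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(weights, feats):
    """P(label = 1 | text) under the two-class maximum entropy model."""
    z = sum(weights.get(k, 0.0) * v for k, v in feats.items())
    return sigmoid(z)

def train_maxent(samples, labels, epochs=200, lr=0.1):
    """Fit feature weights by batch gradient ascent on the conditional log-likelihood."""
    feats = [char_ngrams(s) for s in samples]
    weights = {}
    for _ in range(epochs):
        grad = Counter()
        for f, y in zip(feats, labels):
            # Gradient of the log-likelihood: (observed - expected) feature counts.
            err = y - predict_proba(weights, f)
            for k, v in f.items():
                grad[k] += err * v
        for k, g in grad.items():
            weights[k] = weights.get(k, 0.0) + lr * g
    return weights

# Synthetic toy corpus (hypothetical tokens): label 1 and label 0 stand in
# for two closely related languages that share most characters.
samples = ["tako mira", "sana tako", "mira sana kat",
           "teko mire", "sene teko", "mire sene ket"]
labels = [1, 1, 1, 0, 0, 0]

weights = train_maxent(samples, labels)
for text in ["tako sana", "teko sene"]:
    print(text, round(predict_proba(weights, char_ngrams(text)), 3))
```

Character n-grams are used here only because they remain informative on very short inputs; a real system would substitute its own features and would typically add regularization.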