Similar Language Identification for Uyghur and Kazakh on Short Spoken Texts

Congeneric minority languages such as Uyghur and Kazakh are widely used in Central Asia and in the Xinjiang region of China. They share the same characters and even many words, and cannot be distinguished by character encoding, which makes processing these languages difficult. Spoken texts such as short messages and online chat messages are short, sparse, noisy, and ungrammatical, and are regarded as the bottleneck of similar language identification. Here we propose a practical method to automatically distinguish Uyghur from Kazakh on short spoken texts. External online resources are used for data acquisition. By combining a little prior knowledge with heuristic features, the method works effectively without elaborate, expert-written rules. Using a maximum entropy classifier, our system achieves 98.7% recall and 94.6% precision on Uyghur, and 96.5% precision on Kazakh, on texts containing fewer than 14 words each. As the first system to identify similar languages on short spoken texts, it can serve as a guide for dealing with analogous problems.
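
To make the classification step concrete, the following is a minimal sketch of a maximum entropy (multinomial logistic regression) classifier for short texts, written in plain Python. It is not the system described above: the abstract does not specify the features, so character n-grams are assumed here as a stand-in for the heuristic features, and the training strings and the labels `lang_a` and `lang_b` are synthetic placeholders rather than real Uyghur or Kazakh data. The names `char_ngrams` and `MaxEntClassifier` are likewise hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams (an assumed feature set, not the paper's)."""
    padded = f" {text.strip().lower()} "  # spaces mark word boundaries
    feats = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            feats[padded[i:i + n]] += 1
    return feats


class MaxEntClassifier:
    """Maximum entropy classifier trained with stochastic gradient ascent."""

    def __init__(self, labels, lr=0.1, l2=1e-4, epochs=50):
        self.labels = list(labels)
        self.lr, self.l2, self.epochs = lr, l2, epochs
        self.w = {y: {} for y in self.labels}  # one weight vector per label

    def predict_proba(self, feats):
        # Softmax over per-label linear scores, shifted for numerical stability.
        scores = {
            y: sum(self.w[y].get(f, 0.0) * v for f, v in feats.items())
            for y in self.labels
        }
        top = max(scores.values())
        exps = {y: math.exp(s - top) for y, s in scores.items()}
        z = sum(exps.values())
        return {y: e / z for y, e in exps.items()}

    def fit(self, texts, gold):
        data = [(char_ngrams(t), y) for t, y in zip(texts, gold)]
        for _ in range(self.epochs):
            for feats, y_true in data:
                probs = self.predict_proba(feats)
                for y in self.labels:
                    # Log-likelihood gradient: observed minus expected counts.
                    err = (1.0 if y == y_true else 0.0) - probs[y]
                    for f, v in feats.items():
                        w = self.w[y].get(f, 0.0)
                        # L2 penalty is applied only to features seen in this text.
                        self.w[y][f] = w + self.lr * (err * v - self.l2 * w)
        return self

    def predict(self, text):
        probs = self.predict_proba(char_ngrams(text))
        return max(probs, key=probs.get)


# Synthetic placeholder strings, NOT real language data.
train_texts = ["qaqa qiqi", "qiqa qaq", "zozo zuzu", "zuzo zoz"]
train_labels = ["lang_a", "lang_a", "lang_b", "lang_b"]

clf = MaxEntClassifier(labels=["lang_a", "lang_b"]).fit(train_texts, train_labels)
print(clf.predict("qaqi"), clf.predict("zuzu zo"))
```

With two labels this model is equivalent to binary logistic regression. A real system would replace the placeholder strings with labeled Uyghur and Kazakh messages and would likely add word-level or knowledge-based features alongside the character n-grams.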