A Robust Morpheme Sequence and Convolutional Neural Network-Based Uyghur and Kazakh Short Text Classification

In this paper, we study short text classification for two closely related low-resource languages, Uyghur and Kazakh, using a multilingual morphological analyzer. Online linguistic resources for these languages are generally noisy, so preprocessing is necessary and can significantly improve accuracy. Uyghur and Kazakh are languages with derivational morphology, in which words are formed by concatenating stems with suffixes. In these languages, terms are usually used to represent text content, while functional parts are excluded as stop words. By extracting stems, we can collect the necessary terms and exclude stop words. The morpheme segmentation tool splits text into morphemes with about 95% reliability. After preparing both word-based and morpheme-based training corpora, we apply a convolutional neural network (CNN) as the feature selection and text classification algorithm. Experimental results show that the morpheme-based approach outperforms the word-based approach. Word embeddings are frequently used for text representation, both within neural network frameworks and as numerical representations in their own right. They map language units into a sequential vector space based on context, which offers a natural way to extract and predict out-of-vocabulary (OOV) items from context information. Multilingual morphological analysis provides a convenient way to handle processing tasks for low-resource languages such as Uyghur and Kazakh.
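
The stem-extraction step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a hypothetical segmenter output in which each word is written as `stem+suffix+suffix`, and the names `extract_stems`, `build_vocab`, and `encode`, the placeholder tokens, and the `<pad>`/`<unk>` entries are all illustrative assumptions. The resulting fixed-length ID sequences are the kind of input an embedding layer followed by a CNN would typically consume.

```python
from collections import Counter

def extract_stems(segmented_text, stop_stems=frozenset()):
    """Keep only the stem of each token, dropping stop-word stems.

    Assumes (hypothetically) that the segmenter writes each word as
    'stem+suffix+suffix'; a token without '+' is treated as a bare stem.
    """
    stems = []
    for token in segmented_text.split():
        stem = token.split("+", 1)[0]  # text before the first '+'
        if stem and stem not in stop_stems:
            stems.append(stem)
    return stems

def build_vocab(docs, min_count=1):
    """Map each sufficiently frequent stem to an integer ID.

    IDs 0 and 1 are reserved for padding and unknown stems.
    """
    counts = Counter(stem for doc in docs for stem in doc)
    vocab = {"<pad>": 0, "<unk>": 1}
    for stem in sorted(counts):  # sorted for a deterministic ID order
        if counts[stem] >= min_count:
            vocab[stem] = len(vocab)
    return vocab

def encode(stems, vocab, max_len):
    """Convert stems to a fixed-length ID sequence (truncate or pad)."""
    ids = [vocab.get(s, vocab["<unk>"]) for s in stems[:max_len]]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

# Placeholder tokens, not real Uyghur or Kazakh words.
raw = ["stemA+sfx1 func+sfx2 stemB", "stemA+sfx3"]
docs = [extract_stems(text, stop_stems={"func"}) for text in raw]
vocab = build_vocab(docs)
encoded = [encode(doc, vocab, max_len=4) for doc in docs]
```

Mapping unseen stems to `<unk>` is only one simple way to handle OOV items; the abstract instead points to context-based embeddings for that purpose.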
