Paper information - Classification of Chinese-to-English translated social network timelines using naive Bayes

This study proposes a method that classifies comments from a Chinese social network (Weibo) as positive or negative using a naive Bayes classifier trained on a corpus from an English social network (Twitter). We train the text classifier on the English-language Twitter corpus and then use it to classify Chinese text. In previous research, Chinese sentences are processed with a Chinese word segmentation algorithm before a machine learning algorithm is applied. Unlike English, Chinese is written without spaces between words, and a single word may consist of several characters, so a segmentation algorithm is needed to split each sentence into a series of words. The quality of the word segmentation algorithm therefore influences the accuracy of Chinese text categorization. In our research, we eliminate the Chinese word segmentation stage (a traditional preprocessing stage in Chinese text classification) so that the result no longer depends on the quality of a segmentation algorithm. Instead of segmenting the text, we translate the Chinese text into English with Google Translate. From the Twitter corpus, we directly build a text classifier using the multinomial naive Bayes algorithm. Finally, the classifier labels a new Chinese text (a Weibo post that has been translated into English by Google Translate at the preprocessing stage). We conduct an experiment comparing the accuracy of the multinomial naive Bayes algorithm with that of C4.5.
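The pipeline described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the four training sentences and the two Chinese examples are invented toy data, `translate_zh_to_en` is a hypothetical lookup table standing in for the Google Translate step, and add-one (Laplace) smoothing is an assumption because the abstract does not say which smoothing was used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; English text needs no word segmentation."""
    return text.lower().split()


def train_multinomial_nb(labeled_texts):
    """Collect document counts per class and word counts per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(model, text):
    """Return the class with the highest log posterior under multinomial naive Bayes."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            # Assumption: add-one (Laplace) smoothing
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical stand-in for the machine translation preprocessing step.
TOY_TRANSLATIONS = {
    "我喜欢这部电影": "i love this movie",
    "服务太差了": "the service is terrible",
}


def translate_zh_to_en(text):
    return TOY_TRANSLATIONS[text]


# Invented toy stand-in for the labeled English Twitter corpus.
TOY_TWEETS = [
    ("i love this phone", "positive"),
    ("what a great movie", "positive"),
    ("i hate this terrible traffic", "negative"),
    ("the food is awful", "negative"),
]

model = train_multinomial_nb(TOY_TWEETS)


def classify_weibo(chinese_text):
    """Translate first, then classify; no Chinese word segmentation is involved."""
    return classify(model, translate_zh_to_en(chinese_text))
```

Calling `classify_weibo("我喜欢这部电影")` translates the post to English and scores it against both classes. The Chinese text itself is never tokenized, which is the point of replacing segmentation with translation.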

Zhong-Liang Xiang | Xiang-Ru Yu | Dae-Ki Kang
