A logistic regression-based smoothing method for Chinese text categorization

Automatic Chinese text classification is an important and well-known task in machine learning. The first step in solving Chinese text categorization problems is usually to segment unsegmented sentences into words. Previous work often employs a Chinese word tokenizer that was trained on a different source and then applies conventional text classification approaches. However, these tokenizers are not perfect and often provide incorrect word boundary information. In this paper, we propose an N-gram-based language model that takes word relations into account and performs Chinese text categorization without a Chinese word tokenizer. To mitigate the out-of-vocabulary problem, we also propose a novel smoothing approach based on logistic regression to improve accuracy. The experimental results show that our approach outperforms traditional methods by at least 11% in micro-averaged F-measure.
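The general idea of classifying Chinese text with per-category N-gram language models, without any word tokenizer, can be illustrated with a minimal sketch. It assumes character bigrams, uniform class priors, and plain additive smoothing as a placeholder; the logistic regression-based smoothing proposed in the paper is not specified in this abstract and is not reproduced here. The class name `BigramLMClassifier`, the helper `char_bigrams`, and the parameter `alpha` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return overlapping character bigrams; no word segmentation is needed."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramLMClassifier:
    """One character-bigram language model per category (illustrative only).

    Smoothing is additive (add-alpha), used purely as a stand-in. It is NOT
    the logistic regression-based smoothing proposed in the paper.
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.bigram_counts = defaultdict(Counter)   # category -> bigram -> count
        self.context_counts = defaultdict(Counter)  # category -> first char -> count
        self.vocab = set()                          # all characters seen in training

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            self.vocab.update(text)
            for bg in char_bigrams(text):
                self.bigram_counts[label][bg] += 1
                self.context_counts[label][bg[0]] += 1
        return self

    def log_prob(self, text, label):
        """Smoothed log-likelihood of the text under the model of one category."""
        v = len(self.vocab) + 1  # +1 reserves probability mass for unseen characters
        total = 0.0
        for bg in char_bigrams(text):
            num = self.bigram_counts[label][bg] + self.alpha
            den = self.context_counts[label][bg[0]] + self.alpha * v
            total += math.log(num / den)
        return total

    def predict(self, text):
        # Uniform class priors are assumed, so only the likelihoods are compared.
        return max(self.bigram_counts, key=lambda c: self.log_prob(text, c))
```

Trained on a handful of labeled sentences, `predict` assigns a new sentence to the category whose language model gives it the highest smoothed likelihood. Because the models operate on raw characters, errors in word boundary detection cannot propagate into the classifier; the smoothing step is what handles bigrams that never occurred in the training data of a category.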
