Distributional Character Clustering for Chinese Text Categorization

A novel feature generation method for Chinese text categorization, distributional character clustering, is presented and experimentally evaluated. The method avoids word segmentation. We propose a hybrid clustering criterion function and a bisecting divisive clustering algorithm to improve the quality of the clusters. The experimental results show that distributional character clustering is an effective dimensionality reduction method: it reduces the feature space to very low dimensionality (e.g., 500 features) while maintaining high performance, and it performs much better than feature selection by information gain. Moreover, a Naive Bayes classifier with distributional character clustering achieves state-of-the-art performance in Chinese text classification.
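The general idea of distributional clustering over characters can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not define the hybrid criterion function, so the sketch assumes a simpler criterion, the frequency-weighted KL divergence of each character's class distribution P(class | character) from its cluster centroid. All function names (`char_class_distributions`, `bisect`, `cluster_characters`) and the seeding rule are hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def char_class_distributions(docs, labels):
    """Estimate P(class | char) and each character's share of all occurrences."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    counts = defaultdict(lambda: [0] * len(classes))
    for text, label in zip(docs, labels):
        for ch in text:  # characters are the features: no word segmentation
            if not ch.isspace():
                counts[ch][index[label]] += 1
    total = sum(sum(v) for v in counts.values())
    weight = {ch: sum(v) / total for ch, v in counts.items()}
    dist = {ch: [x / sum(v) for x in v] for ch, v in counts.items()}
    return weight, dist


def kl(p, q):
    """KL divergence KL(p || q); infinite if q lacks mass where p has some."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total


def centroid(cluster, weight, dist):
    """Frequency-weighted mean of the member distributions."""
    mass = sum(weight[ch] for ch in cluster)
    dims = len(dist[cluster[0]])
    return [sum(weight[ch] * dist[ch][i] for ch in cluster) / mass
            for i in range(dims)]


def scatter(cluster, weight, dist):
    """Assumed criterion: weighted KL divergence of members from the centroid."""
    c = centroid(cluster, weight, dist)
    return sum(weight[ch] * kl(dist[ch], c) for ch in cluster)


def bisect(cluster, weight, dist, max_iter=20):
    """Split one cluster in two with a 2-means-style loop under KL divergence."""
    c = centroid(cluster, weight, dist)
    seed_a = max(cluster, key=lambda ch: kl(dist[ch], c))

    def gap(ch):  # finite dissimilarity between ch and seed_a
        mid = [(x + y) / 2 for x, y in zip(dist[ch], dist[seed_a])]
        return kl(dist[ch], mid)

    seed_b = max(cluster, key=gap)
    centers = [dist[seed_a], dist[seed_b]]
    parts = None
    for _ in range(max_iter):
        new_parts = [[], []]
        for ch in cluster:
            d0, d1 = kl(dist[ch], centers[0]), kl(dist[ch], centers[1])
            new_parts[0 if d0 <= d1 else 1].append(ch)
        if not new_parts[0] or not new_parts[1] or new_parts == parts:
            break
        parts = new_parts
        centers = [centroid(p, weight, dist) for p in parts]
    # Fallback for degenerate input (all members share one distribution).
    return parts if parts else [cluster[:1], cluster[1:]]


def cluster_characters(docs, labels, k):
    """Bisecting divisive clustering: repeatedly split the worst cluster."""
    weight, dist = char_class_distributions(docs, labels)
    clusters = [list(dist)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        worst = max(splittable, key=lambda c: scatter(c, weight, dist))
        clusters.remove(worst)
        clusters.extend(bisect(worst, weight, dist))
    return clusters
```

Under these assumptions, calling `cluster_characters(docs, labels, 500)` on a labeled corpus would yield at most 500 character clusters. Each cluster can then serve as a single feature for a classifier such as Naive Bayes, with a cluster's count in a document being the sum of its member characters' counts.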
